Statistics for High-Dimensional Data
Preface
Contents
Chapter 1 Introduction
1.1 The framework
1.2 The possibilities and challenges
1.3 About the book
1.3.1 Organization of the book
1.4 Some examples
1.4.1 Prediction and biomarker discovery in genomics
1.4.1.1 Further biology applications treated in the book
Chapter 2 Lasso for linear models
2.1 Organization of the chapter
2.2 Introduction and preliminaries
2.2.1 The Lasso estimator
2.2.1.1 Estimation of the error variance
2.3 Orthonormal design
2.4 Prediction
2.4.1 Practical aspects about the Lasso for prediction
2.4.1.1 Binary classification of lymph node status using gene expressions
2.4.2 Some results from asymptotic theory
2.5 Variable screening and ‖β̂ − β⁰‖_q-norms
2.6 Variable selection
2.6.1 Neighborhood stability and irrepresentable condition
2.7 Key properties and corresponding assumptions: a summary
2.8 The adaptive Lasso: a two-stage procedure
2.8.1 An illustration: simulated data and motif regression
2.8.2 Orthonormal design
2.8.3 The adaptive Lasso: variable selection under weak conditions
2.8.4 Computation
2.8.5 Multi-step adaptive Lasso
2.8.5.1 Motif regression in computational biology
2.8.6 Non-convex penalty functions
2.9 Thresholding the Lasso
2.10 The relaxed Lasso
2.11 Degrees of freedom of the Lasso
2.12 Path-following algorithms
2.12.1 Coordinatewise optimization and shooting algorithms
2.12.1.1 Active set strategy
2.13 Elastic net: an extension
Problems
Chapter 3 Generalized linear models and the Lasso
3.1 Organization of the chapter
3.2 Introduction and preliminaries
3.2.1 The Lasso estimator: penalizing the negative log-likelihood
3.3 Important examples of generalized linear models
3.3.1 Binary response variable and logistic regression
3.3.2 Poisson regression
3.3.3 Multi-category response variable and multinomial distribution
3.3.3.1 Contingency tables
Problems
Chapter 4 The group Lasso
4.1 Organization of the chapter
4.2 Introduction and preliminaries
4.2.1 The group Lasso penalty
4.3 Factor variables as covariates
4.3.1 Prediction of splice sites in DNA sequences
4.4 Properties of the group Lasso for generalized linear models
4.5 The generalized group Lasso penalty
4.5.1 Groupwise prediction penalty and parametrization invariance
4.6 The adaptive group Lasso
4.7 Algorithms for the group Lasso
4.7.1 Block coordinate descent
4.7.1.1 Squared error loss
4.7.1.2 Active set strategy
4.7.1.3 General convex loss
4.7.2 Block coordinate gradient descent
Problems
Chapter 5 Additive models and many smooth univariate functions
5.1 Organization of the chapter
5.2 Introduction and preliminaries
5.2.1 Penalized maximum likelihood for additive models
5.3 The sparsity-smoothness penalty
5.3.1 Orthogonal basis and diagonal smoothing matrices
5.3.2 Natural cubic splines and Sobolev spaces
5.3.3 Computation
5.3.3.1 Determining the zero estimates
5.3.3.2 Updates for the non-zero estimates
5.4 A sparsity-smoothness penalty of group Lasso type
5.4.1 Computational algorithm
5.4.2 Alternative approaches
5.5 Numerical examples
5.5.1 Simulated example
5.5.2 Motif regression
5.6 Prediction and variable selection
5.7 Generalized additive models
5.8 Linear model with varying coefficients
5.8.1 Properties for prediction
5.8.2 Multivariate linear model
5.9 Multitask learning
Problems
Chapter 6 Theory for the Lasso
6.1 Organization of this chapter
6.2 Least squares and the Lasso
6.2.1 Introduction
6.2.2 The result assuming the truth is linear
6.2.3 Linear approximation of the truth
6.2.4 A further refinement: handling smallish coefficients
6.3 The setup for general convex loss
6.4 The margin condition
6.5 Generalized linear model without penalty
6.6 Consistency of the Lasso for general loss
6.7 An oracle inequality
6.8 The ℓq-error for 1 ≤ q ≤ 2
6.9 The weighted Lasso
6.10 The adaptively weighted Lasso
6.11 Concave penalties
6.11.1 Sparsity oracle inequalities for least squares with ℓr-penalty
6.11.2 Proofs for this section (Section 6.11)
6.12 Compatibility and (random) matrices
6.13 On the compatibility condition
6.13.1 Direct bounds for the compatibility constant
6.13.2 Bounds using ‖β_S‖₁² ≤ s‖β_S‖₂²
6.13.3 Sets
6.13.4 Restricted isometry
6.13.5 Sparse eigenvalues
6.13.6 Further coherence notions
6.13.7 An overview of the various eigenvalue flavored constants
Problems
Chapter 7 Variable selection with the Lasso
7.1 Introduction
7.2 Some results from literature
7.3 Organization of this chapter
7.4 The beta-min condition
7.5 The irrepresentable condition in the noiseless case
7.5.1 Definition of the irrepresentable condition
7.5.2 The KKT conditions
7.5.3 Necessity and sufficiency for variable selection
7.5.4 The irrepresentable condition implies the compatibility condition
7.5.5 The irrepresentable condition and restricted regression
7.5.6 Selecting a superset of the true active set
7.5.7 The weighted irrepresentable condition
7.5.8 The weighted irrepresentable condition and restricted regression
7.5.9 The weighted Lasso with “ideal” weights
7.6 Definition of the adaptive and thresholded Lasso
7.6.1 Definition of adaptive Lasso
7.6.2 Definition of the thresholded Lasso
7.6.3 Order symbols
7.7 A recollection of the results obtained in Chapter 6
7.8 The adaptive Lasso and thresholding: invoking sparse eigenvalues
7.8.1 The conditions on the tuning parameters
7.8.2 The results
7.8.3 Comparison with the Lasso
7.8.4 Comparison between adaptive and thresholded Lasso
7.8.5 Bounds for the number of false negatives
7.8.6 Imposing beta-min conditions
7.9 The adaptive Lasso without invoking sparse eigenvalues
7.9.1 The condition on the tuning parameter
7.9.2 The results
7.10 Some concluding remarks
7.11 Technical complements for the noiseless case without sparse eigenvalues
7.11.1 Prediction error for the noiseless (weighted) Lasso
7.11.2 The number of false positives of the noiseless (weighted) Lasso
7.11.3 Thresholding the noiseless initial estimator
7.11.4 The noiseless adaptive Lasso
7.12 Technical complements for the noisy case without sparse eigenvalues
7.13 Selection with concave penalties
Problems
Chapter 8 Theory for ℓ1/ℓ2-penalty procedures
8.1 Introduction
8.2 Organization and notation of this chapter
8.3 Regression with group structure
8.3.1 The loss function and penalty
8.3.2 The empirical process
8.3.3 The group Lasso compatibility condition
8.3.4 A group Lasso sparsity oracle inequality
8.3.5 Extensions
8.4 High-dimensional additive model
8.4.1 The loss function and penalty
8.4.2 The empirical process
8.4.2.1 Sobolev smoothness
8.4.2.2 Diagonalized smoothness
8.4.3 The smoothed Lasso compatibility condition
8.4.4 A smoothed group Lasso sparsity oracle inequality
8.4.5 On the choice of the penalty
8.4.5.1 Using only one of the two terms
8.4.5.2 Taking a single square root
8.4.5.3 Taking the squared instead of the non-squared smoothness norm
8.4.5.4 Taking a single square root and adding a squared smoothness norm
8.5 Linear model with time-varying coefficients
8.5.1 The loss function and penalty
8.5.2 The empirical process
8.5.3 The compatibility condition for the time-varying coefficients model
8.5.4 A sparsity oracle inequality for the time-varying coefficients model
8.6 Multivariate linear model and multitask learning
8.6.1 The loss function and penalty
8.6.2 The empirical process
8.6.3 The multitask compatibility condition
8.6.4 A multitask sparsity oracle inequality
8.7 The approximation condition for the smoothed group Lasso
8.7.1 Sobolev smoothness
8.7.2 Diagonalized smoothness
Problems
Chapter 9 Non-convex loss functions and ℓ1-regularization
9.1 Organization of the chapter
9.2 Finite mixture of regressions model
9.2.1 Finite mixture of Gaussian regressions model
9.2.1.1 Reparametrized mixture of regressions model
9.2.2 ℓ1-penalized maximum likelihood estimator
9.2.2.1 ℓ1-penalization for reparametrized linear models
9.2.2.2 ℓ1-penalized MLE for mixture of Gaussian regressions
9.2.3 Properties of the ℓ1-penalized maximum likelihood estimator
9.2.4 Selection of the tuning parameters
9.2.5 Adaptive penalization
9.2.6 Riboflavin production with Bacillus subtilis
9.2.7 Simulated example
9.2.8 Numerical optimization
9.2.9 GEM algorithm for optimization
9.2.9.1 Numerical Convergence of the BCD-GEM algorithm
9.2.10 Proof of Proposition 9.2
9.3 Linear mixed effects models
9.3.1 The model and penalized estimation
9.3.2 The Lasso in linear mixed effects models
9.3.3 Estimation of the random effects coefficients
9.3.4 Selection of the regularization parameter
9.3.5 Properties of the Lasso in linear mixed effects models
9.3.6 Adaptive penalized maximum likelihood estimator
9.3.7 Computational algorithm
9.3.8 Numerical results
9.3.8.1 Application: Riboflavin production data
9.4 Theory for ℓ1-penalization with non-convex negative log-likelihood
9.4.1 The setting and notation
9.4.1.1 The margin
9.4.1.2 The empirical process
9.4.2 Oracle inequality for the Lasso for non-convex loss functions
9.4.3 Theory for finite mixture of regressions models
9.4.3.1 Oracle result for FMR models
9.4.4 Theory for linear mixed effects models
9.5 Proofs for Section 9.4
9.5.1 Proof of Lemma 9.1
9.5.2 Proof of Lemma 9.2
9.5.3 Proof of Theorem 9.1
9.5.4 Proof of Lemma 9.3
Problems
Chapter 10 Stable solutions
10.1 Organization of the chapter
10.2 Introduction, stability and subsampling
10.2.1 Stability paths for linear models
10.2.1.1 Riboflavin production with Bacillus subtilis
10.2.1.2 Motif regression in computational biology
10.3 Stability selection
10.3.1 Choice of regularization and error control
10.3.1.1 Pointwise control
10.3.1.2 Examples of procedures choosing
10.3.1.3 The exchangeability condition and the generality of Theorem 10.1
10.4 Numerical results
10.5 Extensions
10.5.1 Randomized Lasso
10.6 Improvements from a theoretical perspective
10.7 Proofs
10.7.1 Sample splitting
10.7.2 Proof of Theorem 10.1
Problems
Chapter 11 P-values for linear models and beyond
11.1 Organization of the chapter
11.2 Introduction, sample splitting and high-dimensional variable selection
11.3 Multi sample splitting and familywise error control
11.3.1 Aggregation over multiple p-values
11.3.2 Control of familywise error
11.4 Multi sample splitting and false discovery rate
11.4.1 Control of false discovery rate
11.5 Numerical results
11.5.1 Simulations and familywise error control
11.5.1.1 Comparisons with adaptive Lasso
11.5.2 Familywise error control for motif regression in computational biology
11.5.3 Simulations and false discovery rate control
11.6 Consistent variable selection
11.6.1 Single sample split method
11.6.2 Multi sample split method
11.7 Extensions
11.7.1 Other models
11.7.2 Control of expected false positive selections
11.8 Proofs
11.8.1 Proof of Proposition 11.1
11.8.2 Proof of Theorem 11.1
11.8.3 Proof of Theorem 11.2
11.8.4 Proof of Proposition 11.2
11.8.5 Proof of Lemma 11.3
Problems
Chapter 12 Boosting and greedy algorithms
12.1 Organization of the chapter
12.2 Introduction and preliminaries
12.2.1 Ensemble methods: multiple prediction and aggregation
12.2.2 AdaBoost
12.3 Gradient boosting: a functional gradient descent algorithm
12.3.1 The generic FGD algorithm
12.3.1.1 Alternative formulation in function space
12.4 Some loss functions and boosting algorithms
12.4.1 Regression
12.4.2 Binary classification
12.4.3 Poisson regression
12.4.4 Two important boosting algorithms
12.4.4.1 L2Boosting
12.4.4.2 BinomialBoosting: the FGD version of LogitBoost
12.4.5 Other data structures and models
12.5 Choosing the base procedure
12.5.1 Componentwise linear least squares for generalized linear models
12.5.2 Componentwise smoothing spline for additive models
12.5.3 Trees
12.5.4 The low-variance principle
12.5.5 Initialization of boosting
12.6 L2Boosting
12.6.1 Nonparametric curve estimation: some basic insights about boosting
12.6.1.1 L2Boosting using kernel estimators
12.6.2 L2Boosting for high-dimensional linear models
12.6.2.1 Connections to the Lasso
12.6.2.2 Asymptotic consistency in high dimensions
12.7 Forward selection and orthogonal matching pursuit
12.7.1 Linear models and squared error loss
12.7.1.1 Orthogonal matching pursuit
12.8 Proofs
12.8.1 Proof of Theorem 12.1
12.8.2 Proof of Theorem 12.2
12.8.3 Proof of Theorem 12.3
Problems
Chapter 13 Graphical modeling
13.1 Organization of the chapter
13.2 Preliminaries about graphical models
13.3 Undirected graphical models
13.3.1 Markov properties for undirected graphs
13.4 Gaussian graphical models
13.4.1 Penalized estimation for covariance matrix and edge set
13.4.1.1 Stability selection with the GLasso
13.4.2 Nodewise regression
13.4.3 Covariance estimation based on undirected graph
13.4.3.1 Gene expressions from isoprenoid biosynthesis pathway in Arabidopsis thaliana
13.5 Ising model for binary random variables
13.6 Faithfulness assumption
13.6.1 Failure of faithfulness
13.6.1.1 Cancellation of regression coefficients
13.6.1.2 Moving-average model
13.6.2 Faithfulness and Gaussian graphical models
13.7 The PC-algorithm: an iterative estimation method
13.7.1 Population version of the PC-algorithm
13.7.2 Sample version for the PC-algorithm
13.7.2.1 Computational complexity
13.8 Consistency for high-dimensional data
13.8.1 An illustration
13.8.2 Theoretical analysis of the PC-algorithm
13.8.2.1 Analysis of partial correlations
13.8.2.2 Proof of Theorem 13.1
13.9 Back to linear models
13.9.1 Partial faithfulness
13.9.1.1 Sufficient condition for partial faithfulness
13.9.2 The PC-simple algorithm
13.9.2.1 Population version of the PC-simple algorithm
13.9.2.2 Sample version of the PC-simple algorithm
13.9.3 Numerical results
13.9.3.1 Real data: riboflavin production by Bacillus subtilis
13.9.3.2 Simulated data
13.9.4 Asymptotic results in high dimensions
13.9.4.1 Consistency of the PC-simple algorithm
13.9.4.2 Discussion of the conditions of Theorem 13.3
13.9.5 Correlation screening (sure independence screening)
13.9.5.1 Asymptotic behavior of correlation screening
13.9.6 Proofs
13.9.6.1 Proof of Theorem 13.2
13.9.6.2 Proof of Proposition 13.6
13.9.6.3 Proof of Theorem 13.3
13.9.6.4 Proof of Theorem 13.4
Problems
Chapter 14 Probability and moment inequalities
14.1 Organization of this chapter
14.2 Some simple results for a single random variable
14.2.1 Sub-exponential random variables
14.2.2 Sub-Gaussian random variables
14.2.3 Jensen’s inequality for partly concave functions
14.3 Bernstein’s inequality
14.4 Hoeffding’s inequality
14.5 The maximum of averages
14.5.1 Using Bernstein’s inequality
14.5.2 Using Hoeffding’s inequality
14.5.3 Having sub-Gaussian random variables
14.6 Concentration inequalities
14.6.1 Bousquet’s inequality
14.6.2 Massart’s inequality
14.6.3 Sub-Gaussian random variables
14.7 Symmetrization and contraction
14.8 Concentration inequalities for Lipschitz loss functions
14.9 Concentration for squared error loss with random design
14.9.1 The inner product of noise and linear functions
14.9.2 Squared linear functions
14.9.3 Squared error loss
14.10 Assuming only lower order moments
14.10.1 Nemirovski moment inequality
14.10.2 A uniform inequality for quadratic forms
14.11 Using entropy for concentration in the sub-Gaussian case
14.12 Some entropy results
14.12.1 Entropy of finite-dimensional spaces and general convex hulls
14.12.2 Sets with restrictions on the coefficients
14.12.3 Convex hulls of small sets: entropy with log-term
14.12.4 Convex hulls of small sets: entropy without log-term
14.12.5 Further refinements
14.12.6 An example: functions with -th derivative of bounded variation
14.12.7 Proofs for this section (Section 14.12)
Problems
Author Index
Index
References
Springer Series in Statistics
Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger
For other titles published in this series, go to http://www.springer.com/series/692
Peter Bühlmann · Sara van de Geer
Statistics for High-Dimensional Data
Methods, Theory and Applications
Peter Bühlmann
Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland
buhlmann@stat.math.ethz.ch

Sara van de Geer
Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland
geer@stat.math.ethz.ch

ISSN 0172-7397
ISBN 978-3-642-20191-2        e-ISBN 978-3-642-20192-9
DOI 10.1007/978-3-642-20192-9
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011930793

© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: deblik, Berlin
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Anthony and Sigi and Tabea, Anna, Sophia, Simon and Lukas
Preface

High-dimensional data are nowadays the rule rather than the exception in areas like information technology, bioinformatics or astronomy, to name just a few. The word "high-dimensional" refers to the situation where the number of unknown parameters to be estimated is one or several orders of magnitude larger than the number of samples in the data. Classical statistical inference cannot be used for high-dimensional problems. For example, least-squares fitting of a linear model with many more unknown parameters than observations, together with corresponding standard errors and measures of significance, is ill-posed. It is rather obvious that without additional assumptions, say restricting to a certain class of models, high-dimensional statistical inference is impossible. A well-established framework for fitting many parameters is based on assuming structural smoothness, enabling estimation of smooth functions. Recent years have witnessed a revolution of methodological, computational and mathematical advances which allow for high-dimensional statistical inference based on assuming certain notions of sparsity. Shifting the focus from smoothness to sparsity constraints, or combining the two, opens the path for many more applications involving complex data. For example, the sparsity assumption that the health status of a person depends only on a few among several thousands of biomarkers appears much more realistic than a model where all the thousands of variables would contribute in a smooth way to the state of health.

This book brings together methodological concepts, computational algorithms, a few applications and mathematical theory for high-dimensional statistics. The mathematical underpinning of methodology and computing has implications for exploring exciting possibilities and understanding fundamental limitations. In this sense, the combination of methodology and theory builds the foundation of the book. We present the methods and their potential for data analysis with a view on the underlying mathematical assumptions and properties; vice versa, the theoretical derivations are motivated by applicability and implications for real data problems. The mathematical results yield additional insights and allow us to categorize different methods and algorithms in terms of what they can and cannot achieve. The book
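To make the preceding point about sparsity concrete, the following short Python sketch simulates a linear model with far more parameters than observations and a sparse true coefficient vector, then fits an ℓ1-penalized (Lasso) regression. This example is not taken from the book; the sample sizes, signal strength, and the use of scikit-learn's LassoCV with cross-validated tuning are illustrative assumptions only.

# Minimal illustrative sketch (not from the book): a p >> n linear model with a
# sparse coefficient vector. Ordinary least squares is ill-posed here (the design
# matrix has rank at most n < p), while the l1-penalized (Lasso) estimator
# exploits sparsity. All settings (n, p, s0, noise level) are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s0 = 50, 1000, 5                      # far more parameters than samples
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s0] = 2.0                        # only s0 variables are truly active
y = X @ beta_true + rng.standard_normal(n)  # noisy observations

# Regularization parameter chosen by 5-fold cross-validation.
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_)

print("number of selected variables:", selected.size)
print("true actives among selected:", np.intersect1d(selected, np.arange(s0)).size)

Under such a sparsity assumption the Lasso typically recovers the handful of truly active variables, which is the kind of behavior the methods and theory in the chapters listed above make precise.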