Statistics for High-Dimensional Data
Preface
Contents
Chapter 1 Introduction
1.1 The framework
1.2 The possibilities and challenges
1.3 About the book
1.3.1 Organization of the book
1.4 Some examples
1.4.1 Prediction and biomarker discovery in genomics
1.4.1.1 Further biology applications treated in the book
Chapter 2 Lasso for linear models
2.1 Organization of the chapter
2.2 Introduction and preliminaries
2.2.1 The Lasso estimator
2.2.1.1 Estimation of the error variance
2.3 Orthonormal design
2.4 Prediction
2.4.1 Practical aspects about the Lasso for prediction
2.4.1.1 Binary classification of lymph node status using gene expressions
2.4.2 Some results from asymptotic theory
2.5 Variable screening and ||β̂ - β0||q-norms
2.6 Variable selection
2.6.1 Neighborhood stability and irrepresentable condition
2.7 Key properties and corresponding assumptions: a summary
2.8 The adaptive Lasso: a two-stage procedure
2.8.1 An illustration: simulated data and motif regression
2.8.2 Orthonormal design
2.8.3 The adaptive Lasso: variable selection under weak conditions
2.8.4 Computation
2.8.5 Multi-step adaptive Lasso
2.8.5.1 Motif regression in computational biology
2.8.6 Non-convex penalty functions
2.9 Thresholding the Lasso
2.10 The relaxed Lasso
2.11 Degrees of freedom of the Lasso
2.12 Path-following algorithms
2.12.1 Coordinatewise optimization and shooting algorithms
2.12.1.1 Active set strategy
2.13 Elastic net: an extension
Problems
Chapter 3 Generalized linear models and the Lasso
3.1 Organization of the chapter
3.2 Introduction and preliminaries
3.2.1 The Lasso estimator: penalizing the negative log-likelihood
3.3 Important examples of generalized linear models
3.3.1 Binary response variable and logistic regression
3.3.2 Poisson regression
3.3.3 Multi-category response variable and multinomial distribution
3.3.3.1 Contingency tables
Problems
Chapter 4 The group Lasso
4.1 Organization of the chapter
4.2 Introduction and preliminaries
4.2.1 The group Lasso penalty
4.3 Factor variables as covariates
4.3.1 Prediction of splice sites in DNA sequences
4.4 Properties of the group Lasso for generalized linear models
4.5 The generalized group Lasso penalty
4.5.1 Groupwise prediction penalty and parametrization invariance
4.6 The adaptive group Lasso
4.7 Algorithms for the group Lasso
4.7.1 Block coordinate descent
4.7.1.1 Squared error loss
4.7.1.2 Active set strategy
4.7.1.3 General convex loss
4.7.2 Block coordinate gradient descent
Problems
Chapter 5 Additive models and many smooth univariate functions
5.1 Organization of the chapter
5.2 Introduction and preliminaries
5.2.1 Penalized maximum likelihood for additive models
5.3 The sparsity-smoothness penalty
5.3.1 Orthogonal basis and diagonal smoothing matrices
5.3.2 Natural cubic splines and Sobolev spaces
5.3.3 Computation
5.3.3.1 Determining the zero estimates
5.3.3.2 Updates for the non-zero estimates
5.4 A sparsity-smoothness penalty of group Lasso type
5.4.1 Computational algorithm
5.4.2 Alternative approaches
5.5 Numerical examples
5.5.1 Simulated example
5.5.2 Motif regression
5.6 Prediction and variable selection
5.7 Generalized additive models
5.8 Linear model with varying coefficients
5.8.1 Properties for prediction
5.8.2 Multivariate linear model
5.9 Multitask learning
Problems
Chapter 6 Theory for the Lasso
6.1 Organization of this chapter
6.2 Least squares and the Lasso
6.2.1 Introduction
6.2.2 The result assuming the truth is linear
6.2.3 Linear approximation of the truth
6.2.4 A further refinement: handling smallish coefficients
6.3 The setup for general convex loss
6.4 The margin condition
6.5 Generalized linear model without penalty
6.6 Consistency of the Lasso for general loss
6.7 An oracle inequality
6.8 The ℓq-error for 1 ≤ q ≤ 2
6.9 The weighted Lasso
6.10 The adaptively weighted Lasso
6.11 Concave penalties
6.11.1 Sparsity oracle inequalities for least squares with ℓr-penalty
6.11.2 Proofs for this section (Section 6.11)
6.12 Compatibility and (random) matrices
6.13 On the compatibility condition
6.13.1 Direct bounds for the compatibility constant
6.13.2 Bounds using ||βS||1 ≤ √s ||βS||2
6.13.3 Sets N containing S
6.13.4 Restricted isometry
6.13.5 Sparse eigenvalues
6.13.6 Further coherence notions
6.13.7 An overview of the various eigenvalue flavored constants
Problems
Chapter 7 Variable selection with the Lasso
7.1 Introduction
7.2 Some results from literature
7.3 Organization of this chapter
7.4 The beta-min condition
7.5 The irrepresentable condition in the noiseless case
7.5.1 Definition of the irrepresentable condition
7.5.2 The KKT conditions
7.5.3 Necessity and sufficiency for variable selection
7.5.4 The irrepresentable condition implies the compatibility condition
7.5.5 The irrepresentable condition and restricted regression
7.5.6 Selecting a superset of the true active set
7.5.7 The weighted irrepresentable condition
7.5.8 The weighted irrepresentable condition and restricted regression
7.5.9 The weighted Lasso with “ideal” weights
7.6 Definition of the adaptive and thresholded Lasso
7.6.1 Definition of adaptive Lasso
7.6.2 Definition of the thresholded Lasso
7.6.3 Order symbols
7.7 A recollection of the results obtained in Chapter 6
7.8 The adaptive Lasso and thresholding: invoking sparse eigenvalues
7.8.1 The conditions on the tuning parameters
7.8.2 The results
7.8.3 Comparison with the Lasso
7.8.4 Comparison between adaptive and thresholded Lasso
7.8.5 Bounds for the number of false negatives
7.8.6 Imposing beta-min conditions
7.9 The adaptive Lasso without invoking sparse eigenvalues
7.9.1 The condition on the tuning parameter
7.9.2 The results
7.10 Some concluding remarks
7.11 Technical complements for the noiseless case without sparse eigenvalues
7.11.1 Prediction error for the noiseless (weighted) Lasso
7.11.2 The number of false positives of the noiseless (weighted) Lasso
7.11.3 Thresholding the noiseless initial estimator
7.11.4 The noiseless adaptive Lasso
7.12 Technical complements for the noisy case without sparse eigenvalues
7.13 Selection with concave penalties
Problems
Chapter 8 Theory for ℓ1/ℓ2-penalty procedures
8.1 Introduction
8.2 Organization and notation of this chapter
8.3 Regression with group structure
8.3.1 The loss function and penalty
8.3.2 The empirical process
8.3.3 The group Lasso compatibility condition
8.3.4 A group Lasso sparsity oracle inequality
8.3.5 Extensions
8.4 High-dimensional additive model
8.4.1 The loss function and penalty
8.4.2 The empirical process
8.4.2.1 Sobolev smoothness
8.4.2.2 Diagonalized smoothness
8.4.3 The smoothed Lasso compatibility condition
8.4.4 A smoothed group Lasso sparsity oracle inequality
8.4.5 On the choice of the penalty
8.4.5.1 Using only one of the two terms
8.4.5.2 Taking a single square root
8.4.5.3 Taking the squared instead of the non-squared smoothness norm
8.4.5.4 Taking a single square root and adding a squared smoothness norm
8.5 Linear model with time-varying coefficients
8.5.1 The loss function and penalty
8.5.2 The empirical process
8.5.3 The compatibility condition for the time-varying coefficients model
8.5.4 A sparsity oracle inequality for the time-varying coefficients model
8.6 Multivariate linear model and multitask learning
8.6.1 The loss function and penalty
8.6.2 The empirical process
8.6.3 The multitask compatibility condition
8.6.4 A multitask sparsity oracle inequality
8.7 The approximation condition for the smoothed group Lasso
8.7.1 Sobolev smoothness
8.7.2 Diagonalized smoothness
Problems
Chapter 9 Non-convex loss functions and ℓ1-regularization
9.1 Organization of the chapter
9.2 Finite mixture of regressions model
9.2.1 Finite mixture of Gaussian regressions model
9.2.1.1 Reparametrized mixture of regressions model
9.2.2 ℓ1-penalized maximum likelihood estimator
9.2.2.1 ℓ1-penalization for reparametrized linear models
9.2.2.2 ℓ1-penalized MLE for mixture of Gaussian regressions
9.2.3 Properties of the ℓ1-penalized maximum likelihood estimator
9.2.4 Selection of the tuning parameters
9.2.5 Adaptive ℓ1-penalization
9.2.6 Riboflavin production with Bacillus subtilis
9.2.7 Simulated example
9.2.8 Numerical optimization
9.2.9 GEM algorithm for optimization
9.2.9.1 Numerical convergence of the BCD-GEM algorithm
9.2.10 Proof of Proposition 9.2
9.3 Linear mixed effects models
9.3.1 The model and penalized estimation
9.3.2 The Lasso in linear mixed effects models
9.3.3 Estimation of the random effects coefficients
9.3.4 Selection of the regularization parameter
9.3.5 Properties of the Lasso in linear mixed effects models
9.3.6 Adaptive penalized maximum likelihood estimator
9.3.7 Computational algorithm
9.3.8 Numerical results
9.3.8.1 Application: Riboflavin production data
9.4 Theory for ℓ1-penalization with non-convex negative log-likelihood
9.4.1 The setting and notation
9.4.1.1 The margin
9.4.1.2 The empirical process
9.4.2 Oracle inequality for the Lasso for non-convex loss functions
9.4.3 Theory for finite mixture of regressions models
9.4.3.1 Oracle result for FMR models
9.4.4 Theory for linear mixed effects models
9.5 Proofs for Section 9.4
9.5.1 Proof of Lemma 9.1
9.5.2 Proof of Lemma 9.2
9.5.3 Proof of Theorem 9.1
9.5.4 Proof of Lemma 9.3
Problems
Chapter 10 Stable solutions
10.1 Organization of the chapter
10.2 Introduction, stability and subsampling
10.2.1 Stability paths for linear models
10.2.1.1 Riboflavin production with Bacillus subtilis
10.2.1.2 Motif regression in computational biology
10.3 Stability selection
10.3.1 Choice of regularization and error control
10.3.1.1 Pointwise control
10.3.1.2 Examples of procedures choosing
10.3.1.3 The exchangeability condition and the generality of Theorem 10.1
10.4 Numerical results
10.5 Extensions
10.5.1 Randomized Lasso
10.6 Improvements from a theoretical perspective
10.7 Proofs
10.7.1 Sample splitting
10.7.2 Proof of Theorem 10.1
Problems
Chapter 11 P-values for linear models and beyond
11.1 Organization of the chapter
11.2 Introduction, sample splitting and high-dimensional variable selection
11.3 Multi sample splitting and familywise error control
11.3.1 Aggregation over multiple p-values
11.3.2 Control of familywise error
11.4 Multi sample splitting and false discovery rate
11.4.1 Control of false discovery rate
11.5 Numerical results
11.5.1 Simulations and familywise error control
11.5.1.1 Comparisons with adaptive Lasso
11.5.2 Familywise error control for motif regression in computational biology
11.5.3 Simulations and false discovery rate control
11.6 Consistent variable selection
11.6.1 Single sample split method
11.6.2 Multi sample split method
11.7 Extensions
11.7.1 Other models
11.7.2 Control of expected false positive selections
11.8 Proofs
11.8.1 Proof of Proposition 11.1
11.8.2 Proof of Theorem 11.1
11.8.3 Proof of Theorem 11.2
11.8.4 Proof of Proposition 11.2
11.8.5 Proof of Lemma 11.3
Problems
Chapter 12 Boosting and greedy algorithms
12.1 Organization of the chapter
12.2 Introduction and preliminaries
12.2.1 Ensemble methods: multiple prediction and aggregation
12.2.2 AdaBoost
12.3 Gradient boosting: a functional gradient descent algorithm
12.3.1 The generic FGD algorithm
12.3.1.1 Alternative formulation in function space
12.4 Some loss functions and boosting algorithms
12.4.1 Regression
12.4.2 Binary classification
12.4.3 Poisson regression
12.4.4 Two important boosting algorithms
12.4.4.1 L2Boosting
12.4.4.2 BinomialBoosting: the FGD version of LogitBoost
12.4.5 Other data structures and models
12.5 Choosing the base procedure
12.5.1 Componentwise linear least squares for generalized linear models
12.5.2 Componentwise smoothing spline for additive models
12.5.3 Trees
12.5.4 The low-variance principle
12.5.5 Initialization of boosting
12.6 L2Boosting
12.6.1 Nonparametric curve estimation: some basic insights about boosting
12.6.1.1 L2Boosting using kernel estimators
12.6.2 L2Boosting for high-dimensional linear models
12.6.2.1 Connections to the Lasso
12.6.2.2 Asymptotic consistency in high dimensions
12.7 Forward selection and orthogonal matching pursuit
12.7.1 Linear models and squared error loss
12.7.1.1 Orthogonal matching pursuit
12.8 Proofs
12.8.1 Proof of Theorem 12.1
12.8.2 Proof of Theorem 12.2
12.8.3 Proof of Theorem 12.3
Problems
Chapter 13 Graphical modeling
13.1 Organization of the chapter
13.2 Preliminaries about graphical models
13.3 Undirected graphical models
13.3.1 Markov properties for undirected graphs
13.4 Gaussian graphical models
13.4.1 Penalized estimation for covariance matrix and edge set
13.4.1.1 Stability selection with the GLasso
13.4.2 Nodewise regression
13.4.3 Covariance estimation based on undirected graph
13.4.3.1 Gene expressions from the isoprenoid biosynthesis pathway in Arabidopsis thaliana
13.5 Ising model for binary random variables
13.6 Faithfulness assumption
13.6.1 Failure of faithfulness
13.6.1.1 Cancellation of regression coefficients
13.6.1.2 Moving-average model
13.6.2 Faithfulness and Gaussian graphical models
13.7 The PC-algorithm: an iterative estimation method
13.7.1 Population version of the PC-algorithm
13.7.2 Sample version for the PC-algorithm
13.7.2.1 Computational complexity
13.8 Consistency for high-dimensional data
13.8.1 An illustration
13.8.2 Theoretical analysis of the PC-algorithm
13.8.2.1 Analysis of partial correlations
13.8.2.2 Proof of Theorem 13.1
13.9 Back to linear models
13.9.1 Partial faithfulness
13.9.1.1 Sufficient condition for partial faithfulness
13.9.2 The PC-simple algorithm
13.9.2.1 Population version of the PC-simple algorithm
13.9.2.2 Sample version of the PC-simple algorithm
13.9.3 Numerical results
13.9.3.1 Real data: riboflavin production by Bacillus subtilis
13.9.3.2 Simulated data
13.9.4 Asymptotic results in high dimensions
13.9.4.1 Consistency of the PC-simple algorithm
13.9.4.2 Discussion of the conditions of Theorem 13.3
13.9.5 Correlation screening (sure independence screening)
13.9.5.1 Asymptotic behavior of correlation screening
13.9.6 Proofs
13.9.6.1 Proof of Theorem 13.2
13.9.6.2 Proof of Proposition 13.6
13.9.6.3 Proof of Theorem 13.3
13.9.6.4 Proof of Theorem 13.4
Problems
Chapter 14 Probability and moment inequalities
14.1 Organization of this chapter
14.2 Some simple results for a single random variable
14.2.1 Sub-exponential random variables
14.2.2 Sub-Gaussian random variables
14.2.3 Jensen’s inequality for partly concave functions
14.3 Bernstein’s inequality
14.4 Hoeffding’s inequality
14.5 The maximum of averages
14.5.1 Using Bernstein’s inequality
14.5.2 Using Hoeffding’s inequality
14.5.3 Having sub-Gaussian random variables
14.6 Concentration inequalities
14.6.1 Bousquet’s inequality
14.6.2 Massart’s inequality
14.6.3 Sub-Gaussian random variables
14.7 Symmetrization and contraction
14.8 Concentration inequalities for Lipschitz loss functions
14.9 Concentration for squared error loss with random design
14.9.1 The inner product of noise and linear functions
14.9.2 Squared linear functions
14.9.3 Squared error loss
14.10 Assuming only lower order moments
14.10.1 Nemirovski moment inequality
14.10.2 A uniform inequality for quadratic forms
14.11 Using entropy for concentration in the sub-Gaussian case
14.12 Some entropy results
14.12.1 Entropy of finite-dimensional spaces and general convex hulls
14.12.2 Sets with restrictions on the coefficients
14.12.3 Convex hulls of small sets: entropy with log-term
14.12.4 Convex hulls of small sets: entropy without log-term
14.12.5 Further refinements
14.12.6 An example: functions with m-th derivative of bounded variation
14.12.7 Proofs for this section (Section 14.12)
Problems
Author Index
Index
References