Statistics for High-Dimensional Data
Preface
Contents
Chapter 1 Introduction
1.1 The framework
1.2 The possibilities and challenges
1.3 About the book
1.3.1 Organization of the book
1.4 Some examples
1.4.1 Prediction and biomarker discovery in genomics
1.4.1.1 Further biology applications treated in the book
Chapter 2 Lasso for linear models
2.1 Organization of the chapter
2.2 Introduction and preliminaries
2.2.1 The Lasso estimator
2.2.1.1 Estimation of the error variance
2.3 Orthonormal design
2.4 Prediction
2.4.1 Practical aspects about the Lasso for prediction
2.4.1.1 Binary classification of lymph node status using gene expressions
2.4.2 Some results from asymptotic theory
2.5 Variable screening and ‖β̂ − β⁰‖_q-norms
2.6 Variable selection
2.6.1 Neighborhood stability and irrepresentable condition
2.7 Key properties and corresponding assumptions: a summary
2.8 The adaptive Lasso: a two-stage procedure
2.8.1 An illustration: simulated data and motif regression
2.8.2 Orthonormal design
2.8.3 The adaptive Lasso: variable selection under weak conditions
2.8.4 Computation
2.8.5 Multi-step adaptive Lasso
2.8.5.1 Motif regression in computational biology
2.8.6 Non-convex penalty functions
2.9 Thresholding the Lasso
2.10 The relaxed Lasso
2.11 Degrees of freedom of the Lasso
2.12 Path-following algorithms
2.12.1 Coordinatewise optimization and shooting algorithms
2.12.1.1 Active set strategy
2.13 Elastic net: an extension
Problems
Chapter 3 Generalized linear models and the Lasso
3.1 Organization of the chapter
3.2 Introduction and preliminaries
3.2.1 The Lasso estimator: penalizing the negative log-likelihood
3.3 Important examples of generalized linear models
3.3.1 Binary response variable and logistic regression
3.3.2 Poisson regression
3.3.3 Multi-category response variable and multinomial distribution
3.3.3.1 Contingency tables
Problems
Chapter 4 The group Lasso
4.1 Organization of the chapter
4.2 Introduction and preliminaries
4.2.1 The group Lasso penalty
4.3 Factor variables as covariates
4.3.1 Prediction of splice sites in DNA sequences
4.4 Properties of the group Lasso for generalized linear models
4.5 The generalized group Lasso penalty
4.5.1 Groupwise prediction penalty and parametrization invariance
4.6 The adaptive group Lasso
4.7 Algorithms for the group Lasso
4.7.1 Block coordinate descent
4.7.1.1 Squared error loss
4.7.1.2 Active set strategy
4.7.1.3 General convex loss
4.7.2 Block coordinate gradient descent
Problems
Chapter 5 Additive models and many smooth univariate functions
5.1 Organization of the chapter
5.2 Introduction and preliminaries
5.2.1 Penalized maximum likelihood for additive models
5.3 The sparsity-smoothness penalty
5.3.1 Orthogonal basis and diagonal smoothing matrices
5.3.2 Natural cubic splines and Sobolev spaces
5.3.3 Computation
5.3.3.1 Determining the zero estimates
5.3.3.2 Updates for the non-zero estimates
5.4 A sparsity-smoothness penalty of group Lasso type
5.4.1 Computational algorithm
5.4.2 Alternative approaches
5.5 Numerical examples
5.5.1 Simulated example
5.5.2 Motif regression
5.6 Prediction and variable selection
5.7 Generalized additive models
5.8 Linear model with varying coefficients
5.8.1 Properties for prediction
5.8.2 Multivariate linear model
5.9 Multitask learning
Problems
Chapter 6 Theory for the Lasso
6.1 Organization of this chapter
6.2 Least squares and the Lasso
6.2.1 Introduction
6.2.2 The result assuming the truth is linear
6.2.3 Linear approximation of the truth
6.2.4 A further refinement: handling smallish coefficients
6.3 The setup for general convex loss
6.4 The margin condition
6.5 Generalized linear model without penalty
6.6 Consistency of the Lasso for general loss
6.7 An oracle inequality
6.8 The ℓq-error for 1 ≤ q ≤ 2
6.9 The weighted Lasso
6.10 The adaptively weighted Lasso
6.11 Concave penalties
6.11.1 Sparsity oracle inequalities for least squares with ℓr-penalty
6.11.2 Proofs for this section (Section 6.11)
6.12 Compatibility and (random) matrices
6.13 On the compatibility condition
6.13.1 Direct bounds for the compatibility constant
6.13.2 Bounds using ‖β_S‖₁² ≤ s‖β_S‖₂²
6.13.3 Sets
6.13.4 Restricted isometry
6.13.5 Sparse eigenvalues
6.13.6 Further coherence notions
6.13.7 An overview of the various eigenvalue flavored constants
Problems
Chapter 7 Variable selection with the Lasso
7.1 Introduction
7.2 Some results from literature
7.3 Organization of this chapter
7.4 The beta-min condition
7.5 The irrepresentable condition in the noiseless case
7.5.1 Definition of the irrepresentable condition
7.5.2 The KKT conditions
7.5.3 Necessity and sufficiency for variable selection
7.5.4 The irrepresentable condition implies the compatibility condition
7.5.5 The irrepresentable condition and restricted regression
7.5.6 Selecting a superset of the true active set
7.5.7 The weighted irrepresentable condition
7.5.8 The weighted irrepresentable condition and restricted regression
7.5.9 The weighted Lasso with “ideal” weights
7.6 Definition of the adaptive and thresholded Lasso
7.6.1 Definition of adaptive Lasso
7.6.2 Definition of the thresholded Lasso
7.6.3 Order symbols
7.7 A recollection of the results obtained in Chapter 6
7.8 The adaptive Lasso and thresholding: invoking sparse eigenvalues
7.8.1 The conditions on the tuning parameters
7.8.2 The results
7.8.3 Comparison with the Lasso
7.8.4 Comparison between adaptive and thresholded Lasso
7.8.5 Bounds for the number of false negatives
7.8.6 Imposing beta-min conditions
7.9 The adaptive Lasso without invoking sparse eigenvalues
7.9.1 The condition on the tuning parameter
7.9.2 The results
7.10 Some concluding remarks
7.11 Technical complements for the noiseless case without sparse eigenvalues
7.11.1 Prediction error for the noiseless (weighted) Lasso
7.11.2 The number of false positives of the noiseless (weighted) Lasso
7.11.3 Thresholding the noiseless initial estimator
7.11.4 The noiseless adaptive Lasso
7.12 Technical complements for the noisy case without sparse eigenvalues
7.13 Selection with concave penalties
Problems
Chapter 8 Theory for ℓ1/ℓ2-penalty procedures
8.1 Introduction
8.2 Organization and notation of this chapter
8.3 Regression with group structure
8.3.1 The loss function and penalty
8.3.2 The empirical process
8.3.3 The group Lasso compatibility condition
8.3.4 A group Lasso sparsity oracle inequality
8.3.5 Extensions
8.4 High-dimensional additive model
8.4.1 The loss function and penalty
8.4.2 The empirical process
8.4.2.1 Sobolev smoothness
8.4.2.2 Diagonalized smoothness
8.4.3 The smoothed Lasso compatibility condition
8.4.4 A smoothed group Lasso sparsity oracle inequality
8.4.5 On the choice of the penalty
8.4.5.1 Using only one of the two terms
8.4.5.2 Taking a single square root
8.4.5.3 Taking the squared instead of the non-squared smoothness norm
8.4.5.4 Taking a single square root and adding a squared smoothness norm
8.5 Linear model with time-varying coefficients
8.5.1 The loss function and penalty
8.5.2 The empirical process
8.5.3 The compatibility condition for the time-varying coefficients model
8.5.4 A sparsity oracle inequality for the time-varying coefficients model
8.6 Multivariate linear model and multitask learning
8.6.1 The loss function and penalty
8.6.2 The empirical process
8.6.3 The multitask compatibility condition
8.6.4 A multitask sparsity oracle inequality
8.7 The approximation condition for the smoothed group Lasso
8.7.1 Sobolev smoothness
8.7.2 Diagonalized smoothness
Problems
Chapter 9 Non-convex loss functions and ℓ1-regularization
9.1 Organization of the chapter
9.2 Finite mixture of regressions model
9.2.1 Finite mixture of Gaussian regressions model
9.2.1.1 Reparametrized mixture of regressions model
9.2.2 ℓ1-penalized maximum likelihood estimator
9.2.2.1 ℓ1-penalization for reparametrized linear models
9.2.2.2 ℓ1-penalized MLE for mixture of Gaussian regressions
9.2.3 Properties of the ℓ1-penalized maximum likelihood estimator
9.2.4 Selection of the tuning parameters
9.2.5 Adaptive penalization
9.2.6 Riboflavin production with Bacillus subtilis
9.2.7 Simulated example
9.2.8 Numerical optimization
9.2.9 GEM algorithm for optimization
9.2.9.1 Numerical Convergence of the BCD-GEM algorithm
9.2.10 Proof of Proposition 9.2
9.3 Linear mixed effects models
9.3.1 The model and penalized estimation
9.3.2 The Lasso in linear mixed effects models
9.3.3 Estimation of the random effects coefficients
9.3.4 Selection of the regularization parameter
9.3.5 Properties of the Lasso in linear mixed effects models
9.3.6 Adaptive penalized maximum likelihood estimator
9.3.7 Computational algorithm
9.3.8 Numerical results
9.3.8.1 Application: Riboflavin production data
9.4 Theory for ℓ1-penalization with non-convex negative log-likelihood
9.4.1 The setting and notation
9.4.1.1 The margin
9.4.1.2 The empirical process
9.4.2 Oracle inequality for the Lasso for non-convex loss functions
9.4.3 Theory for finite mixture of regressions models
9.4.3.1 Oracle result for FMR models
9.4.4 Theory for linear mixed effects models
9.5 Proofs for Section 9.4
9.5.1 Proof of Lemma 9.1
9.5.2 Proof of Lemma 9.2
9.5.3 Proof of Theorem 9.1
9.5.4 Proof of Lemma 9.3
Problems
Chapter 10 Stable solutions
10.1 Organization of the chapter
10.2 Introduction, stability and subsampling
10.2.1 Stability paths for linear models
10.2.1.1 Riboflavin production with Bacillus subtilis
10.2.1.2 Motif regression in computational biology
10.3 Stability selection
10.3.1 Choice of regularization and error control
10.3.1.1 Pointwise control
10.3.1.2 Examples of procedures choosing
10.3.1.3 The exchangeability condition and the generality of Theorem 10.1
10.4 Numerical results
10.5 Extensions
10.5.1 Randomized Lasso
10.6 Improvements from a theoretical perspective
10.7 Proofs
10.7.1 Sample splitting
10.7.2 Proof of Theorem 10.1
Problems
Chapter 11 P-values for linear models and beyond
11.1 Organization of the chapter
11.2 Introduction, sample splitting and high-dimensional variable selection
11.3 Multi sample splitting and familywise error control
11.3.1 Aggregation over multiple p-values
11.3.2 Control of familywise error
11.4 Multi sample splitting and false discovery rate
11.4.1 Control of false discovery rate
11.5 Numerical results
11.5.1 Simulations and familywise error control
11.5.1.1 Comparisons with adaptive Lasso
11.5.2 Familywise error control for motif regression in computational biology
11.5.3 Simulations and false discovery rate control
11.6 Consistent variable selection
11.6.1 Single sample split method
11.6.2 Multi sample split method
11.7 Extensions
11.7.1 Other models
11.7.2 Control of expected false positive selections
11.8 Proofs
11.8.1 Proof of Proposition 11.1
11.8.2 Proof of Theorem 11.1
11.8.3 Proof of Theorem 11.2
11.8.4 Proof of Proposition 11.2
11.8.5 Proof of Lemma 11.3
Problems
Chapter 12 Boosting and greedy algorithms
12.1 Organization of the chapter
12.2 Introduction and preliminaries
12.2.1 Ensemble methods: multiple prediction and aggregation
12.2.2 AdaBoost
12.3 Gradient boosting: a functional gradient descent algorithm
12.3.1 The generic FGD algorithm
12.3.1.1 Alternative formulation in function space
12.4 Some loss functions and boosting algorithms
12.4.1 Regression
12.4.2 Binary classification
12.4.3 Poisson regression
12.4.4 Two important boosting algorithms
12.4.4.1 L2Boosting
12.4.4.2 BinomialBoosting: the FGD version of LogitBoost
12.4.5 Other data structures and models
12.5 Choosing the base procedure
12.5.1 Componentwise linear least squares for generalized linear models
12.5.2 Componentwise smoothing spline for additive models
12.5.3 Trees
12.5.4 The low-variance principle
12.5.5 Initialization of boosting
12.6 L2Boosting
12.6.1 Nonparametric curve estimation: some basic insights about boosting
12.6.1.1 L2Boosting using kernel estimators
12.6.2 L2Boosting for high-dimensional linear models
12.6.2.1 Connections to the Lasso
12.6.2.2 Asymptotic consistency in high dimensions
12.7 Forward selection and orthogonal matching pursuit
12.7.1 Linear models and squared error loss
12.7.1.1 Orthogonal matching pursuit
12.8 Proofs
12.8.1 Proof of Theorem 12.1
12.8.2 Proof of Theorem 12.2
12.8.3 Proof of Theorem 12.3
Problems
Chapter 13 Graphical modeling
13.1 Organization of the chapter
13.2 Preliminaries about graphical models
13.3 Undirected graphical models
13.3.1 Markov properties for undirected graphs
13.4 Gaussian graphical models
13.4.1 Penalized estimation for covariance matrix and edge set
13.4.1.1 Stability selection with the GLasso
13.4.2 Nodewise regression
13.4.3 Covariance estimation based on undirected graph
13.4.3.1 Gene expressions from isoprenoid biosynthesis pathway in Arabidopsis thaliana
13.5 Ising model for binary random variables
13.6 Faithfulness assumption
13.6.1 Failure of faithfulness
13.6.1.1 Cancellation of regression coefficients
13.6.1.2 Moving-average model
13.6.2 Faithfulness and Gaussian graphical models
13.7 The PC-algorithm: an iterative estimation method
13.7.1 Population version of the PC-algorithm
13.7.2 Sample version for the PC-algorithm
13.7.2.1 Computational complexity
13.8 Consistency for high-dimensional data
13.8.1 An illustration
13.8.2 Theoretical analysis of the PC-algorithm
13.8.2.1 Analysis of partial correlations
13.8.2.2 Proof of Theorem 13.1
13.9 Back to linear models
13.9.1 Partial faithfulness
13.9.1.1 Sufficient condition for partial faithfulness
13.9.2 The PC-simple algorithm
13.9.2.1 Population version of the PC-simple algorithm
13.9.2.2 Sample version of the PC-simple algorithm
13.9.3 Numerical results
13.9.3.1 Real data: riboflavin production by Bacillus subtilis
13.9.3.2 Simulated data
13.9.4 Asymptotic results in high dimensions
13.9.4.1 Consistency of the PC-simple algorithm
13.9.4.2 Discussion of the conditions of Theorem 13.3
13.9.5 Correlation screening (sure independence screening)
13.9.5.1 Asymptotic behavior of correlation screening
13.9.6 Proofs
13.9.6.1 Proof of Theorem 13.2
13.9.6.2 Proof of Proposition 13.6
13.9.6.3 Proof of Theorem 13.3
13.9.6.4 Proof of Theorem 13.4
Problems
Chapter 14 Probability and moment inequalities
14.1 Organization of this chapter
14.2 Some simple results for a single random variable
14.2.1 Sub-exponential random variables
14.2.2 Sub-Gaussian random variables
14.2.3 Jensen’s inequality for partly concave functions
14.3 Bernstein’s inequality
14.4 Hoeffding’s inequality
14.5 The maximum of averages
14.5.1 Using Bernstein’s inequality
14.5.2 Using Hoeffding’s inequality
14.5.3 Having sub-Gaussian random variables
14.6 Concentration inequalities
14.6.1 Bousquet’s inequality
14.6.2 Massart’s inequality
14.6.3 Sub-Gaussian random variables
14.7 Symmetrization and contraction
14.8 Concentration inequalities for Lipschitz loss functions
14.9 Concentration for squared error loss with random design
14.9.1 The inner product of noise and linear functions
14.9.2 Squared linear functions
14.9.3 Squared error loss
14.10 Assuming only lower order moments
14.10.1 Nemirovski moment inequality
14.10.2 A uniform inequality for quadratic forms
14.11 Using entropy for concentration in the sub-Gaussian case
14.12 Some entropy results
14.12.1 Entropy of finite-dimensional spaces and general convex hulls
14.12.2 Sets with restrictions on the coefficients
14.12.3 Convex hulls of small sets: entropy with log-term
14.12.4 Convex hulls of small sets: entropy without log-term
14.12.5 Further refinements
14.12.6 An example: functions with -th derivative of bounded variation
14.12.7 Proofs for this section (Section 14.12)
Problems
Author Index
Index
References
Springer Series in Statistics
Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger
For other titles published in this series, go to http://www.springer.com/series/692
Peter Bühlmann · Sara van de Geer
Statistics for High-Dimensional Data
Methods, Theory and Applications
Peter Bühlmann
Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland
buhlmann@stat.math.ethz.ch

Sara van de Geer
Seminar for Statistics, ETH Zürich, CH-8092 Zürich, Switzerland
geer@stat.math.ethz.ch

ISSN 0172-7397
ISBN 978-3-642-20191-2        e-ISBN 978-3-642-20192-9
DOI 10.1007/978-3-642-20192-9
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011930793

© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: deblik, Berlin
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Anthony and Sigi and Tabea, Anna, Sophia, Simon and Lukas
Preface

High-dimensional data are nowadays the rule rather than the exception in areas like information technology, bioinformatics or astronomy, to name just a few. The word "high-dimensional" refers to the situation where the number of unknown parameters to be estimated is one or several orders of magnitude larger than the number of samples in the data. Classical statistical inference cannot be used for high-dimensional problems. For example, least-squares fitting of a linear model with many more unknown parameters than observations, together with corresponding standard errors and measures of significance, is ill-posed. It is rather obvious that without additional assumptions, say restricting to a certain class of models, high-dimensional statistical inference is impossible. A well-established framework for fitting many parameters is based on assuming structural smoothness, enabling estimation of smooth functions. Recent years have witnessed a revolution of methodological, computational and mathematical advances which allow for high-dimensional statistical inference based on assuming certain notions of sparsity. Shifting the focus from smoothness to sparsity constraints, or combining the two, opens the path for many more applications involving complex data. For example, the sparsity assumption that the health status of a person depends only on a few among several thousands of biomarkers appears much more realistic than a model where all the thousands of variables would contribute in a smooth way to the state of health.

This book brings together methodological concepts, computational algorithms, a few applications and mathematical theory for high-dimensional statistics. The mathematical underpinning of methodology and computing has implications for exploring exciting possibilities and understanding fundamental limitations. In this sense, the combination of methodology and theory builds the foundation of the book. We present the methods and their potential for data analysis with a view on the underlying mathematical assumptions and properties; vice versa, the theoretical derivations are motivated by applicability and implications for real data problems. The mathematical results yield additional insights and allow us to categorize different methods and algorithms in terms of what they can and cannot achieve. The book
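To make the preceding point about sparsity concrete, the following short Python sketch simulates a linear model with far more parameters than observations and a sparse true coefficient vector, then fits an ℓ1-penalized (Lasso) regression. This example is not taken from the book; the sample sizes, signal strength, and the use of scikit-learn's LassoCV with cross-validated tuning are illustrative assumptions only.

# Minimal illustrative sketch (not from the book): a p >> n linear model with a
# sparse coefficient vector. Ordinary least squares is ill-posed here (the design
# matrix has rank at most n < p), while the l1-penalized (Lasso) estimator
# exploits sparsity. All settings (n, p, s0, noise level) are assumptions.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s0 = 50, 1000, 5                      # far more parameters than samples
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s0] = 2.0                        # only s0 variables are truly active
y = X @ beta_true + rng.standard_normal(n)  # noisy observations

# Regularization parameter chosen by 5-fold cross-validation.
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_)

print("number of selected variables:", selected.size)
print("true actives among selected:", np.intersect1d(selected, np.arange(s0)).size)

Under such a sparsity assumption the Lasso typically recovers the handful of truly active variables, which is the kind of behavior the methods and theory in the chapters listed above make precise.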