High Dimensional Statistics
Lecture Notes (this version: June 1, 2018)
Philippe Rigollet and Jan-Christian Hütter
Spring 2017
Preface

These lecture notes were written for the course 18.657, High Dimensional Statistics at MIT. They build on a set of notes that was prepared at Princeton University in 2013-14 and modified (and hopefully improved) over the years.

Over the past decade, statistics has undergone drastic changes with the development of high-dimensional statistical inference. Indeed, on each individual, more and more features are measured, to the point that their number usually far exceeds the number of observations. This is the case in biology, and specifically genetics, where millions of (combinations of) genes are measured for a single individual. High-resolution imaging, finance, online advertising, climate studies . . . the list of intensive data-producing fields is too long to be established exhaustively. Clearly not all measured features are relevant for a given task and most of them are simply noise. But which ones? What can be done with so little data and so much noise? Surprisingly, the situation is not that bad, and on some simple models we can assess to what extent meaningful statistical methods can be applied. Regression is one such simple model.

Regression analysis can be traced back to 1632, when Galileo Galilei used a procedure to infer a linear relationship from noisy data. It was not until the early 19th century that Gauss and Legendre developed a systematic procedure: the least-squares method. Since then, regression has been studied in so many forms that much insight has been gained, and recent advances in high-dimensional statistics would not have been possible without standing on the shoulders of giants. In these notes, we will explore one (obviously subjective) giant on whose shoulders high-dimensional statistics stands: nonparametric statistics. The works of Ibragimov and Has’minskii in the seventies, followed by many researchers from the Russian school, have contributed to developing a large toolkit to understand regression with an infinite number of parameters. Much insight can be gained from this work to understand high-dimensional or sparse regression, and it comes as no surprise that Donoho and Johnstone made the first contributions on this topic in the early nineties. Therefore, while not obviously connected to high-dimensional statistics, we will talk about nonparametric estimation.

I borrowed this disclaimer (and the template) from my colleague Ramon van Handel. It does apply here. I have no illusions about the state of these notes: they were written
rather quickly, sometimes at the rate of a chapter a week. I have no doubt that many errors remain in the text; at the very least, many of the proofs are extremely compact and should be made a little clearer, as is befitting of a pedagogical (?) treatment. If I have another opportunity to teach such a course, I will go over the notes again in detail and attempt the necessary modifications. For the time being, however, the notes are available as-is. As any good set of notes, they should be perpetually improved and updated, but a two or three year horizon is more realistic. Therefore, if you have any comments, questions, suggestions, omissions, and of course mistakes, please let me know. I can be contacted by e-mail at rigollet@math.mit.edu.

Acknowledgements. These notes were improved thanks to the careful reading and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall, Gautam Kamath, Hengrui Luo, Kevin Lin, Ali Makhdoumi, Yaroslav Mukhin, Ludwig Schmidt, Vira Semenova, Yuyan Wang, Jonathan Weed, Chiyuan Zhang and Jianhao Zhang. These notes were written under the partial support of the National Science Foundation, CAREER award DMS-1053987.

Required background. I assume that the reader has had basic courses in probability and mathematical statistics. Some elementary background in analysis and measure theory is helpful but not required. Some basic notions of linear algebra, especially the spectral decomposition of matrices, are required for the later chapters.

Since the first version of these notes was posted, a couple of manuscripts on high-dimensional probability, by Ramon van Handel [vH17] and Roman Vershynin [Ver18], have been published. Both are of outstanding quality (much higher than the present notes) and very related to this material. I strongly recommend that the reader learn about this fascinating topic in parallel with high-dimensional statistics.
Notation

Functions, sets, vectors

[n]                  Set of integers [n] = {1, . . . , n}
S^{d-1}              Unit sphere in dimension d
1I(·)                Indicator function
|x|_q                ℓ_q norm of x, defined by |x|_q = (Σ_i |x_i|^q)^{1/q} for q > 0
|x|_0                ℓ_0 norm of x, defined to be the number of nonzero coordinates of x
f^{(k)}              k-th derivative of f
e_j                  j-th vector of the canonical basis
A^c                  Complement of set A
conv(S)              Convex hull of set S
a_n ≲ b_n            a_n ≤ C b_n for a numerical constant C > 0
S_n                  Symmetric group on n elements

Matrices

I_p                  Identity matrix of IR^p
Tr(A)                Trace of a square matrix A
S_d                  Symmetric matrices in IR^{d×d}
S^+                  Symmetric positive semi-definite matrices in IR^{d×d}
S^{++}               Symmetric positive definite matrices in IR^{d×d}
A ⪯ B                Order relation given by B − A ∈ S^+
A ≺ B                Order relation given by B − A ∈ S^{++}
M†                   Moore-Penrose pseudoinverse of M
∇_x f(x)             Gradient of f at x
∇_x f(x)|_{x=x_0}    Gradient of f at x_0

Distributions

N(µ, σ²)             Univariate Gaussian distribution with mean µ ∈ IR and variance σ² > 0
N_d(µ, Σ)            d-variate Gaussian distribution with mean µ ∈ IR^d and covariance matrix Σ ∈ IR^{d×d}
subG(σ²)             Univariate sub-Gaussian distributions with variance proxy σ² > 0
subG_d(σ²)           d-variate sub-Gaussian distributions with variance proxy σ² > 0
subE(σ²)             Sub-exponential distributions with variance proxy σ² > 0
Ber(p)               Bernoulli distribution with parameter p ∈ [0, 1]
Bin(n, p)            Binomial distribution with parameters n ≥ 1, p ∈ [0, 1]
Lap(λ)               Double exponential (or Laplace) distribution with parameter λ > 0
P_X                  Marginal distribution of X

Function spaces

W(β, L)              Sobolev class of functions
Θ(β, Q)              Sobolev ellipsoid of ℓ_2(IN)
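As a quick illustration of the vector notation (not part of the original notes), the following minimal Python sketch evaluates |x|_q and |x|_0 on a small sparse vector; the function names are chosen here purely for illustration.

    import numpy as np

    def lq_norm(x, q):
        # |x|_q = (sum_i |x_i|^q)^(1/q) for q > 0 (a genuine norm only when q >= 1)
        return np.sum(np.abs(x) ** q) ** (1.0 / q)

    def l0_norm(x):
        # |x|_0 = number of nonzero coordinates of x (not a norm, despite the notation)
        return np.count_nonzero(x)

    x = np.array([3.0, 0.0, -4.0, 0.0])
    print(lq_norm(x, 1))   # 7.0
    print(lq_norm(x, 2))   # 5.0
    print(l0_norm(x))      # 2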
Contents

Preface
Notation
Contents
Introduction

1 Sub-Gaussian Random Variables
  1.1 Gaussian tails and MGF
  1.2 Sub-Gaussian random variables and Chernoff bounds
  1.3 Sub-exponential random variables
  1.4 Maximal inequalities
  1.5 Sums of independent random matrices
  1.6 Problem set

2 Linear Regression Model
  2.1 Fixed design linear regression
  2.2 Least squares estimators
  2.3 The Gaussian Sequence Model
  2.4 High-dimensional linear regression
  2.5 Problem set

3 Misspecified Linear Models
  3.1 Oracle inequalities
  3.2 Nonparametric regression
  3.3 Problem Set

4 Minimax Lower Bounds
  4.1 Optimality in a minimax sense
  4.2 Reduction to finite hypothesis testing
  4.3 Lower bounds based on two hypotheses
  4.4 Lower bounds based on many hypotheses
  4.5 Application to the Gaussian sequence model
  4.6 Lower bounds for sparse estimation via χ² divergence
  4.7 Lower bounds for estimating the ℓ1 norm via moment matching
  4.8 Problem Set

5 Matrix estimation
  5.1 Basic facts about matrices
  5.2 Multivariate regression
  5.3 Covariance matrix estimation
  5.4 Principal component analysis
  5.5 Graphical models
  5.6 Problem set

Bibliography
Introduction

This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and of the questions that arise around regression. We adopt the statistical learning point of view, where the task of prediction prevails. Nevertheless, many interesting questions will remain unanswered when the last page comes: testing, model selection, implementation, . . .

REGRESSION ANALYSIS AND PREDICTION RISK

Model and definitions

Let (X, Y) ∈ X × Y, where X is called a feature and lives in a topological space X, and Y ∈ Y ⊂ IR is called the response, or sometimes the label when Y is a discrete set, e.g., Y = {0, 1}. Often X ⊂ IR^d, in which case X is called a vector of covariates or simply a covariate. Our goal will be to predict Y given X, and for our problem to be meaningful, we need Y to depend non-trivially on X. Our task would be done if we had access to the conditional distribution of Y given X. This is the world of the probabilist. The statistician does not have access to this valuable information but rather has to estimate it, at least partially. The regression function gives a simple summary of this conditional distribution, namely, the conditional expectation. Formally, the regression function of Y onto X is defined by

    f(x) = IE[Y | X = x] ,   x ∈ X .

As we will see, it arises naturally in the context of prediction.

Best prediction and prediction risk

Suppose for a moment that you know the conditional distribution of Y given X. Given the realization of X = x, your goal is to predict the realization of Y. Intuitively, we would like to find a measurable¹ function g : X → Y such that g(X) is close to Y, in other words, such that |Y − g(X)| is small. But |Y − g(X)| is a random variable, so it is not clear what "small" means in this context. A somewhat arbitrary answer can be given by declaring a random

¹ All topological spaces are equipped with their Borel σ-algebra.