High Dimensional Statistics
Lecture Notes
(This version: June 1, 2018)
Philippe Rigollet and Jan-Christian H¨utter
Spring 2017
Preface
These lecture notes were written for the course 18.657, High Dimensional
Statistics at MIT. They build on a set of notes that was prepared at Prince-
ton University in 2013-14 that was modified (and hopefully improved) over the
years.
Over the past decade, statistics have undergone drastic changes with the
development of high-dimensional statistical inference.
Indeed, on each indi-
vidual, more and more features are measured to a point that their number
usually far exceeds the number of observations. This is the case in biology and
specifically genetics where millions of (combinations of) genes are measured
for a single individual. High resolution imaging, finance, online advertising,
climate studies . . . the list of intensive data producing fields is too long to be
established exhaustively. Clearly not all measured features are relevant for a
given task and most of them are simply noise. But which ones? What can be
done with so little data and so much noise? Surprisingly, the situation is not
that bad and on some simple models we can assess to which extent meaningful
statistical methods can be applied. Regression is one such simple model.
Regression analysis can be traced back to 1632 when Galileo Galilei used
a procedure to infer a linear relationship from noisy data.
It was not until
the early 19th century that Gauss and Legendre developed a systematic pro-
cedure: the least-squares method. Since then, regression has been studied
in so many forms that much insight has been gained and recent advances on
high-dimensional statistics would not have been possible without standing on
the shoulders of giants. In these notes, we will explore one, obviously subjec-
tive giant on whose shoulders high-dimensional statistics stand: nonparametric
statistics.
The works of Ibragimov and Has’minskii in the seventies followed by many
researchers from the Russian school have contributed to developing a large
toolkit to understand regression with an infinite number of parameters. Much
insight from this work can be gained to understand high-dimensional or sparse
regression and it comes as no surprise that Donoho and Johnstone have made
the first contributions on this topic in the early nineties.
Therefore, while not obviously connected to high dimensional statistics, we
will talk about nonparametric estimation. I borrowed this disclaimer (and the
template) from my colleague Ramon van Handel. It does apply here.
I have no illusions about the state of these notes—they were written
i
Preface
ii
rather quickly, sometimes at the rate of a chapter a week. I have
no doubt that many errors remain in the text; at the very least
many of the proofs are extremely compact, and should be made a
little clearer as is befitting of a pedagogical (?) treatment. If I have
another opportunity to teach such a course, I will go over the notes
again in detail and attempt the necessary modifications. For the
time being, however, the notes are available as-is.
As any good set of notes, they should be perpetually improved and updated
but a two or three year horizon is more realistic. Therefore, if you have any
comments, questions, suggestions, omissions, and of course mistakes, please let
me know. I can be contacted by e-mail at rigollet@math.mit.edu.
Acknowledgements. These notes were improved thanks to the careful read-
ing and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall, Gau-
tam Kamath, Hengrui Luo, Kevin Lin, Ali Makhdoumi, Yaroslav Mukhin, Lud-
wig Schmidt, Vira Semenova, Yuyan Wang, Jonathan Weed, Chiyuan Zhang
and Jianhao Zhang.
These notes were written under the partial support of the National Science
Foundation, CAREER award DMS-1053987.
Required background.
I assume that the reader has had basic courses in
probability and mathematical statistics. Some elementary background in anal-
ysis and measure theory is helpful but not required. Some basic notions of
linear algebra, especially spectral decomposition of matrices is required for the
later chapters.
Since the first version of these notes was posted a couple of manuscripts on
high-dimensional probability by Ramon van Handel [vH17] and Roman Ver-
shynin [Ver18] were published. Both are of outstanding quality—much higher
than the present notes—and very related to this material. I strongly recom-
mend the reader to learn about this fascinating topic in parallel with high-
dimensional statistics.
Notation
i |xi|q 1
q for q > 0
0 norm of x defined to be the number of nonzero coordinates of x
k-th derivative of f
j-th vector of the canonical basis
complement of set A
Convex hull of set S.
an ≤ Cbn for a numerical constant C > 0
symmetric group on n elements
Functions, sets, vectors
Set of integers [n] = {1, . . . , n}
Unit sphere in dimension d
Indicator function
q norm of x defined by |x|q =
[n]
S d−1
1I(· )
|x|q
|x|0
f (k)
ej
Ac
conv(S)
an bn
Sn
Matrices
Ip
Identity matrix of IRp
d
d
trace of a square matrix A
Symmetric matrices in IRd×d
Symmetric positive semi-definite matrices in IRd×d
Symmetric positive definite matrices in IRd×d
Order relation given by B − A ∈ S +
Order relation given by B − A ∈ S ++
Moore-Penrose pseudoinverse of M
Tr(A)
Sd
S +
S ++
A B
A ≺ B
M†
∇xf (x)
Gradient of f at x
∇xf (x)|x=x0 Gradient of f at x0
Distributions
iii
Preface
iv
N (µ, σ2)
Nd(µ, Σ)
subG(σ2)
subGd(σ2)
subE(σ2)
Ber(p)
Bin(n, p)
Lap(λ)
PX
Univariate Gaussian distribution with mean µ ∈ IR and variance σ2 > 0
d-variate distribution with mean µ ∈ IRd and covariance matrix Σ ∈ IRd×d
Univariate sub-Gaussian distributions with variance proxy σ2 > 0
d-variate sub-Gaussian distributions with variance proxy σ2 > 0
sub-Exponential distributions with variance proxy σ2 > 0
Bernoulli distribution with parameter p ∈ [0, 1]
Binomial distribution with parameters n ≥ 1, p ∈ [0, 1]
Double exponential (or Laplace) distribution with parameter λ > 0
Marginal distribution of X
Function spaces
W (β, L)
Θ(β, Q)
Sobolev class of functions
Sobolev ellipsoid of 2(IN)
Contents
Preface
Notation
Contents
Introduction
1 Sub-Gaussian Random Variables
1.1 Gaussian tails and MGF . . . . . . . . . . . . . . . . . . . . . .
1.2 Sub-Gaussian random variables and Chernoff bounds . . . . . .
1.3 Sub-exponential random variables . . . . . . . . . . . . . . . . .
1.4 Maximal inequalities . . . . . . . . . . . . . . . . . . . . . . . .
1.5 Sums of independent random matrices . . . . . . . . . . . . . .
1.6 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Linear Regression Model
2.1 Fixed design linear regression . . . . . . . . . . . . . . . . . . .
2.2 Least squares estimators . . . . . . . . . . . . . . . . . . . . . .
2.3 The Gaussian Sequence Model
. . . . . . . . . . . . . . . . . .
2.4 High-dimensional linear regression . . . . . . . . . . . . . . . .
2.5 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 Misspecified Linear Models
3.1 Oracle inequalities
. . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Nonparametric regression . . . . . . . . . . . . . . . . . . . . .
3.3 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
i
iii
v
1
14
14
16
22
25
30
35
39
39
41
48
53
68
70
71
79
90
4 Minimax Lower Bounds
91
4.1 Optimality in a minimax sense . . . . . . . . . . . . . . . . . .
91
. . . . . . . . . . . . . .
4.2 Reduction to finite hypothesis testing
93
4.3 Lower bounds based on two hypotheses
. . . . . . . . . . . . .
94
4.4 Lower bounds based on many hypotheses
. . . . . . . . . . . . 100
4.5 Application to the Gaussian sequence model . . . . . . . . . . . 103
4.6 Lower bounds for sparse estimation via χ2 divergence
. . . . . 108
v
Contents
vi
4.7 Lower bounds for estimating the 1 norm via moment matching 111
4.8 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5 Matrix estimation
120
5.1 Basic facts about matrices . . . . . . . . . . . . . . . . . . . . . 120
5.2 Multivariate regression . . . . . . . . . . . . . . . . . . . . . . . 123
5.3 Covariance matrix estimation . . . . . . . . . . . . . . . . . . . 131
5.4 Principal component analysis . . . . . . . . . . . . . . . . . . . 133
5.5 Graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.6 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Bibliography
156
Introduction
This course is mainly about learning a regression function from a collection
of observations.
In this chapter, after defining this task formally, we give
an overview of the course and the questions around regression. We adopt
the statistical learning point of view where the task of prediction prevails.
Nevertheless many interesting questions will remain unanswered when the last
page comes: testing, model selection, implementation,. . .
REGRESSION ANALYSIS AND PREDICTION RISK
Model and definitions
Let (X, Y ) ∈ X ×Y where X is called feature and lives in a topological space X
and Y ∈ Y ⊂ IR is called response or sometimes label when Y is a discrete set,
e.g., Y = {0, 1}. Often X ⊂ IRd, in which case X is called vector of covariates
or simply covariate. Our goal will be to predict Y given X and for our problem
to be meaningful, we need Y to depend non-trivially on X. Our task would be
done if we had access to the conditional distribution of Y given X. This is the
world of the probabilist. The statistician does not have access to this valuable
information but rather has to estimate it, at least partially. The regression
function gives a simple summary of this conditional distribution, namely, the
conditional expectation.
Formally, the regression function of Y onto X is defined by:
f (x) = IE[Y |X = x] ,
x ∈ X .
As we will see, it arises naturally in the context of prediction.
Best prediction and prediction risk
Suppose for a moment that you know the conditional distribution of Y given
X. Given the realization of X = x, your goal is to predict the realization of
Y . Intuitively, we would like to find a measurable1 function g : X → Y such
that g(X) is close to Y , in other words, such that |Y − g(X)| is small. But
|Y − g(X)| is a random variable, so it not clear what “small” means in this
context. A somewhat arbitrary answer can be given by declaring a random
1all topological spaces are equipped with their Borel σ-algebra
1