A Solution Manual and Notes for:
The Elements of Statistical Learning
by Jerome Friedman, Trevor Hastie,
and Robert Tibshirani
John L. Weatherwax∗
David Epstein†
15 July 2017
Introduction
The Elements of Statistical Learning is an influential and widely studied book in the fields
of machine learning, statistical inference, and pattern recognition. It is a standard recom-
mended text in many graduate courses on these topics. It is also very challenging, particularly
if one faces it without the support of teachers who are expert in the subject matter. For
various reasons, both authors of these notes felt the need to understand the book well, and
therefore to produce notes on the text when we found the text difficult at first reading,
and answers to the exercises. Gaining understanding is time-consuming and intellectually
demanding, so it seemed sensible to record our efforts in LaTeX, and make it available on
the web to other readers. A valuable by-product of writing up mathematical material is
that one often finds gaps and errors in what one has written.
Now it is well-known in all branches of learning, but in particular in mathematical learning,
that the way to learn is to do, rather than to read. It is all too common to read through
some material, convince oneself that one understands it well, but then find oneself at sea
when trying to apply it in even a slightly different situation. Moreover, material understood
only at a shallow level is easily forgotten.
It is therefore our strong recommendation that readers of the book should not look at our
responses to any of the exercises before making a substantial effort to solve the exercise
without this aid. Any time spent in this way, even if it ends without success, will make our solutions
∗wax@alum.mit.edu
†David.Epstein@warwick.ac.uk
easier to understand and more memorable. Quite likely you will find better solutions than
those provided here—let us know when you do!
As far as teachers of courses based on the text are concerned, our notes may be regarded as
a disadvantage, because it may be felt that the exercises in the book can no longer be used
as a source of homework or questions for tests and examinations. This is a slight conflict
of interest between teachers of courses on the one hand, and independent learners on the
other hand. Finding ourselves in the ranks of the independent learners, the position we take
is hardly surprising. Teachers of courses might benefit by comparing their own solutions
with ours, as might students in classes and independent learners. If you, the reader, find
a problem difficult and become stuck, our notes might enable you to unstick yourself. In
addition, there is a myriad of materials on any of the topics this book covers. A search for
“statistical learning theory” on Google (as of 1 January 2010) gave over 4 million hits. The
possible disadvantage of not being able to use the book’s problems in an academic course
is not really such a large one. Obtaining additional supplementary problems at the right
level for an academic course is simply a matter of a visit to the library, or a little time spent
surfing the net. And, of course, although one is not supposed to say this, teachers of courses
can also get stuck.
Acknowledgments
We hope that people will find errors, offer insightful suggestions, and generally improve
the general understanding of the text, the exercises, and our notes, and that they will write to us
about this. We will incorporate additions that we think are helpful. Our plan is to gradually
add material as we work through the book. Comments and criticisms are always welcome.
Special thanks to (most recent comments are listed first): Liuzhou Zhuo for his comments
on Exercise 3.25 and Ruchi Dhiman for his comments on Chapter 4.
We use the numbering found in the on-line (second edition) version of this text. The first
edition should be similar but may have some differences.
Chapter 2 (Overview of Supervised Learning)
Statistical Decision Theory
We assume a linear model: that is, we assume y = f(x) + ε, where ε is a random variable
with mean 0 and variance σ^2, and f(x) = x^T β. Our expected predicted error (EPE) under
the squared error loss is
EPE(β) = ∫ (y − x^T β)^2 Pr(dx, dy) .   (1)
We regard this expression as a function of β, a column vector of length p + 1. In order to
find the value of β for which it is minimized, we equate to zero the vector derivative with
respect to β. We have
∂EPE/∂β = ∫ 2 (y − x^T β)(−1) x Pr(dx, dy) = −2 ∫ (y − x^T β) x Pr(dx, dy) .   (2)
Now this expression has two parts. The first has integrand yx and the second has integrand
(xT β)x.
Before proceeding, we make a quick general remark about matrices. Suppose that A, B and
C are matrices of sizes 1 × p, p × 1 and q × 1 respectively, where p and q are positive
integers. Then AB can be regarded as a scalar, and we have (AB)C = C(AB), each of these
expressions meaning that each component of C is multiplied by the scalar AB. If q > 1,
the expressions BC, A(BC) and ABC are meaningless, and we must avoid writing them.
On the other hand, CAB is meaningful as a product of three matrices, and the result is the
q × 1 matrix (AB)C = C(AB) = CAB. In our situation we obtain (xT β)x = xxT β.
From ∂EPE/∂β = 0 we deduce

E[yx] − E[x x^T β] = 0   (3)

for the particular value of β that minimizes the EPE. Since this value of β is a constant, it
can be taken out of the expectation to give

β = E[x x^T]^{−1} E[yx] ,

which gives Equation 2.16 in the book.
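Equation 2.16 can be checked numerically by replacing the population expectations with sample averages. The following sketch is our own; the dimensions, coefficient values, and seed are arbitrary choices.

```python
import numpy as np

# Our numerical sketch of Equation 2.16 (dimensions, coefficients and
# seed are arbitrary): replace E[x x^T] and E[y x] by sample averages
# over a large simulated sample and solve for beta.
rng = np.random.default_rng(0)
p = 3
beta_true = np.array([2.0, -1.0, 0.5])
N = 200_000
X = rng.normal(size=(N, p))                # rows are draws of x
y = X @ beta_true + rng.normal(size=N)     # y = x^T beta + eps

Exx = X.T @ X / N                          # sample estimate of E[x x^T]
Eyx = X.T @ y / N                          # sample estimate of E[y x]
beta_hat = np.linalg.solve(Exx, Eyx)       # E[x x^T]^{-1} E[y x]
print(beta_hat)                            # close to beta_true
```

With this many samples the moment estimates are accurate to a few decimal places, so beta_hat essentially recovers beta_true.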
We now discuss some points around Equations 2.26 and 2.27. We have
β̂ = (X^T X)^{−1} X^T y = (X^T X)^{−1} X^T (Xβ + ε) = β + (X^T X)^{−1} X^T ε .

So

ŷ_0 = x_0^T β̂ = x_0^T β + x_0^T (X^T X)^{−1} X^T ε .   (4)

This immediately gives

ŷ_0 = x_0^T β + ∑_{i=1}^N ℓ_i(x_0) ε_i ,   (5)
where ℓi(x0) is the i-th element of the N-dimensional column vector X(X T X)−1x0, as stated
at the bottom of page 24 of the book.
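The identity in Equation 5 is exact, so it can be verified directly in a small simulation. This is our own sketch; all dimensions and the seed are arbitrary.

```python
import numpy as np

# Our sketch verifying Equation 5 exactly (up to floating point): with
# beta_hat = (X^T X)^{-1} X^T y and y = X beta + eps, the prediction
# x0^T beta_hat equals x0^T beta + sum_i ell_i(x0) eps_i, where
# ell(x0) = X (X^T X)^{-1} x0. Dimensions and seed are arbitrary.
rng = np.random.default_rng(1)
N, p = 50, 4
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
eps = rng.normal(size=N)
y = X @ beta + eps
x0 = rng.normal(size=p)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ell = X @ np.linalg.solve(X.T @ X, x0)     # the N weights ell_i(x0)
lhs = x0 @ beta_hat
rhs = x0 @ beta + ell @ eps
assert np.isclose(lhs, rhs)
```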
We now consider Equations 2.27 and 2.28. The variation is over all training sets T, and over
all values of y_0, while keeping x_0 fixed. Note that x_0 and y_0 are chosen independently of T,
and so the expectations commute: E_{y_0|x_0} E_T = E_T E_{y_0|x_0}. Also E_T = E_X E_{Y|X}.
We write y_0 − ŷ_0 as the sum of three terms:

y_0 − ŷ_0 = (y_0 − x_0^T β) − (ŷ_0 − E_T(ŷ_0)) − (E_T(ŷ_0) − x_0^T β) = U_1 − U_2 − U_3 .   (6)

In order to prove Equations 2.27 and 2.28, we need to square the expression in Equation 6
and then apply various expectation operators. First we consider the properties of each of
the three terms U_i in Equation 6. We have E_{y_0|x_0} U_1 = 0 and E_T U_1 = U_1 E_T. Despite the
notation, ŷ_0 does not involve y_0. So E_{y_0|x_0} U_2 = U_2 E_{y_0|x_0} and clearly E_T U_2 = 0. Equation 5
gives

U_3 = E_T(ŷ_0) − x_0^T β = x_0^T E_X[(X^T X)^{−1} X^T E_{Y|X}(ε)] = 0   (7)

since the expectation of the length-N vector ε is zero. This shows that the bias U_3 is zero.
We now square the remaining part of Equation 6 and then apply E_{y_0|x_0} E_T. The cross-term
U_1 U_2 gives zero, since E_{y_0|x_0}(U_1 U_2) = U_2 E_{y_0|x_0}(U_1) = 0. (This works in the same way
if E_T replaces E_{y_0|x_0}.)

We are left with two squared terms, and the definition of variance enables us to deal imme-
diately with the first of these: E_{y_0|x_0} E_T U_1^2 = Var(y_0|x_0) = σ^2. It remains to deal with the
term E_T(ŷ_0 − E_T ŷ_0)^2 = Var_T(ŷ_0) in Equation 2.27. Since the bias U_3 = 0, we know that
E_T ŷ_0 = x_0^T β.
If m is the 1 × 1 matrix with entry μ, then mm^T is the 1 × 1 matrix with entry μ^2. It follows
from Equation 5 that the variance term in which we are interested is equal to

E_T[ x_0^T (X^T X)^{−1} X^T ε ε^T X (X^T X)^{−1} x_0 ] .

Since E_T = E_X E_{Y|X}, and the expectation of εε^T is σ^2 I_N, this is equal to

σ^2 x_0^T E_T[(X^T X)^{−1}] x_0 = σ^2 x_0^T E_X[(X^T X/N)^{−1}] x_0 / N .   (8)
We suppose, as stated by the authors, that the mean of the distribution giving rise to X
and x_0 is zero. For large N, X^T X/N is then approximately equal to Cov(X) = Cov(x_0),
the p × p variance-covariance matrix of the p components of a typical sample
vector x; as far as E_X is concerned, this is a constant. Applying E_{x_0} to Equation 8 as in
Equation 2.28, we obtain (approximately)

σ^2 E_{x_0}[x_0^T Cov(X)^{−1} x_0]/N = σ^2 E_{x_0} trace(x_0^T Cov(X)^{−1} x_0)/N
  = σ^2 E_{x_0} trace(Cov(X)^{−1} x_0 x_0^T)/N
  = σ^2 trace(Cov(X)^{−1} Cov(x_0))/N
  = σ^2 trace(I_p)/N
  = σ^2 p/N .   (9)
This completes our discussion of Equations 2.27 and 2.28.
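As a sanity check on the conclusion σ^2 p/N (plus the irreducible noise σ^2 in y_0), one can estimate the expected prediction error by Monte Carlo. This is our own sketch, with arbitrary choices of N, p, σ and seed.

```python
import numpy as np

# Our own Monte Carlo sketch (values of N, p, sigma and the seed are
# arbitrary): estimate E_{x0} EPE(x0) for least squares when x has
# mean zero, and compare with sigma^2 * p / N + sigma^2, where the
# extra sigma^2 is the irreducible noise in y0.
rng = np.random.default_rng(2)
N, p, sigma, trials = 100, 5, 1.0, 10_000
beta = rng.normal(size=p)

sq_err = 0.0
for _ in range(trials):
    X = rng.normal(size=(N, p))            # training inputs, mean zero
    y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    x0 = rng.normal(size=p)                # test point, same distribution
    y0 = x0 @ beta + sigma * rng.normal()
    sq_err += (y0 - x0 @ beta_hat) ** 2
epe = sq_err / trials

print(epe, sigma**2 * (1 + p / N))         # both close to 1.05
```

The agreement is only approximate, both because of Monte Carlo noise and because the large-N replacement of X^T X/N by Cov(X) is itself an approximation.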
Notes on Local Methods in High Dimensions
The most common error metric used to compare different predictions of the true (but un-
known) mapping value f(x_0) is the mean square error (MSE). The unknown in the
above discussion is the specific mapping function f(·), which can be obtained via
different methods, many of which are discussed in the book. In supervised learning, to help
with the construction of an appropriate prediction ŷ_0, we have access to a set of “training
samples” that contains an element of randomness, in that these points are not under the complete
control of the experimenter. One could ask how much square error our prediction at the
input point x_0 will have on average when we consider all possible training sets T.
We can compute this by inserting the “expected value of the predictor obtained over all training
sets”, E_T(ŷ_0), into the definition of quadratic (MSE) error as
MSE(x_0) = E_T[f(x_0) − ŷ_0]^2
  = E_T[ŷ_0 − E_T(ŷ_0) + E_T(ŷ_0) − f(x_0)]^2
  = E_T[(ŷ_0 − E_T(ŷ_0))^2 + 2(ŷ_0 − E_T(ŷ_0))(E_T(ŷ_0) − f(x_0)) + (E_T(ŷ_0) − f(x_0))^2]
  = E_T[(ŷ_0 − E_T(ŷ_0))^2] + (E_T(ŷ_0) − f(x_0))^2 .   (10)

Here we have expanded the quadratic, distributed the expectation across all terms, and
noted that the middle term vanishes, since

E_T[2(ŷ_0 − E_T(ŷ_0))(E_T(ŷ_0) − f(x_0))] = 2(E_T(ŷ_0) − f(x_0)) E_T[ŷ_0 − E_T(ŷ_0)] = 0 .
The first term in the above expression ET [(ˆy0 − ET (ˆy0))2] is the variance of our estimator ˆy0
and the second term (ET (ˆy0) − f (x0))2 is the bias (squared) of our estimator. This notion
of variance and bias with respect to our estimate ˆy0 is to be understood relative to possible
training sets, T , and the specific computational method used in computing the estimate ˆy0
given that training set.
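Equation 10 can be illustrated numerically. The sketch below is our own construction; it uses a deliberately biased estimator, a shrunken sample mean, so that both the variance and the squared-bias terms are non-zero.

```python
import numpy as np

# Our sketch of Equation 10: MSE = variance + bias^2. We estimate
# f(x0) = mu by the deliberately biased estimator 0.9 * (sample mean),
# so the bias is -0.1 * mu and both terms are visible. All constants
# are arbitrary choices.
rng = np.random.default_rng(3)
mu, n, trials = 2.0, 20, 200_000
samples = rng.normal(mu, 1.0, size=(trials, n))
est = 0.9 * samples.mean(axis=1)            # one estimate per "training set"

mse = np.mean((mu - est) ** 2)
variance = est.var()                        # first term of Equation 10
bias_sq = (est.mean() - mu) ** 2            # second term of Equation 10
assert np.isclose(mse, variance + bias_sq)  # exact, up to rounding
```

The decomposition holds exactly for the sample moments; the bias here is −0.1μ = −0.2, so bias_sq is close to 0.04.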
Exercise Solutions
Ex. 2.1 (target coding)
The authors have suppressed the context here, making the question a little mysterious. For
example, why use the notation ¯y instead of simply y? We imagine that the background is
something like the following. We have some input data x. Some algorithm assigns to x the
probability yk that x is a member of the k-th class. This would explain why we are told to
assume that the sum of the yk is equal to one. (But, if that reason is valid, then we should
also have been told that each yk ≥ 0.) In fact, neither of these two assumptions is necessary
to provide an answer to the question. The hyphen in K-classes seems to be a misprint, and
should be omitted.
We restate the question, clarifying it, but distancing it even further from its origins. Let
K > 0 be an integer. For each k with 1 ≤ k ≤ K, let tk be the K-dimensional vector that
has 1 in the k-th position and 0 elsewhere. Then, for any K-dimensional vector y, the k for
which yk is largest coincides with the k for which tk is nearest to y.
By expanding the quadratic we find that

argmin_k ||y − t_k|| = argmin_k ||y − t_k||^2
  = argmin_k ∑_{i=1}^K (y_i − (t_k)_i)^2
  = argmin_k ∑_{i=1}^K [ (y_i)^2 − 2 y_i (t_k)_i + ((t_k)_i)^2 ]
  = argmin_k ∑_{i=1}^K [ −2 y_i (t_k)_i + ((t_k)_i)^2 ] ,

since the sum ∑_{i=1}^K (y_i)^2 is the same for all classes k. Notice that, for each k, the sum
∑_{i=1}^K ((t_k)_i)^2 = 1. Also ∑_i y_i (t_k)_i = y_k. This means that

argmin_k ||y − t_k|| = argmin_k (−2 y_k + 1)
  = argmin_k (−2 y_k)
  = argmax_k y_k .
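The conclusion argmin_k ||y − t_k|| = argmax_k y_k is easy to confirm by brute force; note that, as remarked above, neither ∑ y_k = 1 nor y_k ≥ 0 is needed. This sketch is our own.

```python
import numpy as np

# Our brute-force check of Exercise 2.1: for random vectors y (not
# required to be probabilities), the nearest one-hot target t_k is the
# one indexed by the largest component of y.
rng = np.random.default_rng(4)
K = 5
targets = np.eye(K)                        # rows are t_1, ..., t_K

for _ in range(1000):
    y = rng.normal(size=K)                 # ties have probability zero
    dists = np.linalg.norm(y - targets, axis=1)
    assert dists.argmin() == y.argmax()
```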
Ex. 2.2 (the oracle revealed)
For this problem one is supposed to regard the points pi and qi below as fixed. If one does
not do this, and instead averages over possible choices, then since the controlling points are
(1,0) and (0,1), and all probabilities are otherwise symmetric between these points when one
integrates out all the variation, the answer must be that the boundary is the perpendicular
bisector of the interval joining these two points.
The simulation draws 10 points p_1, . . . , p_10 ∈ R^2 from N((1, 0)^T, I_2) and 10 points q_1, . . . , q_10 ∈
R^2 from N((0, 1)^T, I_2). The formula for the Bayes decision boundary is given by equating
likelihoods. We get an equation in the unknown z ∈ R^2, giving a curve in the plane:

∑_i exp(−5 ||p_i − z||^2 / 2) = ∑_j exp(−5 ||q_j − z||^2 / 2) .
In this solution, the boundary is given as the equation of equality between the two proba-
bilities, with the pi and qj constant and fixed by previously performed sampling. Each time
one re-samples the pi and qj, one obtains a different Bayes decision boundary.
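The boundary equation can be explored numerically. The sketch below is our own construction, with hypothetical fixed points standing in for the p_i and q_j (five per class for brevity, mirrored across the line y = x so that the answer is predictable); it finds one boundary point by bisection.

```python
import numpy as np

# Our sketch (hypothetical fixed points, not a draw from the text):
# locate one point on the Bayes decision boundary by bisecting the
# likelihood difference along the segment joining the generating
# means (1, 0) and (0, 1).
p_pts = np.array([[1.2, 0.1], [0.8, -0.3], [1.5, 0.4], [0.6, 0.2], [1.1, -0.5]])
q_pts = p_pts[:, ::-1].copy()              # mirror images across y = x

def likelihood_diff(z):
    """sum_i exp(-5|p_i-z|^2/2) - sum_j exp(-5|q_j-z|^2/2)."""
    dp = np.exp(-2.5 * np.sum((p_pts - z) ** 2, axis=1)).sum()
    dq = np.exp(-2.5 * np.sum((q_pts - z) ** 2, axis=1)).sum()
    return dp - dq

# The difference is positive near (1,0) and negative near (0,1), so a
# sign change exists along the segment; bisect on the parameter t.
za, zb = np.array([1.0, 0.0]), np.array([0.0, 1.0])
a, b = 0.0, 1.0
for _ in range(60):
    m = 0.5 * (a + b)
    zm = (1 - m) * za + m * zb
    if likelihood_diff(zm) > 0:
        a = m
    else:
        b = m
print(zm)  # by the mirror symmetry of this toy data, close to (0.5, 0.5)
```

Repeating the bisection along many parallel segments would trace out the whole curve; re-sampling the points gives a different curve each time, as noted above.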
Ex. 2.3 (the median distance to the origin)
We denote the N-tuple of data points by (x1, . . . , xN ). Let ri = kxik. Let U(A) be the set
of all N-tuples with A < r1 < . . . < rN < 1. Ignoring subsets of measure zero, the set of all
N-tuples is the disjoint union of the N! different subsets obtained from U(0) by permuting
the indexing set (1, . . . , N). We will look for A > 0 such that the measure of U(A) is half
the measure of U(0). The same A will work for each of our N! disjoint subsets, and will
therefore give the median for the distance of the smallest xi from the origin.
We want to find A such that

∫_{U(A)} dx_1 . . . dx_N = (1/2) ∫_{U(0)} dx_1 . . . dx_N .
We convert to spherical coordinates. Since the coordinate in the unit sphere S^{p−1} contributes
the same constant factor on each side of the equality, we may replace each dx_i by r_i^{p−1} dr_i, up
to a constant. Substituting s_i = r_i^p (so that ds_i is proportional to r_i^{p−1} dr_i) turns the
left-hand integral into a constant multiple of the volume of the region

{ A^p < s_1 < · · · < s_N < 1 } ,

which is an N-dimensional simplex scaled down by a factor (1 − A^p). The right-hand integral
is dealt with in the same way, setting A = 0. Since the region of integration is N-dimensional,
the measure is multiplied by (1 − A^p)^N. We solve for A by solving (1 − A^p)^N = 1/2. We
obtain A = (1 − 2^{−1/N})^{1/p}, as required.
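The formula A = (1 − 2^{−1/N})^{1/p} can be checked by Monte Carlo. This sketch is our own; the values of p, N and the seed are arbitrary.

```python
import math
import random

# Our Monte Carlo sketch of Exercise 2.3 (p, N and the seed are
# arbitrary): the median distance from the origin to the closest of
# N points drawn uniformly from the unit ball in R^p should match
# (1 - 2^(-1/N))^(1/p).
random.seed(6)
p, N, trials = 3, 10, 4000

def uniform_ball_radius():
    """Radius of a uniform point in the unit ball of R^p, by rejection."""
    while True:
        x = [random.uniform(-1.0, 1.0) for _ in range(p)]
        r2 = sum(c * c for c in x)
        if r2 <= 1.0:
            return math.sqrt(r2)

min_radii = sorted(min(uniform_ball_radius() for _ in range(N))
                   for _ in range(trials))
median_mc = min_radii[trials // 2]         # sample median of the minimum
median_formula = (1 - 2 ** (-1 / N)) ** (1 / p)
print(median_mc, median_formula)           # both near 0.41
```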
Ex. 2.4 (projections a^T x are distributed as normal N(0, 1))
The main point is that ∑_i x_i^2 is invariant under the orthogonal group. As a consequence
the standard normal distribution exists on any finite dimensional inner product space (by
fixing an orthonormal basis). Further, if Rp is written as the orthogonal sum of two vector
subspaces, then the product of standard normal distributions on each of the subspaces gives
the standard normal distribution on Rp. Everything else follows from this.
The on-line edition is correct except that √10 ≈ 3.162278. So, the figure given should be
3.2 instead of 3.1. Note that the version of this question posed in the first edition makes
incorrect claims. The first edition talks of the “center of the training points” and the on-line
edition talks of the “origin”. The answers are very different. This is shown up best by taking
only one training point.
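The distributional claim is easy to check by simulation for one fixed unit vector a. This sketch is our own; the dimension, sample size and seed are arbitrary.

```python
import numpy as np

# Our sketch for Exercise 2.4 (dimension, sample size and seed are
# arbitrary): for x ~ N(0, I_p) and any fixed unit vector a, the
# projection a^T x is N(0, 1); by rotational invariance the direction
# of a does not matter.
rng = np.random.default_rng(7)
p, n = 10, 100_000
a = rng.normal(size=p)
a /= np.linalg.norm(a)                     # normalise to a unit vector
proj = rng.normal(size=(n, p)) @ a         # n draws of a^T x

print(proj.mean(), proj.var())             # near 0 and 1 respectively
```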
Ex. 2.5 (the expected prediction error under least squares)
Part (a): In order to discuss Equation (2.27), we go back to (2.26) on page 24. We have
y = Xβ + ε, where y and ε are N × 1, X is N × p, and β is p × 1. Hence

β̂ = (X^T X)^{−1} X^T y = β + (X^T X)^{−1} X^T ε .

Since X and ε are independent variables, E_T(ε) = 0 and E_T(εε^T) = σ^2 I_N, we have E_T(β̂) = β
and

Var_T(β̂) = E_T(β̂ β̂^T) − E_T(β̂) E_T(β̂)^T = (X^T X)^{−1} σ^2 .
Now we prove (2.27) on page 26. Note that y_0 is constant for the distribution T. Note also
that, if x_0 is held constant, ŷ_0 = x_0^T β̂ does not depend on y_0, and so the same is true for
E_T(ŷ_0) and Var_T(ŷ_0). Let u = E_{y_0|x_0}(y_0) = x_0^T β. Then

E_T(y_0 − ŷ_0)^2 = (y_0^2 − u^2) + (E_T(ŷ_0^2) − (E_T ŷ_0)^2) + ((E_T ŷ_0)^2 − 2 y_0 E_T ŷ_0 + u^2) .

Therefore

E_{y_0|x_0} E_T(y_0 − ŷ_0)^2 = Var(y_0|x_0) + Var_T(ŷ_0) + (E_T(ŷ_0) − u)^2 .

We have

E_T(ŷ_0) = x_0^T E_T(β̂) = x_0^T β = u .
Part (b): If A is a p × q matrix and B is a q × p matrix, then it is standard that trace(AB) =
trace(BA). Note that x_0 is p × 1 and X is N × p, so that x_0^T (X^T X)^{−1} x_0 is 1 × 1 and is therefore
equal to its own trace. Therefore

E_{x_0}[x_0^T (X^T X)^{−1} x_0] = trace( E_{x_0}[ x_0 x_0^T (X^T X)^{−1} ] ) ,

which is approximately equal to trace( Cov(x_0) Cov(X)^{−1} )/N = trace(I_p)/N = p/N.