Springer Texts in Statistics
Gareth James
Daniela Witten
Trevor Hastie
Robert Tibshirani
An Introduction
to Statistical
Learning
with Applications in R
An Introduction to Statistical Learning provides an accessible overview of the field
of statistical learning, an essential toolset for making sense of the vast and complex
data sets that have emerged in fields ranging from biology to finance to marketing to
astrophysics in the past twenty years. This book presents some of the most important
modeling and prediction techniques, along with relevant applications. Topics include
linear regression, classification, resampling methods, shrinkage approaches, tree-based
methods, support vector machines, clustering, and more. Color graphics and real-world
examples are used to illustrate the methods presented. Since the goal of this textbook
is to facilitate the use of these statistical learning techniques by practitioners in
science, industry, and other fields, each chapter contains a tutorial on implementing the
analyses and methods presented in R, an extremely popular open source statistical
software platform.
Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani
and Friedman, 2nd edition 2009), a popular reference book for statistics and machine
learning researchers. An Introduction to Statistical Learning covers many of the same
topics, but at a level accessible to a much broader audience. This book is targeted at
statisticians and non-statisticians alike who wish to use cutting-edge statistical
learning techniques to analyze their data. The text assumes only a previous course in
linear regression and no knowledge of matrix algebra.
Gareth James is a professor of statistics at University of Southern California. He has
published an extensive body of methodological work in the domain of statistical
learning with particular emphasis on high-dimensional and functional data. The conceptual
framework for this book grew out of his MBA elective courses in this area.
Daniela Witten is an assistant professor of biostatistics at University of Washington. Her
research focuses largely on high-dimensional statistical machine learning. She has
contributed to the translation of statistical learning techniques to the field of genomics,
through collaborations and as a member of the Institute of Medicine committee that
led to the report Evolution of Translational Omics.
Trevor Hastie and Robert Tibshirani are professors of statistics at Stanford University, and
are co-authors of the successful textbook Elements of Statistical Learning. Hastie and
Tibshirani developed generalized additive models and wrote a popular book of that
title. Hastie co-developed much of the statistical modeling software and environment
in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso
and is co-author of the very successful An Introduction to the Bootstrap.
Statistics
ISBN 978-1-4614-7137-0
Springer Texts in Statistics
103
Series Editors:
G. Casella
S. Fienberg
I. Olkin
For further volumes:
http://www.springer.com/series/417
Gareth James • Daniela Witten • Trevor Hastie
Robert Tibshirani
An Introduction to
Statistical Learning
with Applications in R
Gareth James
Department of Information and
Operations Management
University of Southern California
Los Angeles, CA, USA
Trevor Hastie
Department of Statistics
Stanford University
Stanford, CA, USA
Daniela Witten
Department of Biostatistics
University of Washington
Seattle, WA, USA
Robert Tibshirani
Department of Statistics
Stanford University
Stanford, CA, USA
ISSN 1431-875X
ISBN 978-1-4614-7137-0
ISBN 978-1-4614-7138-7 (eBook)
DOI 10.1007/978-1-4614-7138-7
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2013936251
© Springer Science+Business Media New York 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of the
Copyright Law of the Publisher’s location, in its current version, and permission for use must always
be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To our parents:
Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani
and to our families:
Michael, Daniel, and Catherine
Ari
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl
Preface
Statistical learning refers to a set of tools for modeling and understanding
complex datasets. It is a recently developed area in statistics and blends
with parallel developments in computer science and, in particular, machine
learning. The field encompasses many methods such as the lasso and sparse
regression, classification and regression trees, and boosting and support
vector machines.
With the explosion of “Big Data” problems, statistical learning has
become a very hot field in many scientific areas as well as marketing,
finance, and other business disciplines. People with statistical learning
skills are in high demand.
One of the first books in this area, The Elements of Statistical Learning
(ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a
second edition in 2009. ESL has become a popular text not only in
statistics but also in related fields. One of the reasons for ESL’s
popularity is its relatively accessible style. But ESL is intended for
individuals with advanced training in the mathematical sciences. An
Introduction to Statistical Learning (ISL) arose from the perceived need
for a broader and less technical treatment of these topics. In this new
book, we cover many of the same topics as ESL, but we concentrate more on
the applications of the methods and less on the mathematical details. We
have created labs illustrating how to implement each of the statistical
learning methods using the popular statistical software package R. These
labs provide the reader with valuable hands-on experience.
This book is appropriate for advanced undergraduates or master’s
students in statistics or related quantitative fields or for individuals in other