The Work, Computer Age Statistical Inference, was first published by Cambridge University Press.
© in the Work, Bradley Efron and Trevor Hastie, 2016.
Cambridge University Press’s catalogue entry for the Work can be found at http://www.cambridge.org/9781107149892
NB: The copy of the Work, as displayed on this website, can be purchased through Cambridge University
Press and other standard distribution channels. This copy is made available for personal use only and must
not be adapted, sold or re-distributed.
Corrected November 10, 2017.
The twenty-first century has seen a breathtaking expansion of statistical methodology, both in scope and in influence. “Big data,” “data science,” and “machine learning” have become familiar terms in the news, as statistical methods are brought to bear upon the enormous data sets of modern science and commerce. How did we get here? And where are we going?

This book takes us on an exhilarating journey through the revolution in data analysis following the introduction of electronic computation in the 1950s. Beginning with classical inferential theories – Bayesian, frequentist, Fisherian – individual chapters take up a series of influential topics: survival analysis, logistic regression, empirical Bayes, the jackknife and bootstrap, random forests, neural networks, Markov chain Monte Carlo, inference after model selection, and dozens more. The distinctly modern approach integrates methodology and algorithms with statistical inference. The book ends with speculation on the future direction of statistics and data science.

“How and why is computational statistics taking over the world? In this serious work of synthesis that is also fun to read, Efron and Hastie give their take on the unreasonable effectiveness of statistics and machine learning in the context of a series of clear, historically informed examples.”
Andrew Gelman, Columbia University

“Computer Age Statistical Inference is written especially for those who want to hear the big ideas, and see them instantiated through the essential mathematics that defines statistical analysis. It makes a great supplement to the traditional curricula for beginning graduate students.”
Rob Kass, Carnegie Mellon University

“This is a terrific book. It gives a clear, accessible, and entertaining account of the interplay between theory and methodological development that has driven statistics in the computer age. The authors succeed brilliantly in locating contemporary algorithmic methodologies for analysis of ‘big data’ within the framework of established statistical theory.”
Alastair Young, Imperial College London

“This is a guided tour of modern statistics that emphasizes the conceptual and computational advances of the last century. Authored by two masters of the field, it offers just the right mix of mathematical analysis and insightful commentary.”
Hal Varian, Google

“Efron and Hastie guide us through the maze of breakthrough statistical methodologies following the computing evolution: why they were developed, their properties, and how they are used. Highlighting their origins, the book helps us understand each method’s roles in inference and/or prediction.”
Galit Shmueli, National Tsing Hua University

“A masterful guide to how the inferential bases of classical statistics can provide a principled disciplinary frame for the data science of the twenty-first century.”
Stephen Stigler, University of Chicago, author of Seven Pillars of Statistical Wisdom

“A refreshing view of modern statistics. Algorithmics are put on equal footing with intuition, properties, and the abstract arguments behind them. The methods covered are indispensable to practicing statistical analysts in today’s big data and big computing landscape.”
Robert Gramacy, The University of Chicago Booth School of Business

Bradley Efron is Max H. Stein Professor, Professor of Statistics, and Professor of Biomedical Data Science at Stanford University. He has held visiting faculty appointments at Harvard, UC Berkeley, and Imperial College London. Efron has worked extensively on theories of statistical inference, and is the inventor of the bootstrap sampling technique. He received the National Medal of Science in 2005 and the Guy Medal in Gold of the Royal Statistical Society in 2014.

Trevor Hastie is John A. Overdeck Professor, Professor of Statistics, and Professor of Biomedical Data Science at Stanford University. He is coauthor of Elements of Statistical Learning, a key text in the field of modern data analysis. He is also known for his work on generalized additive models and principal curves, and for his contributions to the R computing environment. Hastie was awarded the Emmanuel and Carol Parzen prize for Statistical Innovation in 2014.

Institute of Mathematical Statistics Monographs
Editorial Board:
D. R. Cox (University of Oxford)
B. Hambly (University of Oxford)
S. Holmes (Stanford University)
J. Wellner (University of Washington)

Cover illustration: Pacific Ocean wave, North Shore, Oahu, Hawaii. © Brian Sytnyk / Getty Images.
Cover designed by Zoe Naylor.
Printed in the United Kingdom.
Computer Age Statistical Inference
Algorithms, Evidence, and Data Science
Bradley Efron
Trevor Hastie
Stanford University
To Donna and Lynda
Contents
Preface
Acknowledgments
Notation
Part I Classic Statistical Inference
1 Algorithms and Inference
1.1 A Regression Example
1.2 Hypothesis Testing
1.3 Notes
2 Frequentist Inference
2.1 Frequentism in Practice
2.2 Frequentist Optimality
2.3 Notes and Details
3 Bayesian Inference
3.1 Two Examples
3.2 Uninformative Prior Distributions
3.3 Flaws in Frequentist Inference
3.4 A Bayesian/Frequentist Comparison List
3.5 Notes and Details
4 Fisherian Inference and Maximum Likelihood Estimation
4.1 Likelihood and Maximum Likelihood
4.2 Fisher Information and the MLE
4.3 Conditional Inference
4.4 Permutation and Randomization
4.5 Notes and Details
5 Parametric Models and Exponential Families
5.1 Univariate Families
5.2 The Multivariate Normal Distribution
5.3 Fisher’s Information Bound for Multiparameter Families
5.4 The Multinomial Distribution
5.5 Exponential Families
5.6 Notes and Details
Part II Early Computer-Age Methods
6 Empirical Bayes
6.1 Robbins’ Formula
6.2 The Missing-Species Problem
6.3 A Medical Example
6.4 Indirect Evidence 1
6.5 Notes and Details
7 James–Stein Estimation and Ridge Regression
7.1 The James–Stein Estimator
7.2 The Baseball Players
7.3 Ridge Regression
7.4 Indirect Evidence 2
7.5 Notes and Details
8 Generalized Linear Models and Regression Trees
8.1 Logistic Regression
8.2 Generalized Linear Models
8.3 Poisson Regression
8.4 Regression Trees
8.5 Notes and Details
9 Survival Analysis and the EM Algorithm
9.1 Life Tables and Hazard Rates
9.2 Censored Data and the Kaplan–Meier Estimate
9.3 The Log-Rank Test
9.4 The Proportional Hazards Model
9.5 Missing Data and the EM Algorithm
9.6 Notes and Details
10 The Jackknife and the Bootstrap
10.1 The Jackknife Estimate of Standard Error
10.2 The Nonparametric Bootstrap
10.3 Resampling Plans