Statistics and Computing
James E. Gentle
Computational Statistics
Statistics and Computing
Series Editors:
J. Chambers
D. Hand
W. Härdle
For other titles published in this series, go to
http://www.springer.com/series/3022
J.E. Gentle
Department of Computational & Data Sciences
George Mason University
4400 University Drive
Fairfax, VA 22030-4444
USA
jgentle@gmu.edu
Series Editors:
J. Chambers
Department of Statistics
Sequoia Hall
390 Serra Mall
Stanford University
Stanford, CA 94305-4065
D. Hand
Department of Mathematics
Imperial College London,
South Kensington Campus
London SW7 2AZ
United Kingdom
W. Härdle
Institut für Statistik
und Ökonometrie
Humboldt-Universität
zu Berlin
Spandauer Str. 1
D-10178 Berlin
Germany
ISBN 978-0-387-98143-7
e-ISBN 978-0-387-98144-4
DOI 10.1007/978-0-387-98144-4
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2009929633
© Springer Science+Business Media, LLC 2009
All rights reserved.
This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY
10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection
with any form of information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To María
Preface
This book began as a revision of Elements of Computational Statistics, published by Springer in 2002. That book covered computationally intensive statistical methods from the perspective of statistical applications, rather than from the standpoint of statistical computing.
Most of the students in my courses in computational statistics were in a
program that required multiple graduate courses in numerical analysis, and so
in my course in computational statistics, I rarely covered topics in numerical
linear algebra or numerical optimization, for example. Over the years, how-
ever, I included more discussion of numerical analysis in my computational
statistics courses. Also over the years I have taught numerical methods courses
with no or very little statistical content. I have also accumulated a number
of corrections and small additions to the elements of computational statistics.
The present book includes most of the topics from Elements and also incorporates this additional material. The emphasis is still on computationally intensive statistical methods, but there is a substantial portion on the numerical methods supporting the statistical applications.
I have attempted to provide a broad coverage of the field of computational
statistics. This obviously comes at the price of depth.
Part I, consisting of one rather long chapter, presents some of the most
important concepts and facts over a wide range of topics in intermediate-level
mathematics, probability and statistics, so that when I refer to these concepts
in later parts of the book, the reader has a frame of reference.
Part I attempts to convey the attitude that computational inference, to-
gether with exact inference and asymptotic inference, is an important com-
ponent of statistical methods.
Many statements in Part I are made without any supporting argument,
but references and notes are given at the end of the chapter. Most readers
and students in courses in statistical computing or computational statistics
will be familiar with a substantial proportion of the material in Part I, but I
do not recommend skipping the chapter. If readers are already familiar with
the material, they should just read faster. The perspective in this chapter is
that of the “big picture”. As is often apparent in oral exams, many otherwise
good students lack a basic understanding of what it is all about.
A danger is that the student or the instructor using the book as a text
will too quickly gloss over Chapter 1 and miss some subtle points.
Part II addresses statistical computing, a topic dear to my heart. There are
many details of the computations that can be ignored by most statisticians,
but no matter at what level of detail a statistician needs to be familiar with
the computational topics of Part II, there are two simple, higher-level facts
that all statisticians should be aware of and which I state often in this book:
Computer numbers are not the same as real numbers, and the arith-
metic operations on computer numbers are not exactly the same as
those of ordinary arithmetic.
and
The form of a mathematical expression and the way the expression
should be evaluated in actual practice may be quite different.
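Both facts can be illustrated with a short computation. The following is only a sketch in Python with NumPy (the model matrix and data here are invented for illustration; they are not from this book):

```python
import numpy as np

# Fact one: computer numbers are not real numbers.  Neither 0.1 nor
# 0.2 has an exact binary floating-point representation, so their
# computed sum differs slightly from the real number 0.3.
print(0.1 + 0.2 == 0.3)   # False

# Fact two: the form of an expression and the way it should be
# evaluated may differ.  We may *write* b = (X^T X)^{-1} X^T y,
# but a stable computation never forms (X^T X)^{-1}; it solves the
# least squares problem directly (here via a QR-based solver).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.01 * rng.normal(size=100)

b_written = np.linalg.inv(X.T @ X) @ (X.T @ y)     # the written form
b_computed = np.linalg.lstsq(X, y, rcond=None)[0]  # the stable computation
print(np.allclose(b_written, b_computed))          # True
```

On a well-conditioned problem the two agree; when X is nearly collinear, forming XᵀX squares the condition number and the written form loses accuracy first.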
Regarding the first statement, some of the differences between real numbers and computer numbers are summarized in Table 2.1 on page 98.
A prime example of the second statement is the use of the normal equations in linear regression, XᵀXb = Xᵀy. It is quite appropriate to write and discuss these equations. We might consider the elements of XᵀX, and we might even write the least squares estimate of β as b = (XᵀX)⁻¹Xᵀy. That does not mean that we ever actually compute XᵀX or (XᵀX)⁻¹, although we may compute functions of those matrices or even certain elements of them.
The most important areas of statistical computing (to me) are
• computer number systems
• algorithms and programming
• function approximation and numerical quadrature
• numerical linear algebra
• solution of nonlinear equations and optimization
• generation of random numbers.
These topics are the subjects of the individual chapters of Part II.
Part III, in six relatively short chapters, addresses methods and techniques of computational statistics. I think that any exploration of data should begin with graphics, and the first chapter in Part III, Chapter 8, presents a brief overview of some graphical methods, especially those concerned with multidimensional data. The more complicated the structure of the data and the higher the dimension, the more ingenuity is required for visualization of the data; it is, however, in just those situations that graphics is most important. The orientation of the discussion on graphics is that of computational statistics; the emphasis is on discovery, and the important issues that should be considered in making presentation graphics are not addressed.
Chapter 9 discusses methods of projecting higher-dimensional data into
lower dimensions. The tools discussed in Chapter 9 will also be used in Part IV
for clustering and classification, and, in general, for exploring structure in
data. Chapter 10 covers some of the general issues in function estimation,
building on the material in Chapter 4 on function approximation.
Chapter 11 is about Monte Carlo simulation and some of its uses in com-
putational inference, including Monte Carlo tests, in which artificial data are
generated according to a hypothesis. Chapters 12 and 13 discuss computa-
tional inference using resampling and partitioning of a given dataset. In these
methods, a given dataset is used, but Monte Carlo sampling is employed re-
peatedly on the data. These methods include randomization tests, jackknife
techniques, and bootstrap methods, in which data are generated from the
empirical distribution of a given sample, that is, the sample is resampled.
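The resampling idea takes only a few lines to express. This sketch of a nonparametric bootstrap for the standard error of a sample mean is an illustration under invented data, not code from the book:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=50)   # the given dataset

# Bootstrap: draw repeatedly, with replacement, from the empirical
# distribution of the sample -- that is, resample the sample itself.
B = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

se_boot = boot_means.std(ddof=1)                        # bootstrap estimate
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)  # textbook formula
print(round(se_boot, 3), round(se_formula, 3))
```

For the mean, the bootstrap estimate closely tracks the familiar s/√n formula; the value of the method is that the same resampling loop works for statistics with no simple standard-error formula.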
Identification of interesting features, or “structure”, in data is an impor-
tant activity in computational statistics. In Part IV, I consider the problem of
identification of structure and the general problem of estimation of probability
densities. In simple cases, or as approximations in more realistic situations,
structure may be described in terms of functional relationships among the
variables in a dataset.
The most useful and complete description of a random data-generating
process is the associated probability density, if it exists. Estimation of this
special type of function is the topic of Chapters 14 and 15, building on gen-
eral methods discussed in earlier chapters, especially Chapter 10. If the data
follow a parametric distribution, or rather, if we are willing to assume that
the data follow a parametric distribution, identification of the probability den-
sity is accomplished by estimation of the parameters. Nonparametric density
estimation is considered in Chapter 15.
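The parametric and nonparametric routes can be contrasted in a few lines. This sketch (invented data; Silverman's rule-of-thumb bandwidth is just one common default, not a recommendation from this book) fits a normal family by estimating its parameters and, alongside it, a Gaussian-kernel density estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.5, size=200)

# Parametric route: assume a normal family and estimate its parameters.
mu_hat, sigma_hat = data.mean(), data.std(ddof=1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Nonparametric route: kernel density estimate with a Gaussian kernel
# and Silverman's rule-of-thumb bandwidth.
h = 1.06 * sigma_hat * data.size ** (-0.2)

def kde(x):
    x = np.asarray(x, dtype=float)
    # Average one Gaussian bump centered at each observation.
    return normal_pdf(x[:, None], data[None, :], h).mean(axis=1)

# Both estimates are highest near the center of the data.
print(kde([3.0])[0] > kde([8.0])[0])   # True
```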
Features of interest in data include clusters of observations and relation-
ships among variables that allow a reduction in the dimension of the data.
I discuss methods for statistical learning in Chapter 16, building on some of
the basic measures introduced in Chapter 9.
Higher-dimensional data have some surprising and counterintuitive proper-
ties, and I discuss some of the interesting characteristics of higher dimensions.
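One such counterintuitive property, the concentration of pairwise distances, can be seen numerically. This is a small sketch with uniform random points (the sample sizes and dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(dim, n=200):
    """(max - min) / mean of pairwise distances among n uniform points."""
    pts = rng.uniform(size=(n, dim))
    g = pts @ pts.T                           # Gram matrix of the points
    sq = np.diag(g)
    d2 = sq[:, None] + sq[None, :] - 2.0 * g  # squared pairwise distances
    d = np.sqrt(np.clip(d2, 0.0, None))
    d = d[np.triu_indices(n, k=1)]            # distinct pairs only
    return (d.max() - d.min()) / d.mean()

# As the dimension grows, pairwise distances concentrate: the nearest
# and farthest points become nearly equidistant.
print(round(relative_spread(2), 2), round(relative_spread(1000), 2))
```

In two dimensions the spread is large; in a thousand dimensions it collapses toward zero, which is one reason nearest-neighbor notions lose discriminating power in high dimensions.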
In Chapter 17, I discuss asymmetric relationships among variables. For
such problems, the objective often is to estimate or predict a response for
a given set of explanatory or predictive variables, or to identify the class
to which an observation belongs. The approach is to use a given dataset to
develop a model or a set of rules that can be applied to new data. Statistical
modeling may be computationally intensive because of the number of possible
forms considered or because of the recursive partitioning of the data used in
selecting a model. In computational statistics, the emphasis is on building a
model rather than just estimating the parameters in the model. Parametric
estimation, of course, plays an important role in building models.
Many of the topics addressed in this book could easily be (and are) sub-
jects for full-length books. My intent is to describe these methods in a general