Springer Texts in Statistics
Series Editors:
G. Casella
S. Fienberg
I. Olkin
For other titles published in this series, go to
www.springer.com/series/417
Alan Julian Izenman
Modern Multivariate
Statistical Techniques
Regression, Classification,
and Manifold Learning
Alan J. Izenman
Department of Statistics
Temple University
Speakman Hall
Philadelphia, PA 19122
USA
alan@temple.edu
Editorial Board
George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA
Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA
Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA
ISSN: 1431-875X
ISBN: 978-0-387-78188-4
DOI: 10.1007/978-0-387-78189-1
e-ISBN: 978-0-387-78189-1
Library of Congress Control Number: 2008928720
© 2008 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they
are not identified as such, is not to be taken as an expression of opinion as to whether or not they are
subject to proprietary rights.
Printed on acid-free paper
springer.com
This book is dedicated
to the memory of my parents,
Kitty and Larry,
and to my family,
Betty-Ann and Kayla
Preface
Not so long ago, multivariate analysis consisted solely of linear methods
illustrated on small to medium-sized data sets. Moreover, statistical com-
puting meant primarily batch processing (often using boxes of punched
cards) carried out on a mainframe computer at a remote computer facil-
ity. During the 1970s, interactive computing was just beginning to raise its
head, and exploratory data analysis was a new idea. In the decades since
then, we have witnessed a number of remarkable developments in local
computing power and data storage. Huge quantities of data are being col-
lected, stored, and efficiently managed, and interactive statistical software
packages enable sophisticated data analyses to be carried out effortlessly.
These advances enabled new disciplines called data mining and machine
learning to be created and developed by researchers in computer science
and statistics.
As enormous data sets become the norm rather than the exception, sta-
tistics as a scientific discipline is changing to keep up with this development.
Instead of the traditional heavy reliance on hypothesis testing, attention
is now being focused on information or knowledge discovery. Accordingly,
some of the recent advances in multivariate analysis include techniques
from computer science, artificial intelligence, and machine learning theory.
Many of these new techniques are still in their infancy, waiting for statistical
theory to catch up.
The origins of some of these techniques are purely algorithmic, whereas
the more traditional techniques were derived through modeling, optimiza-
tion, or probabilistic reasoning. As such algorithmic techniques mature, it
becomes necessary to build a solid statistical framework within which to
embed them. In some instances, it may not be at all obvious why a partic-
ular technique (such as a complex algorithm) works as well as it does:
When new ideas are being developed, the most fruitful approach
is often to let rigor rest for a while, and let intuition reign — at
least in the beginning. New methods may require new concepts
and new approaches, in extreme cases even a new language, and
it may then be impossible to describe such ideas precisely in the
old language.
— Inge S. Helland, 2000
It is hoped that this book will be enjoyed by those who wish to under-
stand the current state of multivariate statistical analysis in an age of high-
speed computation and large data sets. This book mixes new algorithmic
techniques for analyzing large multivariate data sets with some of the more
classical multivariate techniques. Yet, even the classical methods are not
given only standard treatments here; many of them are also derived as spe-
cial cases of a common theoretical framework (multivariate reduced-rank
regression) rather than separately through different approaches. Another
major feature of this book is the novel data sets that are used as examples
to illustrate the techniques.
I have included as much statistical theory as I believe is necessary to
understand the development of ideas, plus details of certain computational
algorithms; historical notes on the various topics have also been added
wherever possible (usually in the Bibliographical Notes at the end of each
chapter) to help the reader gain some perspective on the subject matter.
References at the end of the book should be considered as extensive without
being exhaustive.
Some common abbreviations used in this book should be noted: “iid”
means independently and identically distributed; “wrt” means with respect
to; and “lhs” and “rhs” mean left- and right-hand side, respectively.
Audience
This book is directed toward advanced undergraduate students, gradu-
ate students, and researchers in statistics, computer science, artificial in-
telligence, psychology, neural and cognitive sciences, business, medicine,
bioinformatics, and engineering. As prerequisites, readers are expected to
have had previous knowledge of probability, statistical theory and methods,
multivariable calculus, and linear/matrix algebra. Because vectors and ma-
trices play such a major role in multivariate analysis, Chapter 3 gives the
matrix notation used in the book and many important advanced concepts
in matrix theory. Along with a background in classical statistical theory
and methods, it would also be helpful if the reader had some exposure to
Bayesian ideas in statistics.
There are various types of courses for which this book can be used, in-
cluding courses in data mining, machine learning, and computational statis-
tics, as well as a traditional course in multivariate analysis. Sections of this book have
been used at Temple University as the basis of lectures in a one-semester
course in applied multivariate analysis given to statistics and graduate business
students (where technical derivations are skipped and emphasis is placed
on the examples and computational algorithms) and a two-semester course
in advanced topics in statistics given to graduate students from statistics,
computer science, and engineering. I am grateful for their feedback (includ-
ing spotting typos and inconsistencies).
Although there is enough material in this book for a two-semester course,
a one-semester course in traditional multivariate analysis can be drawn
from the material in Sections 1.1–1.3, 2.1–2.3, 2.5, 2.6, 3.1–3.5, 5.1–5.7, 6.1–
6.3, 7.1–7.3, 8.1–8.7, 12.1–12.4, 13.1–13.9, 15.4, and 17.1–17.4; additional
parts of the book can be used as appropriate.
Software
Software for computing the techniques described in this book is publicly
available either through routines in major computer packages or through
download from Internet websites. I have used primarily the R, S-Plus, and
Matlab packages in writing this book. In the Software Packages section at
the ends of certain chapters, I have listed the relevant R/S-Plus routines
for the respective chapter as well as the appropriate toolboxes in Matlab.
I have also tried to indicate other major packages wherever relevant.
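As a minimal sketch of the workflow assumed here (the package and data set named below are illustrative examples, not tied to any particular chapter), contributed R routines are typically obtained from CRAN and loaded into a session as follows:

    # Install a contributed package from CRAN and attach it to the session;
    # MASS is one widely used package for multivariate methods.
    install.packages("MASS")
    library(MASS)
    data(crabs)      # a small multivariate data set shipped with MASS
    head(crabs)      # inspect the first few rows

Equivalent routines in S-Plus and the corresponding toolboxes in Matlab are listed in the Software Packages sections of the relevant chapters.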
Data Sets
The many data sets that illustrate the multivariate techniques presented
in this book were obtained from a wide variety of sources and disciplines and
will be made available through the book’s website. Disciplines from which
the data were obtained include astronomy, bioinformatics, botany, chemo-
metrics, criminology, food science, forensic science, genetics, geoscience,
medicine, philately, physical anthropology, psychology, soil science, sports,
and steganography. Part of the learning process for the reader is to become
familiar with the classic data sets that are associated with each technique.
In particular, data sets from popular data repositories are used to compare
and contrast methodologies. Examples in the book involve small data sets
(if a particular point or computation needs clarifying) and large data sets
(to see the power of the techniques in question).
Exercises
At the end of every chapter (except Chapter 1), there are a number of
exercises designed to make the reader (a) relate the problem to the text and
fill in the technical details omitted in the development of certain techniques,