Springer Texts in Statistics
Series Editors:
G. Casella
S. Fienberg
I. Olkin
For other titles published in this series, go to
www.springer.com/series/417
Alan Julian Izenman
Modern Multivariate
Statistical Techniques
Regression, Classification,
and Manifold Learning
Alan J. Izenman
Department of Statistics
Temple University
Speakman Hall
Philadelphia, PA 19122
USA
alan@temple.edu
Editorial Board
George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA
Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA
Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA
ISSN: 1431-875X
ISBN: 978-0-387-78188-4
DOI: 10.1007/978-0-387-78189-1
e-ISBN: 978-0-387-78189-1
Library of Congress Control Number: 2008928720.
© 2008 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use
in connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they
are not identified as such, is not to be taken as an expression of opinion as to whether or not they are
subject to proprietary rights.
Printed on acid-free paper
springer.com
This book is dedicated
to the memory of my parents,
Kitty and Larry,
and to my family,
Betty-Ann and Kayla
Preface
Not so long ago, multivariate analysis consisted solely of linear methods
illustrated on small to medium-sized data sets. Moreover, statistical com-
puting meant primarily batch processing (often using boxes of punched
cards) carried out on a mainframe computer at a remote computer facil-
ity. During the 1970s, interactive computing was just beginning to raise its
head, and exploratory data analysis was a new idea. In the decades since
then, we have witnessed a number of remarkable developments in local
computing power and data storage. Huge quantities of data are being col-
lected, stored, and efficiently managed, and interactive statistical software
packages enable sophisticated data analyses to be carried out effortlessly.
These advances enabled new disciplines called data mining and machine
learning to be created and developed by researchers in computer science
and statistics.
As enormous data sets become the norm rather than the exception, sta-
tistics as a scientific discipline is changing to keep up with this development.
Instead of the traditional heavy reliance on hypothesis testing, attention
is now being focused on information or knowledge discovery. Accordingly,
some of the recent advances in multivariate analysis include techniques
from computer science, artificial intelligence, and machine learning theory.
Many of these new techniques are still in their infancy, waiting for statistical
theory to catch up.
The origins of some of these techniques are purely algorithmic, whereas
the more traditional techniques were derived through modeling, optimiza-
tion, or probabilistic reasoning. As such algorithmic techniques mature, it
becomes necessary to build a solid statistical framework within which to
embed them. In some instances, it may not be at all obvious why a partic-
ular technique (such as a complex algorithm) works as well as it does:
When new ideas are being developed, the most fruitful approach
is often to let rigor rest for a while, and let intuition reign — at
least in the beginning. New methods may require new concepts
and new approaches, in extreme cases even a new language, and
it may then be impossible to describe such ideas precisely in the
old language.
— Inge S. Helland, 2000
It is hoped that this book will be enjoyed by those who wish to under-
stand the current state of multivariate statistical analysis in an age of high-
speed computation and large data sets. This book mixes new algorithmic
techniques for analyzing large multivariate data sets with some of the more
classical multivariate techniques. Yet, even the classical methods are not
given only standard treatments here; many of them are also derived as spe-
cial cases of a common theoretical framework (multivariate reduced-rank
regression) rather than separately through different approaches. Another
major feature of this book is the novel data sets that are used as examples
to illustrate the techniques.
I have included as much statistical theory as I believe is necessary to
understand the development of ideas, plus details of certain computational
algorithms; historical notes on the various topics have also been added
wherever possible (usually in the Bibliographical Notes at the end of each
chapter) to help the reader gain some perspective on the subject matter.
References at the end of the book should be considered as extensive without
being exhaustive.
Some common abbreviations used in this book should be noted: “iid”
means independently and identically distributed; “wrt” means with respect
to; and “lhs” and “rhs” mean left- and right-hand side, respectively.
Audience
This book is directed toward advanced undergraduate students, gradu-
ate students, and researchers in statistics, computer science, artificial in-
telligence, psychology, neural and cognitive sciences, business, medicine,
bioinformatics, and engineering. As prerequisites, readers are expected to
have had previous knowledge of probability, statistical theory and methods,
multivariable calculus, and linear/matrix algebra. Because vectors and ma-
trices play such a major role in multivariate analysis, Chapter 3 gives the
matrix notation used in the book and many important advanced concepts
in matrix theory. Along with a background in classical statistical theory