Pattern Recognition and Machine Learning (Bishop, 2006)

COVER
Preface
Mathematical notation
Contents
1. Introduction
1.1. Example: Polynomial Curve Fitting
1.2. Probability Theory
1.2.1 Probability densities
1.2.2 Expectations and covariances
1.2.3 Bayesian probabilities
1.2.4 The Gaussian distribution
1.2.5 Curve fitting re-visited
1.2.6 Bayesian curve fitting
1.3. Model Selection
1.4. The Curse of Dimensionality
1.5. Decision Theory
1.5.1 Minimizing the misclassification rate
1.5.2 Minimizing the expected loss
1.5.3 The reject option
1.5.4 Inference and decision
1.5.5 Loss functions for regression
1.6. Information Theory
1.6.1 Relative entropy and mutual information
Exercises
2. Probability Distributions
2.1. Binary Variables
2.1.1 The beta distribution
2.2. Multinomial Variables
2.2.1 The Dirichlet distribution
2.3. The Gaussian Distribution
2.3.1 Conditional Gaussian distributions
2.3.2 Marginal Gaussian distributions
2.3.3 Bayes’ theorem for Gaussian variables
2.3.4 Maximum likelihood for the Gaussian
2.3.5 Sequential estimation
2.3.6 Bayesian inference for the Gaussian
2.3.7 Student’s t-distribution
2.3.8 Periodic variables
2.3.9 Mixtures of Gaussians
2.4. The Exponential Family
2.4.1 Maximum likelihood and sufficient statistics
2.4.2 Conjugate priors
2.4.3 Noninformative priors
2.5. Nonparametric Methods
2.5.1 Kernel density estimators
2.5.2 Nearest-neighbour methods
Exercises
3. Linear Models for Regression
3.1. Linear Basis Function Models
3.1.1 Maximum likelihood and least squares
3.1.2 Geometry of least squares
3.1.3 Sequential learning
3.1.4 Regularized least squares
3.1.5 Multiple outputs
3.2. The Bias-Variance Decomposition
3.3. Bayesian Linear Regression
3.3.1 Parameter distribution
3.3.2 Predictive distribution
3.3.3 Equivalent kernel
3.4. Bayesian Model Comparison
3.5. The Evidence Approximation
3.5.1 Evaluation of the evidence function
3.5.2 Maximizing the evidence function
3.5.3 Effective number of parameters
3.6. Limitations of Fixed Basis Functions
Exercises
4. Linear Models for Classification
4.1. Discriminant Functions
4.1.1 Two classes
4.1.2 Multiple classes
4.1.3 Least squares for classification
4.1.4 Fisher’s linear discriminant
4.1.5 Relation to least squares
4.1.6 Fisher’s discriminant for multiple classes
4.1.7 The perceptron algorithm
4.2. Probabilistic Generative Models
4.2.1 Continuous inputs
4.2.2 Maximum likelihood solution
4.2.3 Discrete features
4.2.4 Exponential family
4.3. Probabilistic Discriminative Models
4.3.1 Fixed basis functions
4.3.2 Logistic regression
4.3.3 Iterative reweighted least squares
4.3.4 Multiclass logistic regression
4.3.5 Probit regression
4.3.6 Canonical link functions
4.4. The Laplace Approximation
4.4.1 Model comparison and BIC
4.5. Bayesian Logistic Regression
4.5.1 Laplace approximation
4.5.2 Predictive distribution
Exercises
5. Neural Networks
5.1. Feed-forward Network Functions
5.1.1 Weight-space symmetries
5.2. Network Training
5.2.1 Parameter optimization
5.2.2 Local quadratic approximation
5.2.3 Use of gradient information
5.2.4 Gradient descent optimization
5.3. Error Backpropagation
5.3.1 Evaluation of error-function derivatives
5.3.2 A simple example
5.3.3 Efficiency of backpropagation
5.3.4 The Jacobian matrix
5.4. The Hessian Matrix
5.4.1 Diagonal approximation
5.4.2 Outer product approximation
5.4.3 Inverse Hessian
5.4.4 Finite differences
5.4.5 Exact evaluation of the Hessian
5.4.6 Fast multiplication by the Hessian
5.5. Regularization in Neural Networks
5.5.1 Consistent Gaussian priors
5.5.2 Early stopping
5.5.3 Invariances
5.5.4 Tangent propagation
5.5.5 Training with transformed data
5.5.6 Convolutional networks
5.5.7 Soft weight sharing
5.6. Mixture Density Networks
5.7. Bayesian Neural Networks
5.7.1 Posterior parameter distribution
5.7.2 Hyperparameter optimization
5.7.3 Bayesian neural networks for classification
Exercises
6. Kernel Methods
6.1. Dual Representations
6.2. Constructing Kernels
6.3. Radial Basis Function Networks
6.3.1 Nadaraya-Watson model
6.4. Gaussian Processes
6.4.1 Linear regression revisited
6.4.2 Gaussian processes for regression
6.4.3 Learning the hyperparameters
6.4.4 Automatic relevance determination
6.4.5 Gaussian processes for classification
6.4.6 Laplace approximation
6.4.7 Connection to neural networks
Exercises
7. Sparse Kernel Machines
7.1. Maximum Margin Classifiers
7.1.1 Overlapping class distributions
7.1.2 Relation to logistic regression
7.1.3 Multiclass SVMs
7.1.4 SVMs for regression
7.1.5 Computational learning theory
7.2. Relevance Vector Machines
7.2.1 RVM for regression
7.2.2 Analysis of sparsity
7.2.3 RVM for classification
Exercises
8. Graphical Models
8.1. Bayesian Networks
8.1.1 Example: Polynomial regression
8.1.2 Generative models
8.1.3 Discrete variables
8.1.4 Linear-Gaussian models
8.2. Conditional Independence
8.2.1 Three example graphs
8.2.2 D-separation
8.3. Markov Random Fields
8.3.1 Conditional independence properties
8.3.2 Factorization properties
8.3.3 Illustration: Image de-noising
8.3.4 Relation to directed graphs
8.4. Inference in Graphical Models
8.4.1 Inference on a chain
8.4.2 Trees
8.4.3 Factor graphs
8.4.4 The sum-product algorithm
8.4.5 The max-sum algorithm
8.4.6 Exact inference in general graphs
8.4.7 Loopy belief propagation
8.4.8 Learning the graph structure
Exercises
9. Mixture Models and EM
9.1. K-means Clustering
9.1.1 Image segmentation and compression
9.2. Mixtures of Gaussians
9.2.1 Maximum likelihood
9.2.2 EM for Gaussian mixtures
9.3. An Alternative View of EM
9.3.1 Gaussian mixtures revisited
9.3.2 Relation to K-means
9.3.3 Mixtures of Bernoulli distributions
9.3.4 EM for Bayesian linear regression
9.4. The EM Algorithm in General
Exercises
10. Approximate Inference
10.1. Variational Inference
10.1.1 Factorized distributions
10.1.2 Properties of factorized approximations
10.1.3 Example: The univariate Gaussian
10.1.4 Model comparison
10.2. Illustration: Variational Mixture of Gaussians
10.2.1 Variational distribution
10.2.2 Variational lower bound
10.2.3 Predictive density
10.2.4 Determining the number of components
10.2.5 Induced factorizations
10.3. Variational Linear Regression
10.3.1 Variational distribution
10.3.2 Predictive distribution
10.3.3 Lower bound
10.4. Exponential Family Distributions
10.4.1 Variational message passing
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.6.1 Variational posterior distribution
10.6.2 Optimizing the variational parameters
10.6.3 Inference of hyperparameters
10.7. Expectation Propagation
10.7.1 Example: The clutter problem
10.7.2 Expectation propagation on graphs
Exercises
11. Sampling Methods
11.1. Basic Sampling Algorithms
11.1.1 Standard distributions
11.1.2 Rejection sampling
11.1.3 Adaptive rejection sampling
11.1.4 Importance sampling
11.1.5 Sampling-importance-resampling
11.1.6 Sampling and the EM algorithm
11.2. Markov Chain Monte Carlo
11.2.1 Markov chains
11.2.2 The Metropolis-Hastings algorithm
11.3. Gibbs Sampling
11.4. Slice Sampling
11.5. The Hybrid Monte Carlo Algorithm
11.5.1 Dynamical systems
11.5.2 Hybrid Monte Carlo
11.6. Estimating the Partition Function
Exercises
12. Continuous Latent Variables
12.1. Principal Component Analysis
12.1.1 Maximum variance formulation
12.1.2 Minimum-error formulation
12.1.3 Applications of PCA
12.1.4 PCA for high-dimensional data
12.2. Probabilistic PCA
12.2.1 Maximum likelihood PCA
12.2.2 EM algorithm for PCA
12.2.3 Bayesian PCA
12.2.4 Factor analysis
12.3. Kernel PCA
12.4. Nonlinear Latent Variable Models
12.4.1 Independent component analysis
12.4.2 Autoassociative neural networks
12.4.3 Modelling nonlinear manifolds
Exercises
13. Sequential Data
13.1. Markov Models
13.2. Hidden Markov Models
13.2.1 Maximum likelihood for the HMM
13.2.2 The forward-backward algorithm
13.2.3 The sum-product algorithm for the HMM
13.2.4 Scaling factors
13.2.5 The Viterbi algorithm
13.2.6 Extensions of the hidden Markov model
13.3. Linear Dynamical Systems
13.3.1 Inference in LDS
13.3.2 Learning in LDS
13.3.3 Extensions of LDS
13.3.4 Particle filters
Exercises
14. Combining Models
14.1. Bayesian Model Averaging
14.2. Committees
14.3. Boosting
14.3.1 Minimizing exponential error
14.3.2 Error functions for boosting
14.4. Tree-based Models
14.5. Conditional Mixture Models
14.5.1 Mixtures of linear regression models
14.5.2 Mixtures of logistic models
14.5.3 Mixtures of experts
Exercises
Appendix A. Data Sets
Appendix B. Probability Distributions
Appendix C. Properties of Matrices
Appendix D. Calculus of Variations
Appendix E. Lagrange Multipliers
References
Index
Information Science and Statistics
Series Editors: M. Jordan, J. Kleinberg, B. Schölkopf
Information Science and Statistics
Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.
Jensen: Bayesian Networks and Decision Graphs.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.
Christopher M. Bishop
Pattern Recognition and Machine Learning
Christopher M. Bishop, F.R.Eng.
Assistant Director, Microsoft Research Ltd, Cambridge CB3 0FB, U.K.
cmbishop@microsoft.com
http://research.microsoft.com/~cmbishop

Series Editors:
Michael Jordan, Department of Computer Science and Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Professor Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Library of Congress Control Number: 2006922522
ISBN-10: 0-387-31073-8
ISBN-13: 978-0387-31073-2
Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in Singapore. (KYO)
9 8 7 6 5 4 3 2 1
springer.com
This book is dedicated to my family: Jenna, Mark, and Hugh

Total eclipse of the sun, Antalya, Turkey, 29 March 2006.
Preface

Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had significant impact on both algorithms and applications.

This new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources.

The book is supported by a great deal of additional material, including lecture slides as well as the complete set of figures used in the book, and the reader is encouraged to visit the book web site for the latest information:

http://research.microsoft.com/~cmbishop/PRML
Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in significant ways, and each is graded according to difficulty ranging from (⋆), which denotes a simple exercise taking a few minutes to complete, through to (⋆ ⋆ ⋆), which denotes a significantly more complex exercise.

It has been difficult to know to what extent these solutions should be made widely available. Those engaged in self study will find worked solutions very beneficial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conflicting requirements, those exercises that help amplify key points in the text, or that fill in important details, have solutions that are available as a PDF file from the book web site. Such exercises are denoted by www. Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required.

Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book.

Acknowledgements

First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of figures and with the typesetting of the book in LaTeX. His assistance has been invaluable.

I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its affiliates).

Springer has provided excellent support throughout the final stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi.

I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17.