Pattern Recognition and Machine Learning (Bishop, 2006)

COVER
Preface
Mathematical notation
Contents
1. Introduction
1.1. Example: Polynomial Curve Fitting
1.2. Probability Theory
1.2.1 Probability densities
1.2.2 Expectations and covariances
1.2.3 Bayesian probabilities
1.2.4 The Gaussian distribution
1.2.5 Curve fitting re-visited
1.2.6 Bayesian curve fitting
1.3. Model Selection
1.4. The Curse of Dimensionality
1.5. Decision Theory
1.5.1 Minimizing the misclassification rate
1.5.2 Minimizing the expected loss
1.5.3 The reject option
1.5.4 Inference and decision
1.5.5 Loss functions for regression
1.6. Information Theory
1.6.1 Relative entropy and mutual information
Exercises
2. Probability Distributions
2.1. Binary Variables
2.1.1 The beta distribution
2.2. Multinomial Variables
2.2.1 The Dirichlet distribution
2.3. The Gaussian Distribution
2.3.1 Conditional Gaussian distributions
2.3.2 Marginal Gaussian distributions
2.3.3 Bayes’ theorem for Gaussian variables
2.3.4 Maximum likelihood for the Gaussian
2.3.5 Sequential estimation
2.3.6 Bayesian inference for the Gaussian
2.3.7 Student’s t-distribution
2.3.8 Periodic variables
2.3.9 Mixtures of Gaussians
2.4. The Exponential Family
2.4.1 Maximum likelihood and sufficient statistics
2.4.2 Conjugate priors
2.4.3 Noninformative priors
2.5. Nonparametric Methods
2.5.1 Kernel density estimators
2.5.2 Nearest-neighbour methods
Exercises
3. Linear Models for Regression
3.1. Linear Basis Function Models
3.1.1 Maximum likelihood and least squares
3.1.2 Geometry of least squares
3.1.3 Sequential learning
3.1.4 Regularized least squares
3.1.5 Multiple outputs
3.2. The Bias-Variance Decomposition
3.3. Bayesian Linear Regression
3.3.1 Parameter distribution
3.3.2 Predictive distribution
3.3.3 Equivalent kernel
3.4. Bayesian Model Comparison
3.5. The Evidence Approximation
3.5.1 Evaluation of the evidence function
3.5.2 Maximizing the evidence function
3.5.3 Effective number of parameters
3.6. Limitations of Fixed Basis Functions
Exercises
4. Linear Models for Classification
4.1. Discriminant Functions
4.1.1 Two classes
4.1.2 Multiple classes
4.1.3 Least squares for classification
4.1.4 Fisher’s linear discriminant
4.1.5 Relation to least squares
4.1.6 Fisher’s discriminant for multiple classes
4.1.7 The perceptron algorithm
4.2. Probabilistic Generative Models
4.2.1 Continuous inputs
4.2.2 Maximum likelihood solution
4.2.3 Discrete features
4.2.4 Exponential family
4.3. Probabilistic Discriminative Models
4.3.1 Fixed basis functions
4.3.2 Logistic regression
4.3.3 Iterative reweighted least squares
4.3.4 Multiclass logistic regression
4.3.5 Probit regression
4.3.6 Canonical link functions
4.4. The Laplace Approximation
4.4.1 Model comparison and BIC
4.5. Bayesian Logistic Regression
4.5.1 Laplace approximation
4.5.2 Predictive distribution
Exercises
5. Neural Networks
5.1. Feed-forward Network Functions
5.1.1 Weight-space symmetries
5.2. Network Training
5.2.1 Parameter optimization
5.2.2 Local quadratic approximation
5.2.3 Use of gradient information
5.2.4 Gradient descent optimization
5.3. Error Backpropagation
5.3.1 Evaluation of error-function derivatives
5.3.2 A simple example
5.3.3 Efficiency of backpropagation
5.3.4 The Jacobian matrix
5.4. The Hessian Matrix
5.4.1 Diagonal approximation
5.4.2 Outer product approximation
5.4.3 Inverse Hessian
5.4.4 Finite differences
5.4.5 Exact evaluation of the Hessian
5.4.6 Fast multiplication by the Hessian
5.5. Regularization in Neural Networks
5.5.1 Consistent Gaussian priors
5.5.2 Early stopping
5.5.3 Invariances
5.5.4 Tangent propagation
5.5.5 Training with transformed data
5.5.6 Convolutional networks
5.5.7 Soft weight sharing
5.6. Mixture Density Networks
5.7. Bayesian Neural Networks
5.7.1 Posterior parameter distribution
5.7.2 Hyperparameter optimization
5.7.3 Bayesian neural networks for classification
Exercises
6. Kernel Methods
6.1. Dual Representations
6.2. Constructing Kernels
6.3. Radial Basis Function Networks
6.3.1 Nadaraya-Watson model
6.4. Gaussian Processes
6.4.1 Linear regression revisited
6.4.2 Gaussian processes for regression
6.4.3 Learning the hyperparameters
6.4.4 Automatic relevance determination
6.4.5 Gaussian processes for classification
6.4.6 Laplace approximation
6.4.7 Connection to neural networks
Exercises
7. Sparse Kernel Machines
7.1. Maximum Margin Classifiers
7.1.1 Overlapping class distributions
7.1.2 Relation to logistic regression
7.1.3 Multiclass SVMs
7.1.4 SVMs for regression
7.1.5 Computational learning theory
7.2. Relevance Vector Machines
7.2.1 RVM for regression
7.2.2 Analysis of sparsity
7.2.3 RVM for classification
Exercises
8. Graphical Models
8.1. Bayesian Networks
8.1.1 Example: Polynomial regression
8.1.2 Generative models
8.1.3 Discrete variables
8.1.4 Linear-Gaussian models
8.2. Conditional Independence
8.2.1 Three example graphs
8.2.2 D-separation
8.3. Markov Random Fields
8.3.1 Conditional independence properties
8.3.2 Factorization properties
8.3.3 Illustration: Image de-noising
8.3.4 Relation to directed graphs
8.4. Inference in Graphical Models
8.4.1 Inference on a chain
8.4.2 Trees
8.4.3 Factor graphs
8.4.4 The sum-product algorithm
8.4.5 The max-sum algorithm
8.4.6 Exact inference in general graphs
8.4.7 Loopy belief propagation
8.4.8 Learning the graph structure
Exercises
9. Mixture Models and EM
9.1. K-means Clustering
9.1.1 Image segmentation and compression
9.2. Mixtures of Gaussians
9.2.1 Maximum likelihood
9.2.2 EM for Gaussian mixtures
9.3. An Alternative View of EM
9.3.1 Gaussian mixtures revisited
9.3.2 Relation to K-means
9.3.3 Mixtures of Bernoulli distributions
9.3.4 EM for Bayesian linear regression
9.4. The EM Algorithm in General
Exercises
10. Approximate Inference
10.1. Variational Inference
10.1.1 Factorized distributions
10.1.2 Properties of factorized approximations
10.1.3 Example: The univariate Gaussian
10.1.4 Model comparison
10.2. Illustration: Variational Mixture of Gaussians
10.2.1 Variational distribution
10.2.2 Variational lower bound
10.2.3 Predictive density
10.2.4 Determining the number of components
10.2.5 Induced factorizations
10.3. Variational Linear Regression
10.3.1 Variational distribution
10.3.2 Predictive distribution
10.3.3 Lower bound
10.4. Exponential Family Distributions
10.4.1 Variational message passing
10.5. Local Variational Methods
10.6. Variational Logistic Regression
10.6.1 Variational posterior distribution
10.6.2 Optimizing the variational parameters
10.6.3 Inference of hyperparameters
10.7. Expectation Propagation
10.7.1 Example: The clutter problem
10.7.2 Expectation propagation on graphs
Exercises
11. Sampling Methods
11.1. Basic Sampling Algorithms
11.1.1 Standard distributions
11.1.2 Rejection sampling
11.1.3 Adaptive rejection sampling
11.1.4 Importance sampling
11.1.5 Sampling-importance-resampling
11.1.6 Sampling and the EM algorithm
11.2. Markov Chain Monte Carlo
11.2.1 Markov chains
11.2.2 The Metropolis-Hastings algorithm
11.3. Gibbs Sampling
11.4. Slice Sampling
11.5. The Hybrid Monte Carlo Algorithm
11.5.1 Dynamical systems
11.5.2 Hybrid Monte Carlo
11.6. Estimating the Partition Function
Exercises
12. Continuous Latent Variables
12.1. Principal Component Analysis
12.1.1 Maximum variance formulation
12.1.2 Minimum-error formulation
12.1.3 Applications of PCA
12.1.4 PCA for high-dimensional data
12.2. Probabilistic PCA
12.2.1 Maximum likelihood PCA
12.2.2 EM algorithm for PCA
12.2.3 Bayesian PCA
12.2.4 Factor analysis
12.3. Kernel PCA
12.4. Nonlinear Latent Variable Models
12.4.1 Independent component analysis
12.4.2 Autoassociative neural networks
12.4.3 Modelling nonlinear manifolds
Exercises
13. Sequential Data
13.1. Markov Models
13.2. Hidden Markov Models
13.2.1 Maximum likelihood for the HMM
13.2.2 The forward-backward algorithm
13.2.3 The sum-product algorithm for the HMM
13.2.4 Scaling factors
13.2.5 The Viterbi algorithm
13.2.6 Extensions of the hidden Markov model
13.3. Linear Dynamical Systems
13.3.1 Inference in LDS
13.3.2 Learning in LDS
13.3.3 Extensions of LDS
13.3.4 Particle filters
Exercises
14. Combining Models
14.1. Bayesian Model Averaging
14.2. Committees
14.3. Boosting
14.3.1 Minimizing exponential error
14.3.2 Error functions for boosting
14.4. Tree-based Models
14.5. Conditional Mixture Models
14.5.1 Mixtures of linear regression models
14.5.2 Mixtures of logistic models
14.5.3 Mixtures of experts
Exercises
Appendix A. Data Sets
Appendix B. Probability Distributions
Appendix C. Properties of Matrices
Appendix D. Calculus of Variations
Appendix E. Lagrange Multipliers
References
Index
Information Science and Statistics
Series Editors: M. Jordan, J. Kleinberg, B. Schölkopf
Information Science and Statistics
Akaike and Kitagawa: The Practice of Time Series Analysis.
Bishop: Pattern Recognition and Machine Learning.
Cowell, Dawid, Lauritzen, and Spiegelhalter: Probabilistic Networks and Expert Systems.
Doucet, de Freitas, and Gordon: Sequential Monte Carlo Methods in Practice.
Fine: Feedforward Neural Network Methodology.
Hawkins and Olwell: Cumulative Sum Charts and Charting for Quality Improvement.
Jensen: Bayesian Networks and Decision Graphs.
Marchette: Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint.
Rubinstein and Kroese: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation, and Machine Learning.
Studený: Probabilistic Conditional Independence Structures.
Vapnik: The Nature of Statistical Learning Theory, Second Edition.
Wallace: Statistical and Inductive Inference by Minimum Message Length.
Christopher M. Bishop
Pattern Recognition and Machine Learning
Christopher M. Bishop, F.R.Eng.
Assistant Director, Microsoft Research Ltd, Cambridge CB3 0FB, U.K.
cmbishop@microsoft.com
http://research.microsoft.com/~cmbishop

Series Editors:
Michael Jordan, Department of Computer Science and Department of Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
Professor Jon Kleinberg, Department of Computer Science, Cornell University, Ithaca, NY 14853, USA
Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany

Library of Congress Control Number: 2006922522
ISBN-10: 0-387-31073-8
ISBN-13: 978-0387-31073-2
Printed on acid-free paper.

© 2006 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in Singapore. (KYO)
9 8 7 6 5 4 3 2 1
springer.com
This book is dedicated to my family: Jenna, Mark, and Hugh

Total eclipse of the sun, Antalya, Turkey, 29 March 2006.
Preface

Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had significant impact on both algorithms and applications.

This new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources.

The book is supported by a great deal of additional material, including lecture slides as well as the complete set of figures used in the book, and the reader is encouraged to visit the book web site for the latest information:

http://research.microsoft.com/~cmbishop/PRML
Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in significant ways, and each is graded according to difficulty ranging from (⋆), which denotes a simple exercise taking a few minutes to complete, through to (⋆ ⋆ ⋆), which denotes a significantly more complex exercise.

It has been difficult to know to what extent these solutions should be made widely available. Those engaged in self study will find worked solutions very beneficial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conflicting requirements, those exercises that help amplify key points in the text, or that fill in important details, have solutions that are available as a PDF file from the book web site. Such exercises are denoted by www. Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required.

Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book.

Acknowledgements

First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of figures and with the typesetting of the book in LaTeX. His assistance has been invaluable.

I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its affiliates).

Springer has provided excellent support throughout the final stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi.

I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17.