Gaussian Processes for Machine Learning
Carl Edward Rasmussen and Christopher K. I. Williams
Gaussian Processes for Machine Learning
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors
Bioinformatics: The Machine Learning Approach,
Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction,
Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication,
Brendan J. Frey
Learning in Graphical Models,
Michael I. Jordan
Causation, Prediction, and Search, second edition,
Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining,
David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition,
Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms,
Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond,
Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning,
Ethem Alpaydin
Gaussian Processes for Machine Learning,
Carl Edward Rasmussen and Christopher K. I. Williams
Gaussian Processes for Machine Learning
Carl Edward Rasmussen
Christopher K. I. Williams
The MIT Press
Cambridge, Massachusetts
London, England
© 2006 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical
means (including photocopying, recording, or information storage and retrieval) without permission in
writing from the publisher.
MIT Press books may be purchased at special quantity discounts for business or sales promotional use.
For information, please email special_sales@mitpress.mit.edu or write to Special Sales Department,
The MIT Press, 55 Hayward Street, Cambridge, MA 02142.
Typeset by the authors using LaTeX 2ε.
This book was printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data
Rasmussen, Carl Edward.
Gaussian processes for machine learning / Carl Edward Rasmussen, Christopher K. I. Williams.
p. cm. —(Adaptive computation and machine learning)
Includes bibliographical references and indexes.
ISBN 0-262-18253-X
1. Gaussian processes—Data processing. 2. Machine learning—Mathematical models.
I. Williams, Christopher K. I. II. Title. III. Series.
QA274.4.R37 2006
519.2′3—dc22    2005053433

10 9 8 7 6 5 4 3 2 1
The actual science of logic is conversant at present only with things either
certain, impossible, or entirely doubtful, none of which (fortunately) we have to
reason on. Therefore the true logic for this world is the calculus of Probabilities,
which takes account of the magnitude of the probability which is, or ought to
be, in a reasonable man’s mind.
— James Clerk Maxwell [1850]
Contents
Series Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   xi
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  xiii
Symbols and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  xvii
1 Introduction                                                                        1
  1.1 A Pictorial Introduction to Bayesian Modelling . . . . . . . . . . . . . . .   3
  1.2 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   5

2 Regression                                                                         7
  2.1 Weight-space View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   7
      2.1.1 The Standard Linear Model . . . . . . . . . . . . . . . . . . . . . . .   8
      2.1.2 Projections of Inputs into Feature Space . . . . . . . . . . . . . . .  11
  2.2 Function-space View . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  13
  2.3 Varying the Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . .  19
  2.4 Decision Theory for Regression . . . . . . . . . . . . . . . . . . . . . . .  21
  2.5 An Example Application . . . . . . . . . . . . . . . . . . . . . . . . . . .  22
  2.6 Smoothing, Weight Functions and Equivalent Kernels . . . . . . . . . . . . .  24
 ∗2.7 Incorporating Explicit Basis Functions . . . . . . . . . . . . . . . . . . .  27
      2.7.1 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . .  29
  2.8 History and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . .  29
  2.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  30
3 Classification                                                                    33
  3.1 Classification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . .  34
      3.1.1 Decision Theory for Classification . . . . . . . . . . . . . . . . . .  35
  3.2 Linear Models for Classification . . . . . . . . . . . . . . . . . . . . . .  37
  3.3 Gaussian Process Classification . . . . . . . . . . . . . . . . . . . . . . .  39
  3.4 The Laplace Approximation for the Binary GP Classifier . . . . . . . . . . .  41
      3.4.1 Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  42
      3.4.2 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  44
      3.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . .  45
      3.4.4 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . .  47
 ∗3.5 Multi-class Laplace Approximation . . . . . . . . . . . . . . . . . . . . . .  48
      3.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . .  51
  3.6 Expectation Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . .  52
      3.6.1 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  56
      3.6.2 Marginal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . .  57
      3.6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . .  57
  3.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  60
      3.7.1 A Toy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  60
      3.7.2 One-dimensional Example . . . . . . . . . . . . . . . . . . . . . . . .  62
      3.7.3 Binary Handwritten Digit Classification Example . . . . . . . . . . . .  63
      3.7.4 10-class Handwritten Digit Classification Example . . . . . . . . . . .  70
  3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  72

∗Sections marked by an asterisk contain advanced material that may be omitted on a first reading.