Optimization for Machine Learning
Neural Information Processing Series
Michael I. Jordan and Thomas Dietterich, editors
Advances in Large Margin Classifiers, Alexander J. Smola, Peter L. Bartlett,
Bernhard Schölkopf, and Dale Schuurmans, eds., 2000
Advanced Mean Field Methods: Theory and Practice, Manfred Opper and
David Saad, eds., 2001
Probabilistic Models of the Brain: Perception and Neural Function, Rajesh
P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, eds., 2002
Exploratory Analysis and Data Modeling in Functional Neuroimaging,
Friedrich T. Sommer and Andrzej Wichert, eds., 2003
Advances in Minimum Description Length: Theory and Applications, Peter
D. Grünwald, In Jae Myung, and Mark A. Pitt, eds., 2005
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice,
Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell, eds., 2006
New Directions in Statistical Signal Processing: From Systems to Brains,
Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter,
eds., 2007
Predicting Structured Data, Gökhan BakIr, Thomas Hofmann, Bernhard
Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan,
eds., 2007
Toward Brain-Computer Interfacing, Guido Dornhege, José del R. Millán,
Thilo Hinterberger, Dennis J. McFarland, and Klaus-Robert Müller, eds.,
2007
Large-Scale Kernel Machines, Léon Bottou, Olivier Chapelle, Dennis
DeCoste, and Jason Weston, eds., 2007
Learning Machine Translation, Cyril Goutte, Nicola Cancedda, Marc
Dymetman, and George Foster, eds., 2009
Dataset Shift in Machine Learning, Joaquin Quiñonero-Candela, Masashi
Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds., 2009
Optimization for Machine Learning, Suvrit Sra, Sebastian Nowozin, and
Stephen J. Wright, eds., 2012
Optimization for Machine Learning
Edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright
The MIT Press
Cambridge, Massachusetts
London, England
© 2012 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any
electronic or mechanical means (including photocopying, recording, or
information storage and retrieval) without permission in writing from the
publisher.
For information about special quantity discounts, please email
special_sales@mitpress.mit.edu
This book was set in LaTeX by the authors and editors. Printed and bound in the
United States of America.
Library of Congress Cataloging-in-Publication Data
Optimization for machine learning / edited by Suvrit Sra, Sebastian Nowozin, and
Stephen J. Wright.
p. cm. — (Neural information processing series)
Includes bibliographical references.
ISBN 978-0-262-01646-9 (hardcover : alk. paper) 1. Machine learning—
Mathematical models. 2. Mathematical optimization. I. Sra, Suvrit, 1976– II.
Nowozin, Sebastian, 1980– III. Wright, Stephen J., 1960–
Q325.5.O65 2012
006.3'1—dc22
2011002059
10 9 8 7 6 5 4 3 2 1
Contents
Series Foreword   xi
Preface   xiii

1  Introduction: Optimization and Machine Learning   1
   S. Sra, S. Nowozin, and S. J. Wright
   1.1  Support Vector Machines   2
   1.2  Regularized Optimization   7
   1.3  Summary of the Chapters   11
   1.4  References   15

2  Convex Optimization with Sparsity-Inducing Norms   19
   F. Bach, R. Jenatton, J. Mairal, and G. Obozinski
   2.1  Introduction   19
   2.2  Generic Methods   26
   2.3  Proximal Methods   27
   2.4  (Block) Coordinate Descent Algorithms   32
   2.5  Reweighted-ℓ2 Algorithms   34
   2.6  Working-Set Methods   36
   2.7  Quantitative Evaluation   40
   2.8  Extensions   47
   2.9  Conclusion   48
   2.10 References   49

3  Interior-Point Methods for Large-Scale Cone Programming   55
   M. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe
   3.1  Introduction   56
   3.2  Primal-Dual Interior-Point Methods   60
   3.3  Linear and Quadratic Programming   64
   3.4  Second-Order Cone Programming   71
   3.5  Semidefinite Programming   74
   3.6  Conclusion   79
   3.7  References   79

4  Incremental Gradient, Subgradient, and Proximal Methods
   for Convex Optimization: A Survey   85
   D. P. Bertsekas
   4.1  Introduction   86
   4.2  Incremental Subgradient-Proximal Methods   98
   4.3  Convergence for Methods with Cyclic Order   102
   4.4  Convergence for Methods with Randomized Order   108
   4.5  Some Applications   111
   4.6  Conclusions   114
   4.7  References   115

5  First-Order Methods for Nonsmooth Convex Large-Scale
   Optimization, I: General Purpose Methods   121
   A. Juditsky and A. Nemirovski
   5.1  Introduction   121
   5.2  Mirror Descent Algorithm: Minimizing over a Simple Set   126
   5.3  Problems with Functional Constraints   130
   5.4  Minimizing Strongly Convex Functions   131
   5.5  Mirror Descent Stochastic Approximation   134
   5.6  Mirror Descent for Convex-Concave Saddle-Point Problems   135
   5.7  Setting up a Mirror Descent Method   139
   5.8  Notes and Remarks   145
   5.9  References   146

6  First-Order Methods for Nonsmooth Convex Large-Scale
   Optimization, II: Utilizing Problem’s Structure   149
   A. Juditsky and A. Nemirovski
   6.1  Introduction   149
   6.2  Saddle-Point Reformulations of Convex Minimization Problems   151
   6.3  Mirror-Prox Algorithm   154
   6.4  Accelerating the Mirror-Prox Algorithm   160
   6.5  Accelerating First-Order Methods by Randomization   171
   6.6  Notes and Remarks   179
   6.7  References   181

7  Cutting-Plane Methods in Machine Learning   185
   V. Franc, S. Sonnenburg, and T. Werner
   7.1  Introduction to Cutting-plane Methods   187
   7.2  Regularized Risk Minimization   191
   7.3  Multiple Kernel Learning   197
   7.4  MAP Inference in Graphical Models   203
   7.5  References   214

8  Introduction to Dual Decomposition for Inference   219
   D. Sontag, A. Globerson, and T. Jaakkola
   8.1  Introduction   220
   8.2  Motivating Applications   222
   8.3  Dual Decomposition and Lagrangian Relaxation   224
   8.4  Subgradient Algorithms   229
   8.5  Block Coordinate Descent Algorithms   232
   8.6  Relations to Linear Programming Relaxations   240
   8.7  Decoding: Finding the MAP Assignment   242
   8.8  Discussion   245
   8.10 References   252

9  Augmented Lagrangian Methods for Learning, Selecting,
   and Combining Features   255
   R. Tomioka, T. Suzuki, and M. Sugiyama
   9.1  Introduction   256
   9.2  Background   258
   9.3  Proximal Minimization Algorithm   263
   9.4  Dual Augmented Lagrangian (DAL) Algorithm   265
   9.5  Connections   272
   9.6  Application   276
   9.7  Summary   280
   9.9  References   282

10 The Convex Optimization Approach to Regret Minimization   287
   E. Hazan
   10.1 Introduction   287
   10.2 The RFTL Algorithm and Its Analysis   291
   10.3 The “Primal-Dual” Approach   294
   10.4 Convexity of Loss Functions   298
   10.5 Recent Applications   300
   10.6 References   302

11 Projected Newton-type Methods in Machine Learning   305
   M. Schmidt, D. Kim, and S. Sra
   11.1 Introduction   305
   11.2 Projected Newton-type Methods   306
   11.3 Two-Metric Projection Methods   312