Optimization for Machine Learning
Neural Information Processing Series
Michael I. Jordan and Thomas Dietterich, editors

Advances in Large Margin Classifiers, Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, eds., 2000
Advanced Mean Field Methods: Theory and Practice, Manfred Opper and David Saad, eds., 2001
Probabilistic Models of the Brain: Perception and Neural Function, Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, eds., 2002
Exploratory Analysis and Data Modeling in Functional Neuroimaging, Friedrich T. Sommer and Andrzej Wichert, eds., 2003
Advances in Minimum Description Length: Theory and Applications, Peter D. Grünwald, In Jae Myung, and Mark A. Pitt, eds., 2005
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell, eds., 2006
New Directions in Statistical Signal Processing: From Systems to Brains, Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter, eds., 2007
Predicting Structured Data, Gökhan BakIr, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan, eds., 2007
Toward Brain-Computer Interfacing, Guido Dornhege, José del R. Millán, Thilo Hinterberger, Dennis J. McFarland, and Klaus-Robert Müller, eds., 2007
Large-Scale Kernel Machines, Léon Bottou, Olivier Chapelle, Denis DeCoste, and Jason Weston, eds., 2007
Learning Machine Translation, Cyril Goutte, Nicola Cancedda, Marc Dymetman, and George Foster, eds., 2009
Dataset Shift in Machine Learning, Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds., 2009
Optimization for Machine Learning, Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, eds., 2012
Optimization for Machine Learning
Edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright
The MIT Press
Cambridge, Massachusetts
London, England
© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

For information about special quantity discounts, please email special_sales@mitpress.mit.edu

This book was set in LaTeX by the authors and editors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Optimization for machine learning / edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright.
p. cm. — (Neural information processing series)
Includes bibliographical references.
ISBN 978-0-262-01646-9 (hardcover : alk. paper)
1. Machine learning—Mathematical models. 2. Mathematical optimization. I. Sra, Suvrit, 1976– II. Nowozin, Sebastian, 1980– III. Wright, Stephen J., 1960–
Q325.5.O65 2012
006.3'1—dc22
2011002059

10 9 8 7 6 5 4 3 2 1
Contents

Series Foreword xi
Preface xiii

1 Introduction: Optimization and Machine Learning 1
S. Sra, S. Nowozin, and S. J. Wright
1.1 Support Vector Machines 2
1.2 Regularized Optimization 7
1.3 Summary of the Chapters 11
1.4 References 15

2 Convex Optimization with Sparsity-Inducing Norms 19
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski
2.1 Introduction 19
2.2 Generic Methods 26
2.3 Proximal Methods 27
2.4 (Block) Coordinate Descent Algorithms 32
2.5 Reweighted-ℓ2 Algorithms 34
2.6 Working-Set Methods 36
2.7 Quantitative Evaluation 40
2.8 Extensions 47
2.9 Conclusion 48
2.10 References 49

3 Interior-Point Methods for Large-Scale Cone Programming 55
M. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe
3.1 Introduction 56
3.2 Primal-Dual Interior-Point Methods 60
3.3 Linear and Quadratic Programming 64
3.4 Second-Order Cone Programming 71
3.5 Semidefinite Programming 74
3.6 Conclusion 79
3.7 References 79

4 Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey 85
D. P. Bertsekas
4.1 Introduction 86
4.2 Incremental Subgradient-Proximal Methods 98
4.3 Convergence for Methods with Cyclic Order 102
4.4 Convergence for Methods with Randomized Order 108
4.5 Some Applications 111
4.6 Conclusions 114
4.7 References 115

5 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods 121
A. Juditsky and A. Nemirovski
5.1 Introduction 121
5.2 Mirror Descent Algorithm: Minimizing over a Simple Set 126
5.3 Problems with Functional Constraints 130
5.4 Minimizing Strongly Convex Functions 131
5.5 Mirror Descent Stochastic Approximation 134
5.6 Mirror Descent for Convex-Concave Saddle-Point Problems 135
5.7 Setting up a Mirror Descent Method 139
5.8 Notes and Remarks 145
5.9 References 146

6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem’s Structure 149
A. Juditsky and A. Nemirovski
6.1 Introduction 149
6.2 Saddle-Point Reformulations of Convex Minimization Problems 151
6.3 Mirror-Prox Algorithm 154
6.4 Accelerating the Mirror-Prox Algorithm 160
6.5 Accelerating First-Order Methods by Randomization 171
6.6 Notes and Remarks 179
6.7 References 181

7 Cutting-Plane Methods in Machine Learning 185
V. Franc, S. Sonnenburg, and T. Werner
7.1 Introduction to Cutting-plane Methods 187
7.2 Regularized Risk Minimization 191
7.3 Multiple Kernel Learning 197
7.4 MAP Inference in Graphical Models 203
7.5 References 214

8 Introduction to Dual Decomposition for Inference 219
D. Sontag, A. Globerson, and T. Jaakkola
8.1 Introduction 220
8.2 Motivating Applications 222
8.3 Dual Decomposition and Lagrangian Relaxation 224
8.4 Subgradient Algorithms 229
8.5 Block Coordinate Descent Algorithms 232
8.6 Relations to Linear Programming Relaxations 240
8.7 Decoding: Finding the MAP Assignment 242
8.8 Discussion 245
8.10 References 252

9 Augmented Lagrangian Methods for Learning, Selecting, and Combining Features 255
R. Tomioka, T. Suzuki, and M. Sugiyama
9.1 Introduction 256
9.2 Background 258
9.3 Proximal Minimization Algorithm 263
9.4 Dual Augmented Lagrangian (DAL) Algorithm 265
9.5 Connections 272
9.6 Application 276
9.7 Summary 280
9.9 References 282

10 The Convex Optimization Approach to Regret Minimization 287
E. Hazan
10.1 Introduction 287
10.2 The RFTL Algorithm and Its Analysis 291
10.3 The “Primal-Dual” Approach 294
10.4 Convexity of Loss Functions 298
10.5 Recent Applications 300
10.6 References 302

11 Projected Newton-type Methods in Machine Learning 305
M. Schmidt, D. Kim, and S. Sra
11.1 Introduction 305
11.2 Projected Newton-type Methods 306
11.3 Two-Metric Projection Methods 312