Foundations and Trends® in Machine Learning
Vol. 3, No. 1 (2010) 1–122
© 2011 S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein
DOI: 10.1561/2200000016

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Stephen Boyd (1), Neal Parikh (2), Eric Chu (3), Borja Peleato (4) and Jonathan Eckstein (5)

(1) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, boyd@stanford.edu
(2) Computer Science Department, Stanford University, Stanford, CA 94305, USA, npparikh@cs.stanford.edu
(3) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, echu508@stanford.edu
(4) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, peleato@stanford.edu
(5) Management Science and Information Systems Department and RUTCOR, Rutgers University, Piscataway, NJ 08854, USA, jeckstei@rci.rutgers.edu
Contents

1 Introduction
2 Precursors
  2.1 Dual Ascent
  2.2 Dual Decomposition
  2.3 Augmented Lagrangians and the Method of Multipliers
3 Alternating Direction Method of Multipliers
  3.1 Algorithm
  3.2 Convergence
  3.3 Optimality Conditions and Stopping Criterion
  3.4 Extensions and Variations
  3.5 Notes and References
4 General Patterns
  4.1 Proximity Operator
  4.2 Quadratic Objective Terms
  4.3 Smooth Objective Terms
  4.4 Decomposition
5 Constrained Convex Optimization
  5.1 Convex Feasibility
  5.2 Linear and Quadratic Programming
6 ℓ1-Norm Problems
  6.1 Least Absolute Deviations
  6.2 Basis Pursuit
  6.3 General ℓ1 Regularized Loss Minimization
  6.4 Lasso
  6.5 Sparse Inverse Covariance Selection
7 Consensus and Sharing
  7.1 Global Variable Consensus Optimization
  7.2 General Form Consensus Optimization
  7.3 Sharing
8 Distributed Model Fitting
  8.1 Examples
  8.2 Splitting across Examples
  8.3 Splitting across Features
9 Nonconvex Problems
  9.1 Nonconvex Constraints
  9.2 Bi-convex Problems
10 Implementation
  10.1 Abstract Implementation
  10.2 MPI
  10.3 Graph Computing Frameworks
  10.4 MapReduce
11 Numerical Examples
  11.1 Small Dense Lasso
  11.2 Distributed ℓ1 Regularized Logistic Regression
  11.3 Group Lasso with Feature Splitting
  11.4 Distributed Large-Scale Lasso with MPI
  11.5 Regressor Selection
12 Conclusions
Acknowledgments
A Convergence Proof
References
Abstract

Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets and the accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas–Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for ℓ1 problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
1 Introduction

In all applied fields, it is now commonplace to attack problems through data analysis, particularly through the use of statistical and machine learning algorithms on what are often large datasets. In industry, this trend has been referred to as 'Big Data', and it has had a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, finance, marketing, journalism, network analysis, and logistics.

Though these problems arise in diverse application domains, they share some key characteristics. First, the datasets are often extremely large, consisting of hundreds of millions or billions of training examples; second, the data is often very high-dimensional, because it is now possible to measure and store very detailed information about each example; and third, because of the large scale of many applications, the data is often stored or even collected in a distributed manner. As a result, it has become of central importance to develop algorithms that are both rich enough to capture the complexity of modern data, and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. Indeed, some researchers [92] have suggested that even highly complex and structured problems may succumb most easily to relatively simple models trained on vast datasets.
Many such problems can be posed in the framework of convex optimization. Given the significant work on decomposition methods and decentralized algorithms in the optimization community, it is natural to look to parallel optimization algorithms as a mechanism for solving large-scale statistical tasks. This approach also has the benefit that one algorithm could be flexible enough to solve many problems.

This review discusses the alternating direction method of multipliers (ADMM), a simple but powerful algorithm that is well suited to distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization, two earlier approaches that we review in §2. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas–Rachford splitting from numerical analysis, Spingarn's method of partial inverses, Dykstra's alternating projections method, Bregman iterative algorithms for ℓ1 problems in signal processing, proximal methods, and many others. The fact that it has been re-invented in different fields over the decades underscores the intuitive appeal of the approach.

It is worth emphasizing that the algorithm itself is not new, and that we do not present any new theoretical results. It was first introduced in the mid-1970s by Gabay, Mercier, Glowinski, and Marrocco, though similar ideas emerged as early as the mid-1950s. The algorithm was studied throughout the 1980s, and by the mid-1990s, almost all of the theoretical results mentioned here had been established. The fact that ADMM was developed so far in advance of the ready availability of large-scale distributed computing systems and massive optimization problems may account for why it is not as widely known today as we believe it should be.

The main contributions of this review can be summarized as follows:

(1) We provide a simple, cohesive discussion of the extensive literature in a way that emphasizes and unifies the aspects of primary importance in applications.
(2) We show, through a number of examples, that the algorithm is well suited for a wide variety of large-scale distributed modern problems. We derive methods for decomposing a wide class of statistical problems by training examples and by features, which is not easily accomplished in general.

(3) We place a greater emphasis on practical large-scale implementation than most previous references. In particular, we discuss the implementation of the algorithm in cloud computing environments using standard frameworks and provide easily readable implementations of many of our examples.

Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of 'toolbox' that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.

While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.

Outline

We begin in §2 with a brief review of dual decomposition and the method of multipliers, two important precursors to ADMM. This section is intended mainly for background and can be skimmed. In §3, we present ADMM, including a basic convergence theorem, some variations on the basic version that are useful in practice, and a survey of some of the key literature. A complete convergence proof is given in Appendix A.

In §4, we describe some general patterns that arise in applications of the algorithm, such as cases when one of the steps in ADMM can be carried out particularly efficiently.
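As a brief preview of the algorithm itself (derived and analyzed in §3; the display below uses the unscaled form, and the exact notation is introduced there): for a problem of the form minimize f(x) + g(z) subject to Ax + Bz = c, with f and g convex, ADMM repeats the updates

\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_{x}\; L_\rho(x, z^k, y^k), \\
z^{k+1} &:= \operatorname*{argmin}_{z}\; L_\rho(x^{k+1}, z, y^k), \\
y^{k+1} &:= y^k + \rho\,(A x^{k+1} + B z^{k+1} - c),
\end{aligned}

where L_\rho(x, z, y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2 is the augmented Lagrangian and \rho > 0 is a penalty parameter. The x- and z-minimizations are the small local subproblems referred to above, and the dual update of y coordinates them toward satisfying the coupling constraint.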