Foundations and Trends® in Machine Learning
Vol. 3, No. 1 (2010) 1–122
© 2011 S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein
DOI: 10.1561/2200000016
Distributed Optimization and Statistical
Learning via the Alternating Direction
Method of Multipliers
Stephen Boyd1, Neal Parikh2, Eric Chu3
Borja Peleato4 and Jonathan Eckstein5
1 Electrical Engineering Department, Stanford University, Stanford, CA
94305, USA, boyd@stanford.edu
2 Computer Science Department, Stanford University, Stanford, CA 94305,
USA, npparikh@cs.stanford.edu
3 Electrical Engineering Department, Stanford University, Stanford, CA
94305, USA, echu508@stanford.edu
4 Electrical Engineering Department, Stanford University, Stanford, CA
94305, USA, peleato@stanford.edu
5 Management Science and Information Systems Department and
RUTCOR, Rutgers University, Piscataway, NJ 08854, USA,
jeckstei@rci.rutgers.edu
Contents
1 Introduction
2 Precursors
2.1 Dual Ascent
2.2 Dual Decomposition
2.3 Augmented Lagrangians and the Method of Multipliers
3 Alternating Direction Method of Multipliers
3.1 Algorithm
3.2 Convergence
3.3 Optimality Conditions and Stopping Criterion
3.4 Extensions and Variations
3.5 Notes and References
4 General Patterns
4.1 Proximity Operator
4.2 Quadratic Objective Terms
4.3 Smooth Objective Terms
4.4 Decomposition
5 Constrained Convex Optimization
5.1 Convex Feasibility
5.2 Linear and Quadratic Programming
6 ℓ1-Norm Problems
6.1 Least Absolute Deviations
6.2 Basis Pursuit
6.3 General ℓ1 Regularized Loss Minimization
6.4 Lasso
6.5 Sparse Inverse Covariance Selection
7 Consensus and Sharing
7.1 Global Variable Consensus Optimization
7.2 General Form Consensus Optimization
7.3 Sharing
8 Distributed Model Fitting
8.1 Examples
8.2 Splitting across Examples
8.3 Splitting across Features
9 Nonconvex Problems
9.1 Nonconvex Constraints
9.2 Bi-convex Problems
10 Implementation
10.1 Abstract Implementation
10.2 MPI
10.3 Graph Computing Frameworks
10.4 MapReduce
11 Numerical Examples
11.1 Small Dense Lasso
11.2 Distributed ℓ1 Regularized Logistic Regression
11.3 Group Lasso with Feature Splitting
11.4 Distributed Large-Scale Lasso with MPI
11.5 Regressor Selection
12 Conclusions
Acknowledgments
A Convergence Proof
References
Abstract
Many problems of recent interest in statistics and machine learning
can be posed in the framework of convex optimization. Due to the
explosion in size and complexity of modern datasets, it is increasingly
important to be able to solve problems with a very large number of fea-
tures or training examples. As a result, both the decentralized collection
or storage of these datasets and the accompanying distributed solution
methods are either necessary or at least highly desirable. In this
review, we argue that the alternating direction method of multipliers
is well suited to distributed convex optimization, and in particular to
large-scale problems arising in statistics, machine learning, and related
areas. The method was developed in the 1970s, with roots in the 1950s,
and is equivalent or closely related to many other algorithms, such
as dual decomposition, the method of multipliers, Douglas–Rachford
splitting, Spingarn’s method of partial inverses, Dykstra’s alternating
projections, Bregman iterative algorithms for ℓ1 problems, proximal
methods, and others. After briefly surveying the theory and history of
the algorithm, we discuss applications to a wide variety of statistical
and machine learning problems of recent interest, including the lasso,
sparse logistic regression, basis pursuit, covariance selection, support
vector machines, and many others. We also discuss general distributed
optimization, extensions to the nonconvex setting, and efficient imple-
mentation, including some details on distributed MPI and Hadoop
MapReduce implementations.
1 Introduction
In all applied fields, it is now commonplace to attack problems through
data analysis, particularly through the use of statistical and machine
learning algorithms on what are often large datasets. In industry, this
trend has been referred to as ‘Big Data’, and it has had a significant
impact in areas as varied as artificial intelligence, internet applications,
computational biology, medicine, finance, marketing, journalism, net-
work analysis, and logistics.
Though these problems arise in diverse application domains, they
share some key characteristics. First, the datasets are often extremely
large, consisting of hundreds of millions or billions of training examples;
second, the data is often very high-dimensional, because it is now possi-
ble to measure and store very detailed information about each example;
and third, because of the large scale of many applications, the data is
often stored or even collected in a distributed manner. As a result, it
has become of central importance to develop algorithms that are both
rich enough to capture the complexity of modern data, and scalable
enough to process huge datasets in a parallelized or fully decentral-
ized fashion. Indeed, some researchers [92] have suggested that even
highly complex and structured problems may succumb most easily to
relatively simple models trained on vast datasets.
Many such problems can be posed in the framework of convex opti-
mization. Given the significant work on decomposition methods and
decentralized algorithms in the optimization community, it is natural
to look to parallel optimization algorithms as a mechanism for solving
large-scale statistical tasks. This approach also has the benefit that one
algorithm could be flexible enough to solve many problems.
This review discusses the alternating direction method of multipli-
ers (ADMM), a simple but powerful algorithm that is well suited to
distributed convex optimization, and in particular to problems aris-
ing in applied statistics and machine learning. It takes the form of a
decomposition-coordination procedure, in which the solutions to small
local subproblems are coordinated to find a solution to a large global
problem. ADMM can be viewed as an attempt to blend the benefits
of dual decomposition and augmented Lagrangian methods for con-
strained optimization, two earlier approaches that we review in §2. It
turns out to be equivalent or closely related to many other algorithms
as well, such as Douglas–Rachford splitting from numerical analysis,
Spingarn’s method of partial inverses, Dykstra’s alternating projec-
tions method, Bregman iterative algorithms for ℓ1 problems in signal
processing, proximal methods, and many others. The fact that it has
been re-invented in different fields over the decades underscores the
intuitive appeal of the approach.
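To fix ideas, the iteration has the following shape; the precise statement,
assumptions, and convergence results are given in §3. For a problem of the form

\[
\text{minimize } f(x) + g(z) \quad \text{subject to } Ax + Bz = c,
\]

with f and g convex, ADMM carries out the updates

\[
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \; L_\rho(x, z^k, y^k), \\
z^{k+1} &:= \operatorname*{argmin}_z \; L_\rho(x^{k+1}, z, y^k), \\
y^{k+1} &:= y^k + \rho\,(Ax^{k+1} + Bz^{k+1} - c),
\end{aligned}
\]

where

\[
L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2
\]

is the augmented Lagrangian and ρ > 0 is a penalty parameter. Because the x- and
z-updates each involve only one of the two objective terms, the two parts of the
problem are handled separately and coordinated through the dual variable y.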
It is worth emphasizing that the algorithm itself is not new, and that
we do not present any new theoretical results. It was first introduced
in the mid-1970s by Gabay, Mercier, Glowinski, and Marrocco, though
similar ideas emerged as early as the mid-1950s. The algorithm was
studied throughout the 1980s, and by the mid-1990s, almost all of the
theoretical results mentioned here had been established. The fact that
ADMM was developed so far in advance of the ready availability of
large-scale distributed computing systems and massive optimization
problems may account for why it is not as widely known today as we
believe it should be.
The main contributions of this review can be summarized as follows:
(1) We provide a simple, cohesive discussion of the extensive
literature in a way that emphasizes and unifies the aspects
of primary importance in applications.
(2) We show, through a number of examples, that the algorithm
is well suited for a wide variety of large-scale distributed mod-
ern problems. We derive methods for decomposing a wide
class of statistical problems by training examples and by fea-
tures, which is not easily accomplished in general.
(3) We place a greater emphasis on practical large-scale imple-
mentation than most previous references. In particular, we
discuss the implementation of the algorithm in cloud com-
puting environments using standard frameworks and provide
easily readable implementations of many of our examples.
Throughout, the focus is on applications rather than theory, and a main
goal is to provide the reader with a kind of ‘toolbox’ that can be applied
in many situations to derive and implement a distributed algorithm of
practical use. Though the focus here is on parallelism, the algorithm
can also be used serially, and it is interesting to note that with no
tuning, ADMM can be competitive with the best known methods for
some problems.
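As a small illustration of the serial case, the following is a minimal Python
sketch of ADMM applied to the lasso, which is treated in detail in §6.4 and
§11.1. The synthetic data, the fixed penalty parameter rho, the regularization
weight lam, the iteration count, and the helper names soft_threshold and
lasso_admm are illustrative choices made here, not part of the original
examples.

import numpy as np

def soft_threshold(v, kappa):
    # Elementwise soft-thresholding operator S_kappa(v).
    return np.maximum(0.0, v - kappa) - np.maximum(0.0, -v - kappa)

def lasso_admm(A, b, lam, rho=1.0, num_iters=200):
    # Minimize (1/2)||Ax - b||_2^2 + lam*||x||_1 via ADMM:
    # x-update is a ridge-type solve, z-update is soft thresholding,
    # and u is the scaled dual variable.
    m, n = A.shape
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    Atb = A.T @ b
    # Factor (A^T A + rho*I) once and reuse it in every x-update.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    for _ in range(num_iters):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        z = soft_threshold(x + u, lam / rho)
        u = u + x - z
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 100))
    x_true = np.zeros(100)
    x_true[:5] = rng.standard_normal(5)
    b = A @ x_true + 0.01 * rng.standard_normal(50)
    x_hat = lasso_admm(A, b, lam=0.1)
    print("nonzero coefficients:", np.count_nonzero(np.abs(x_hat) > 1e-4))

The per-iteration work is a cached triangular solve and an elementwise
shrinkage, which is part of why the method can be run with essentially no
tuning.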
While we have emphasized applications that can be concisely
explained, the algorithm would also be a natural fit for more compli-
cated problems in areas like graphical models. In addition, though our
focus is on statistical learning problems, the algorithm is readily appli-
cable in many other cases, such as in engineering design, multi-period
portfolio optimization, time series analysis, network flow, or scheduling.
Outline
We begin in §2 with a brief review of dual decomposition and the
method of multipliers, two important precursors to ADMM. This sec-
tion is intended mainly for background and can be skimmed. In §3,
we present ADMM, including a basic convergence theorem, some vari-
ations on the basic version that are useful in practice, and a survey of
some of the key literature. A complete convergence proof is given in
appendix A.
In §4, we describe some general patterns that arise in applications
of the algorithm, such as cases when one of the steps in ADMM can