JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 109, No. 3, pp. 475–494, JUNE 2001
Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization¹

P. TSENG²

Communicated by O. L. Mangasarian
Abstract. We study the convergence properties of a (block) coordinate
descent method applied to minimize a nondifferentiable (nonconvex)
function f(x1, . . . , xN) with certain separability and regularity proper-
ties. Assuming that f is continuous on a compact level set, the sub-
sequence convergence of the iterates to a stationary point is shown when
either f is pseudoconvex in every pair of coordinate blocks from among
N−1 coordinate blocks or f has at most one minimum in each of N−2
coordinate blocks. If f is quasiconvex and hemivariate in every coordinate
block, then the assumptions of continuity of f and compactness of
the level set may be relaxed further. These results are applied to derive
new (and old) convergence results for the proximal minimization algo-
rithm, an algorithm of Arimoto and Blahut, and an algorithm of Han.
They are applied also to a problem of blind source separation.
Key Words. Block coordinate descent, nondifferentiable minimization,
stationary point, Gauss–Seidel method, convergence, quasiconvex func-
tions, pseudoconvex functions.
1. Introduction
A popular method for minimizing a real-valued continuously differen-
tiable function f of n real variables, subject to bound constraints, is the
(block) coordinate descent method. In this method, the coordinates are par-
titioned into N blocks and, at each iteration, f is minimized with respect to
one of the coordinate blocks while the other coordinates are held fixed. This
method, which is related closely to the Gauss–Seidel and SOR methods for
equation solving (Ref. 1), was studied early by Hildreth (Ref. 2) and Warga
(Ref. 3), and is described in various books on optimization (Refs. 1 and 4–10). Its applications include channel capacity computation (Refs. 11–12), image reconstruction (Ref. 7), dynamic programming (Refs. 13–15), and flow routing (Ref. 16). It may be applied also to the dual of a linearly constrained, strictly convex program to obtain various decomposition methods (see Refs. 6–7, 17–22, and references therein) and parallel SOR methods (Ref. 23).

¹This work was partially supported by the National Science Foundation Grant CCR-9731273.
²Professor, Department of Mathematics, University of Washington, Seattle, Washington.
0022-3239/01/0600-0475$19.50/0 © 2001 Plenum Publishing Corporation
Convergence of the (block) coordinate descent method typically requires
that f be strictly convex (or quasiconvex or hemivariate) and differentiable
and, taking into account the bound constraints, have bounded level sets (e.g.,
Refs. 3–4 and 24–25). Zadeh (Ref. 26; also see Ref. 27) relaxed the strict
convexity assumption to pseudoconvexity, which allows f to have a non-
unique minimum along coordinate directions. For certain classes of convex
functions, the level sets need not be bounded (see Refs. 2, 6–7, 17, 19–22,
and references therein). If f is not (pseudo)convex, then an example of
Powell (Ref. 28) shows that the method may cycle without approaching any
stationary point of f. Nonetheless, convergence can be shown for special
cases of non(pseudo)convex f, as when f is quadratic (Ref. 29), or f is strictly
pseudoconvex in each of N−2 coordinate blocks (Ref. 27), or f has a unique
minimum in each coordinate block (Ref. 8, p. 159). If f is not differentiable,
the coordinate descent method may get stuck at a nonstationary point even
when f is convex (e.g., Ref. 4, p. 94). For this reason, it is perceived generally
that the method is unsuitable when f is nondifferentiable. However, an
exception occurs when the nondifferentiable part of f is separable. Such a
structure for f was considered first by Auslender (Ref. 4, p. 94) in the case
where f is strongly convex. This structure is implicit in a decomposition
method and projection method of Han (Refs. 18, 30), for which f is the
convex dual functional associated with a certain linearly constrained convex
program (see Ref. 22 for detailed discussions). This structure arises also in
least-squares problems where an ℓ1 penalty is placed on a subset of the
parameters in order to minimize the support (see Refs. 31–33 and references
therein).
Motivated by the preceding works, we consider in this paper the non-
differentiable (nonconvex) case where the nondifferentiable part of f is sep-
arable. Specifically, we assume that f has the following special form:
f(x1, . . . , xN) = f0(x1, . . . , xN) + Σ_{k=1}^{N} fk(xk),   (1)

for some f0: ℜ^{n1+···+nN} → ℜ ∪ {∞} and some fk: ℜ^{nk} → ℜ ∪ {∞}, k = 1, . . . , N. Here, N, n1, . . . , nN are positive integers. We assume that f is proper, i.e., f ≢ ∞. We will refer to each xk, k = 1, . . . , N, as a coordinate block of x = (x1, . . . , xN). We will show that each cluster point of the iterates
generated by the (block) coordinate descent method is a stationary point of
f, provided that f0 has a certain smoothness property (see Lemma 3.1), f is
continuous on a compact level set, and either f is pseudoconvex in every
pair of coordinate blocks from among N−1 coordinate blocks, or f has at
most one minimum in each of N−2 coordinate blocks (see Theorem 4.1).
If f is quasiconvex and hemivariate in every coordinate block, then the
assumptions of continuity of f and compactness of the level set may be
relaxed further (see Proposition 5.1). These results unify and extend some
previous results in Refs. 4, 6, 8, 26–27. For example, previous results
assumed that f is pseudoconvex and that f1, . . . , fN are indicator functions
for closed convex sets, whereas we assume only that f is pseudoconvex in
every pair of coordinate blocks from among N−1 coordinate blocks, with
no additional assumption made on f1, . . . , fN. Previous results also did not
consider the case where f is not continuous on its effective domain. Lastly,
we apply our results to derive new (and old) convergence results for the
proximal minimization algorithm, an algorithm of Arimoto and Blahut
(Refs. 11–12), and an algorithm of Han (Ref. 30); see Examples 6.1–6.3.
We also apply them to a problem of blind source separation described in
Refs. 31, 33; see Example 6.4.
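To make the separable structure (1) concrete, the following sketch (an illustration, not from the paper; the particular f0, f1, f2 are hypothetical choices) builds an f of the form (1) with N = 2 scalar blocks, where f1 is the indicator of the nonnegative reals (an extended-real-valued piece) and f2 is an ℓ1 term, so the nondifferentiable part of f is separable while f0 is smooth:

```python
import math

def f0(x1, x2):
    # Smooth coupling term (hypothetical choice).
    return 0.5 * (x1 - 1.0) ** 2 + 0.5 * (x1 + x2) ** 2

def f1(x1):
    # Indicator of {x1 >= 0}: takes the value infinity outside its domain.
    return 0.0 if x1 >= 0 else math.inf

def f2(x2):
    # l1 penalty: nondifferentiable but separable in its own block.
    return abs(x2)

def f(x1, x2):
    # f = f0 + f1 + f2, an instance of (1) with N = 2 and n1 = n2 = 1.
    return f0(x1, x2) + f1(x1) + f2(x2)
```

Here f is proper (e.g., f(1, 0) < ∞) and dom f = [0, ∞) × ℜ, showing how constraints enter through the separable pieces fk.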
In our notation, ℜ^m denotes the space of m-dimensional real column
vectors. For any x, y ∈ ℜ^m, we denote by ⟨x, y⟩ the Euclidean inner product
of x and y, and by ||x|| the Euclidean norm of x, i.e., ||x|| = √⟨x, x⟩.
For any set S ⊆ ℜ^m, we denote by int(S) the interior of S and denote
bdry(S) = cl(S)\int(S).
For any h: ℜ^m → ℜ ∪ {∞}, we denote by dom h the effective domain of h,
i.e.,

dom h = {x ∈ ℜ^m | h(x) < ∞}.
For any x ∈ dom h and any d ∈ ℜ^m, we denote the (lower) directional derivative of h at x in the direction d by

h′(x; d) = lim inf_{λ↓0} [h(x + λd) − h(x)]/λ.
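As a small numerical illustration (not part of the paper), the lower directional derivative can be approximated by the difference quotient at small λ; for h(x) = |x| at x = 0 in the direction d = 1, every quotient equals 1, so h′(0; 1) = 1:

```python
def diff_quotient(h, x, d, lam):
    # The quantity [h(x + lam*d) - h(x)] / lam whose lim inf as lam -> 0+
    # defines the (lower) directional derivative h'(x; d).
    return (h(x + lam * d) - h(x)) / lam

h = abs
# For h = |.| at x = 0 with d = 1, the quotient is identically 1.
quotients = [diff_quotient(h, 0.0, 1.0, 10.0 ** (-k)) for k in range(1, 9)]
```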
We say that h is quasiconvex if

h(x + λd) ≤ max{h(x), h(x + d)},   for all x, d and λ ∈ [0, 1];

h is pseudoconvex if

h(x + d) ≥ h(x),   whenever x ∈ dom h and h′(x; d) ≥ 0

(see Ref. 34, p. 146); and h is hemivariate if h is not constant on any line segment belonging to dom h (Ref. 1). For any nonempty I ⊆ {1, . . . , m}, we say that h(x1, . . . , xm) is pseudoconvex [respectively, has at most one minimum point] in (xi)_{i∈I} if, for each fixed choice of the remaining coordinates xi, i ∉ I, the function (xi)_{i∈I} ↦ h(x1, . . . , xm) is pseudoconvex [respectively, has at most one minimum point].
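For instance, h(x) = √|x| is quasiconvex (its level sets are intervals) though not convex; the defining inequality can be spot-checked numerically, as in the following sketch (an illustration, not from the paper):

```python
import math

def h(x):
    # Quasiconvex but nonconvex: level sets {x : h(x) <= c} = [-c^2, c^2].
    return math.sqrt(abs(x))

def quasiconvex_violated(h, x, d, lam):
    # Checks the defining inequality h(x + lam*d) <= max{h(x), h(x + d)}
    # for lam in [0, 1], with a small tolerance for rounding.
    return h(x + lam * d) > max(h(x), h(x + d)) + 1e-12

samples = [(x * 0.3, d * 0.7, lam * 0.1)
           for x in range(-5, 6) for d in range(-5, 6) for lam in range(11)]
violations = [s for s in samples if quasiconvex_violated(h, *s)]
```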
2. Block Coordinate Descent Method
We describe formally the block coordinate descent (BCD) method
below.
BCD Method.

Initialization. Choose any x^0 = (x^0_1, . . . , x^0_N) ∈ dom f.

Iteration r+1, r ≥ 0. Given x^r = (x^r_1, . . . , x^r_N) ∈ dom f, choose an index s ∈ {1, . . . , N} and compute a new iterate x^{r+1} = (x^{r+1}_1, . . . , x^{r+1}_N) ∈ dom f satisfying

x^{r+1}_s ∈ arg min_{xs} f(x^r_1, . . . , x^r_{s−1}, xs, x^r_{s+1}, . . . , x^r_N),   (2)

x^{r+1}_j = x^r_j,   ∀j ≠ s.   (3)

We note that the minimization in (2) is attained if

X^0 = {x: f(x) ≤ f(x^0)}

is bounded and f is lower semicontinuous (lsc) on X^0, so that X^0 is compact (Ref. 35). Alternatively, this minimization is attained if f is convex, has a minimum point, and is hemivariate in each coordinate block (but the level sets of f need not be bounded). To ensure convergence, we need further that each coordinate block is chosen sufficiently often in the method. In particular, we will choose the coordinate blocks according to the following rule (see, e.g., Refs. 7–8, 21, 25).
Essentially Cyclic Rule. There exists a constant T ≥ N such that every index s ∈ {1, . . . , N} is chosen at least once between the rth iteration and the (r+T−1)th iteration, for all r.

A well-known special case of this rule, for which T = N, is given below.

Cyclic Rule. Choose s = k at iterations k, k+N, k+2N, . . . , for k = 1, . . . , N.
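As an executable sketch of the BCD method under the cyclic rule (an illustration under stronger assumptions than the paper requires), take f of the form (1) with f0(x) = ½||Ax − b||², each block a single coordinate, and fk(xk) = λ|xk|; then the exact minimization in step (2) has the closed form of scalar soft-thresholding:

```python
import numpy as np

def soft_threshold(a, lam):
    # Exact scalar minimizer of 0.5*(x - a)**2 + lam*|x|.
    return np.sign(a) * max(abs(a) - lam, 0.0)

def bcd_lasso(A, b, lam, x0, n_iters=200):
    """Cyclic BCD on f(x) = 0.5*||A x - b||^2 + lam * sum_k |x_k|,
    with each coordinate block a single coordinate (n_k = 1)."""
    x = x0.astype(float).copy()
    n = x.size
    col_sq = (A * A).sum(axis=0)          # ||A[:, k]||^2 for each column
    for it in range(n_iters):
        k = it % n                        # cyclic rule: s = k at iterations k, k+N, ...
        r = b - A @ x + A[:, k] * x[k]    # residual with the kth term removed
        # Exact minimization of f over the single coordinate x_k (step (2)):
        x[k] = soft_threshold(A[:, k] @ r / col_sq[k], lam / col_sq[k])
    return x
```

For λ > 0 the level set X^0 is bounded (f(x) ≥ λ||x||_1), so the minimization in (2) is attained, matching the remark above.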
3. Stationary Points of f

We say that z is a stationary point of f if z ∈ dom f and

f′(z; d) ≥ 0,   ∀d.

We say that z is a coordinatewise minimum point of f if z ∈ dom f and

f(z + (0, . . . , dk, . . . , 0)) ≥ f(z),   ∀dk ∈ ℜ^{nk},   (4)

for all k = 1, . . . , N. Here and throughout, we denote by (0, . . . , dk, . . . , 0) the vector in ℜ^{n1+···+nN} whose kth coordinate block is dk and whose other coordinates are zero. We say that f is regular at z ∈ dom f if

f′(z; d) ≥ 0,   ∀d = (d1, . . . , dN) such that f′(z; (0, . . . , dk, . . . , 0)) ≥ 0, k = 1, . . . , N.   (5)

This notion of regularity is weaker than that used by Auslender (Ref. 4, p. 93), which entails

f′(z; d) = Σ_{k=1}^{N} f′(z; (0, . . . , dk, . . . , 0)),   for all d = (d1, . . . , dN).
For example, the function

f(x1, x2) = φ(x1, x2) + φ(−x1, x2) + φ(x1, −x2) + φ(−x1, −x2),

where

φ(a, b) = max{0, a + b − √(a² + b²)},

is regular at z = (0, 0) in the sense of (5), but is not regular in the sense of Ref. 4, p. 93.
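This can be checked numerically (an illustration, not from the paper): each φ ≥ 0, f(0, 0) = 0, and f is positively homogeneous, so the difference quotient f(λd)/λ equals f(d) for every λ > 0. The coordinatewise directional derivatives at (0, 0) vanish, yet f′((0, 0); (1, 1)) = 2 − √2 > 0, so the sum formula of Ref. 4 fails while every directional derivative at (0, 0) is nonnegative, consistent with (5):

```python
import math

def phi(a, b):
    return max(0.0, a + b - math.sqrt(a * a + b * b))

def f(x1, x2):
    return (phi(x1, x2) + phi(-x1, x2)
            + phi(x1, -x2) + phi(-x1, -x2))

def dderiv0(d1, d2, lam=1e-7):
    # Difference quotient at z = (0, 0); since f(0, 0) = 0 and f is
    # positively homogeneous, this equals f(d1, d2) for every lam > 0.
    return f(lam * d1, lam * d2) / lam
```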
Since (4) implies

f′(z; (0, . . . , dk, . . . , 0)) ≥ 0,   for all dk,
it follows that a coordinatewise minimum point z of f is a stationary point
of f whenever f is regular at z. To ensure regularity of f at z, we consider
one of the following smoothness assumptions on f0:
(A1) dom f0 is open and f0 is Gâteaux-differentiable on dom f0.

(A2) f0 is Gâteaux-differentiable on int(dom f0) and, for every z ∈ dom f ∩ bdry(dom f0), there exist k ∈ {1, . . . , N} and dk ∈ ℜ^{nk} such that f(z + (0, . . . , dk, . . . , 0)) < f(z).
Assumption A1 was considered essentially by Auslender (Ref. 4,
Example 2 on p. 94). In contrast to Assumption A1, Assumption A2 allows
dom f0 to include boundary points. We will see an application (Example
6.2) where A2 holds but not A1.
Lemma 3.1. Under A1, f is regular at each z∈dom f. Under A2, f
is regular at each coordinatewise minimum point z of f.
Proof. Under A1, if z = (z1, . . . , zN) ∈ dom f, then z ∈ dom f0. Under A2, if z = (z1, . . . , zN) is a coordinatewise minimum point of f, then z ∉ bdry(dom f0), so z ∈ int(dom f0). Thus, under either A1 or A2, f0 is Gâteaux-differentiable at z. Fix any d = (d1, . . . , dN) such that

f′(z; (0, . . . , dk, . . . , 0)) ≥ 0,   k = 1, . . . , N.

Then,

f′(z; d) = ⟨∇f0(z), d⟩ + lim inf_{λ↓0} Σ_{k=1}^{N} [fk(zk + λdk) − fk(zk)]/λ
         ≥ ⟨∇f0(z), d⟩ + Σ_{k=1}^{N} lim inf_{λ↓0} [fk(zk + λdk) − fk(zk)]/λ
         = ⟨∇f0(z), d⟩ + Σ_{k=1}^{N} f′k(zk; dk)
         = Σ_{k=1}^{N} f′(z; (0, . . . , dk, . . . , 0))
         ≥ 0.   □
4. Convergence Analysis: I
Our first convergence result unifies and extends a result of Auslender
(Ref. 4, p. 95) for the nondifferentiable convex case and some results of
Grippo and Sciandrone (Ref. 27), Luenberger (Ref. 8, p. 159), and Zadeh
(Ref. 26) for the differentiable case. In what follows, r ≡ (N−1) mod N means r = N−1, 2N−1, 3N−1, . . . .
Theorem 4.1. Assume that the level set X^0 = {x: f(x) ≤ f(x^0)} is compact and that f is continuous on X^0. Then, the sequence {x^r = (x^r_1, . . . , x^r_N)}_{r=0,1,...} generated by the BCD method using the essentially cyclic rule is defined and bounded. Moreover, the following statements hold:

(a) If f(x1, . . . , xN) is pseudoconvex in (xk, xi) for every i, k ∈ {1, . . . , N}, and if f is regular at every x ∈ X^0, then every cluster point of {x^r} is a stationary point of f.

(b) If f(x1, . . . , xN) is pseudoconvex in (xk, xi) for every i, k ∈ {1, . . . , N−1}, if f is regular at every x ∈ X^0, and if the cyclic rule is used, then every cluster point of {x^r}_{r ≡ (N−1) mod N} is a stationary point of f.

(c) If f(x1, . . . , xN) has at most one minimum in xk for k = 2, . . . , N−1, and if the cyclic rule is used, then every cluster point z of {x^r}_{r ≡ (N−1) mod N} is a coordinatewise minimum point of f. In addition, if f is regular at z, then z is a stationary point of f.
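As a toy check of part (a) (an illustration under stronger assumptions than the theorem needs: here f is smooth and strictly convex, hence pseudoconvex in every pair of blocks and regular everywhere, with compact level sets), cyclic BCD on f(x1, x2) = (x1 − x2)² + x1² drives the iterates to the unique stationary point (0, 0):

```python
def bcd_toy(x1, x2, sweeps=60):
    # f(x1, x2) = (x1 - x2)**2 + x1**2; exact minimization per block:
    #   over x1: 2(x1 - x2) + 2*x1 = 0  ->  x1 = x2 / 2
    #   over x2: 2(x2 - x1) = 0         ->  x2 = x1
    for _ in range(sweeps):
        x1 = x2 / 2.0   # step (2) with s = 1
        x2 = x1         # step (2) with s = 2
    return x1, x2

x1, x2 = bcd_toy(1.0, 1.0)
```

Each sweep halves the iterate, so the whole sequence (not just a subsequence) converges to the stationary point here.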
Proof. Since X^0 is compact, an induction argument on r shows that x^{r+1} is defined, f(x^{r+1}) ≤ f(x^r), and x^{r+1} ∈ X^0 for all r = 0, 1, . . . . Thus, {x^r} is bounded. Consider any subsequence {x^r}_{r∈R}, with R ⊆ {0, 1, . . .}, converging to some z. For each j ∈ {1, . . . , T}, {x^{r−T+1+j}}_{r∈R} is bounded, so by passing to a subsequence, if necessary, we can assume that

{x^{r−T+1+j}}_{r∈R} converges to some z^j = (z^j_1, . . . , z^j_N),   j = 1, . . . , T.

Thus, z^{T−1} = z. Since {f(x^r)} converges monotonically and f is continuous on X^0, we obtain that

f(x^0) ≥ lim_{r→∞} f(x^r) = f(z^1) = · · · = f(z^T).   (6)

By further passing to a subsequence, if necessary, we can assume that the index s chosen at iteration r−T+1+j, j ∈ {1, . . . , T}, is the same for all r ∈ R, which we denote by s^j.

For each j ∈ {1, . . . , T}, since s^j is chosen at iteration r−T+1+j for r ∈ R, then (2) and (3) yield

f(x^{r−T+1+j}) ≤ f(x^{r−T+1+j} + (0, . . . , d_{s^j}, . . . , 0)),   ∀d_{s^j}, j = 1, . . . , T,

x^{r−T+1+j}_k = x^{r−T+j}_k,   ∀k ≠ s^j, j = 2, . . . , T.

Then, the continuity of f on X^0 yields in the limit that

f(z^j) ≤ f(z^j + (0, . . . , d_{s^j}, . . . , 0)),   ∀d_{s^j}, j = 1, . . . , T,   (7)

z^j_k = z^{j−1}_k,   ∀k ≠ s^j, j = 2, . . . , T.

Then, (6) and (7) yield

f(z^{j−1}) ≤ f(z^{j−1} + (0, . . . , d_{s^j}, . . . , 0)),   ∀d_{s^j}, j = 2, . . . , T.   (8)

(a), (b) Suppose that f is regular at every x ∈ X^0 and that f(x1, . . . , xN) is pseudoconvex in (xk, xi) for every i, k ∈ {s^1} ∪ · · · ∪ {s^{T−1}}. This holds under the assumption (a) or under the assumption (b), with {x^r}_{r∈R} being any convergent subsequence of {x^r}_{r ≡ (N−1) mod N}. We claim that, for j = 1, . . . , T−1,

f(z^j) ≤ f(z^j + (0, . . . , dk, . . . , 0)),   ∀dk, ∀k = s^1, . . . , s^j.   (9)

By (7), (9) holds for j = 1. Suppose that (9) holds for j = 1, . . . , l−1 for some l ∈ {2, . . . , T−1}. We show that (9) holds for j = l. From (8), we have that

f(z^{l−1}) ≤ f(z^{l−1} + (0, . . . , d_{s^l}, . . . , 0)),   ∀d_{s^l},

implying

f′(z^{l−1}; (0, . . . , z^l_{s^l} − z^{l−1}_{s^l}, . . . , 0)) ≥ 0.

Also, since (9) holds for j = l−1, we have that, for each k = s^1, . . . , s^{l−1},

f′(z^{l−1}; (0, . . . , dk, . . . , 0)) ≥ 0,   ∀dk.

Since by (6) z^{l−1} ∈ X^0, so f is regular at z^{l−1}, the above two relations imply

f′(z^{l−1}; (0, . . . , dk, . . . , 0) + (0, . . . , z^l_{s^l} − z^{l−1}_{s^l}, . . . , 0)) ≥ 0,   ∀dk.

Since f is pseudoconvex in (xk, x_{s^l}), this yields [also using z^l = z^{l−1} + (0, . . . , z^l_{s^l} − z^{l−1}_{s^l}, . . . , 0)] for k = s^1, . . . , s^{l−1} that

f(z^l + (0, . . . , dk, . . . , 0)) ≥ f(z^{l−1}) = f(z^l),   ∀dk.

Since we have also that (7) holds with j = l, we see that (9) holds for j = l. By induction, (9) holds for all j = 1, . . . , T−1.

Since z^{T−1} = z and (9) holds for j = T−1, then (4) holds for k = s^1, . . . , s^{T−1}. Since z^{T−1} = z and (8) holds (in particular, for j = T), then (4) holds for k = s^T also. Since

{1, . . . , N} = {s^1} ∪ · · · ∪ {s^T},

this implies that z is a coordinatewise minimum point of f. Since f is regular at z, then z is in fact a stationary point of f.

(c) Suppose that f(x1, . . . , xN) has at most one minimum in xk for k = s^2, . . . , s^{T−1}. This holds under the assumption (c), with {x^r}_{r∈R} being any convergent subsequence of {x^r}_{r ≡ (N−1) mod N}. For each j = 2, . . . , T−1, since