JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 109, No. 3, pp. 475–494, JUNE 2001

Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization^1

P. TSENG^2

Communicated by O. L. Mangasarian

Abstract. We study the convergence properties of a (block) coordinate descent method applied to minimize a nondifferentiable (nonconvex) function f(x1, ..., xN) with certain separability and regularity properties. Assuming that f is continuous on a compact level set, the subsequence convergence of the iterates to a stationary point is shown when either f is pseudoconvex in every pair of coordinate blocks from among N−1 coordinate blocks or f has at most one minimum in each of N−2 coordinate blocks. If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further. These results are applied to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut, and an algorithm of Han. They are applied also to a problem of blind source separation.

Key Words. Block coordinate descent, nondifferentiable minimization, stationary point, Gauss–Seidel method, convergence, quasiconvex functions, pseudoconvex functions.

^1 This work was partially supported by National Science Foundation Grant CCR-9731273.
^2 Professor, Department of Mathematics, University of Washington, Seattle, Washington.

0022-3239/01/0600-0475$19.50/0 © 2001 Plenum Publishing Corporation

1. Introduction

A popular method for minimizing a real-valued continuously differentiable function f of n real variables, subject to bound constraints, is the (block) coordinate descent method. In this method, the coordinates are partitioned into N blocks and, at each iteration, f is minimized with respect to one of the coordinate blocks while the other coordinates are held fixed. This method, which is closely related to the Gauss–Seidel and SOR methods for equation solving (Ref. 1), was studied early by Hildreth (Ref. 2) and Warga (Ref. 3), and is described in various books on optimization (Refs. 1 and 4–10). Its applications include channel capacity computation (Refs. 11–12), image reconstruction (Ref. 7), dynamic programming (Refs. 13–15), and flow routing (Ref. 16). It may be applied also to the dual of a linearly constrained, strictly convex program to obtain various decomposition methods (see Refs. 6–7, 17–22, and references therein) and parallel SOR methods (Ref. 23).

Convergence of the (block) coordinate descent method typically requires that f be strictly convex (or quasiconvex or hemivariate), differentiable, and, taking into account the bound constraints, have bounded level sets (e.g., Refs. 3–4 and 24–25). Zadeh (Ref. 26; also see Ref. 27) relaxed the strict convexity assumption to pseudoconvexity, which allows f to have a nonunique minimum along coordinate directions. For certain classes of convex functions, the level sets need not be bounded (see Refs. 2, 6–7, 17, 19–22, and references therein). If f is not (pseudo)convex, then an example of Powell (Ref. 28) shows that the method may cycle without approaching any stationary point of f. Nonetheless, convergence can be shown for special cases of non(pseudo)convex f, as when f is quadratic (Ref. 29), or f is strictly pseudoconvex in each of N−2 coordinate blocks (Ref. 27), or f has a unique minimum in each coordinate block (Ref. 8, p. 159).

If f is not differentiable, the coordinate descent method may get stuck at a nonstationary point even when f is convex (e.g., Ref. 4, p. 94). For this reason, the method is generally perceived to be unsuitable when f is nondifferentiable. However, an exception occurs when the nondifferentiable part of f is separable. Such a structure for f was first considered by Auslender (Ref. 4, p. 94) in the case where f is strongly convex. This structure is implicit in a decomposition method and a projection method of Han (Refs. 18, 30), for which f is the convex dual functional associated with a certain linearly constrained convex program (see Ref. 22 for detailed discussions). This structure arises also in least-square problems where an l1-penalty is placed on a subset of the parameters in order to minimize the support (see Refs. 31–33 and references therein).

Motivated by the preceding works, we consider in this paper the nondifferentiable (nonconvex) case where the nondifferentiable part of f is separable. Specifically, we assume that f has the following special form:

    f(x1, ..., xN) = f0(x1, ..., xN) + Σ_{k=1}^N fk(xk),    (1)

for some f0: ℜ^{n1+···+nN} → ℜ ∪ {∞} and some fk: ℜ^{nk} → ℜ ∪ {∞}, k = 1, ..., N. Here, N, n1, ..., nN are positive integers. We assume that f is proper, i.e., f ≢ ∞. We will refer to each xk, k = 1, ..., N, as a coordinate block of x = (x1, ..., xN). We will show that each cluster point of the iterates generated by the (block) coordinate descent method is a stationary point of f, provided that f0 has a certain smoothness property (see Lemma 3.1), f is continuous on a compact level set, and either f is pseudoconvex in every pair of coordinate blocks from among N−1 coordinate blocks, or f has at most one minimum in each of N−2 coordinate blocks (see Theorem 4.1). If f is quasiconvex and hemivariate in every coordinate block, then the assumptions of continuity of f and compactness of the level set may be relaxed further (see Proposition 5.1). These results unify and extend some previous results in Refs. 4, 6, 8, 26–27. For example, previous results assumed that f is pseudoconvex and that f1, ..., fN are indicator functions for closed convex sets, whereas we assume only that f is pseudoconvex in every pair of coordinate blocks from among N−1 coordinate blocks, with no additional assumption made on f1, ..., fN. Previous results also did not consider the case where f is not continuous on its effective domain. Lastly, we apply our results to derive new (and old) convergence results for the proximal minimization algorithm, an algorithm of Arimoto and Blahut (Refs. 11–12), and an algorithm of Han (Ref. 30); see Examples 6.1–6.3. We also apply them to a problem of blind source separation described in Refs. 31, 33; see Example 6.4.

In our notation, ℜ^m denotes the space of m-dimensional real column vectors. For any x, y ∈ ℜ^m, we denote by 〈x, y〉 the Euclidean inner product of x and y, and by ||x|| the Euclidean norm of x, i.e., ||x|| = √〈x, x〉. For any set S ⊆ ℜ^m, we denote by int(S) the interior of S and denote bdry(S) = S \ int(S). For any h: ℜ^m → ℜ ∪ {∞}, we denote by dom h the effective domain of h, i.e., dom h = {x ∈ ℜ^m | h(x) < ∞}. For any x ∈ dom h and any d ∈ ℜ^m, we denote the (lower) directional derivative of h at x in the direction d by

    h′(x; d) = liminf_{λ↓0} [h(x + λd) − h(x)] / λ.
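As a concrete illustration (not from the paper), the lower directional derivative h′(x; d) can be approximated by evaluating the difference quotient at a few shrinking step sizes; a minimal Python sketch, with all function names illustrative:

```python
# Numerical sketch of the lower directional derivative
# h'(x; d) = liminf_{lambda -> 0+} [h(x + lambda*d) - h(x)] / lambda,
# crudely approximated by the smallest difference quotient over a few
# shrinking step sizes (a true liminf cannot be computed in finitely many steps).

def directional_derivative(h, x, d, lams=(1e-2, 1e-4, 1e-6)):
    """Approximate h'(x; d) by difference quotients at shrinking step sizes."""
    return min((h([xi + lam * di for xi, di in zip(x, d)]) - h(x)) / lam
               for lam in lams)

# Example: h(x) = |x1| + |x2| at x = (0, 0); here h'(0; d) = |d1| + |d2|.
h = lambda x: abs(x[0]) + abs(x[1])
print(directional_derivative(h, [0.0, 0.0], [1.0, -1.0]))  # ~2.0
```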
We say that h is quasiconvex if

    h(x + λd) ≤ max{h(x), h(x + d)},  for all x, d and λ ∈ [0, 1];

h is pseudoconvex if

    h(x + d) ≥ h(x),  whenever x ∈ dom h and h′(x; d) ≥ 0

(see Ref. 34, p. 146); and h is hemivariate if h is not constant on any line segment belonging to dom h (Ref. 1). For any nonempty I ⊆ {1, ..., m}, we say that h(x1, ..., xm) is pseudoconvex [respectively, has at most one minimum point] in the coordinates xk, k ∈ I, if h, viewed as a function of these coordinates with the remaining coordinates held fixed, is pseudoconvex [respectively, has at most one minimum point].

2. Block Coordinate Descent Method

We formally describe the block coordinate descent (BCD) method below.

BCD Method.

Initialization. Choose any x^0 = (x^0_1, ..., x^0_N) ∈ dom f.

Iteration r+1, r ≥ 0. Given x^r = (x^r_1, ..., x^r_N) ∈ dom f, choose an index s ∈ {1, ..., N} and compute a new iterate x^{r+1} = (x^{r+1}_1, ..., x^{r+1}_N) ∈ dom f satisfying

    x^{r+1}_s ∈ arg min_{xs} f(x^r_1, ..., x^r_{s−1}, xs, x^r_{s+1}, ..., x^r_N),    (2)
    x^{r+1}_j = x^r_j,  ∀ j ≠ s.    (3)

We note that the minimization in (2) is attained if

    X^0 = {x: f(x) ≤ f(x^0)}

is bounded and f is lower semicontinuous (lsc) on X^0, so that X^0 is compact (Ref. 35). Alternatively, this minimization is attained if f is convex, has a minimum point, and is hemivariate in each coordinate block (but the level sets of f need not be bounded). To ensure convergence, we need further that each coordinate block is chosen sufficiently often in the method. In particular, we will choose the coordinate blocks according to the following rule (see, e.g., Refs. 7–8, 21, 25).

Essentially Cyclic Rule. There exists a constant T ≥ N such that every index s ∈ {1, ..., N} is chosen at least once between the rth iteration and the (r+T−1)th iteration, for all r.

A well-known special case of this rule, for which T = N, is given below.

Cyclic Rule. Choose s = k at iterations k, k+N, k+2N, ..., for k = 1, ..., N.
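To make the iteration (2)–(3) concrete, the following sketch (not from the paper; all names illustrative) runs the BCD method with the cyclic rule on a function of the form (1), taking f0 to be a strongly convex quadratic and fk(xk) = wk|xk| with scalar blocks (nk = 1); the blockwise minimization (2) then has a closed-form soft-thresholding solution:

```python
import numpy as np

def soft_threshold(t, w):
    """The minimizer of 0.5*a*x^2 - t*x + w*|x| over x is soft_threshold(t, w)/a."""
    return np.sign(t) * max(abs(t) - w, 0.0)

def bcd(A, b, w, iters=100):
    """Cyclic BCD for f(x) = 0.5 x'Ax - b'x + sum_k w_k |x_k|,
    i.e., form (1) with smooth f0 and separable nonsmooth fk.
    Each step solves the one-dimensional subproblem (2) exactly."""
    x = np.zeros(len(b))                       # x^0 in dom f
    for r in range(iters):
        s = r % len(b)                         # cyclic rule: block s at iteration r
        t = b[s] - A[s] @ x + A[s, s] * x[s]   # linear term with other blocks fixed
        x[s] = soft_threshold(t, w[s]) / A[s, s]
    return x

A = np.array([[2.0, 0.5], [0.5, 1.0]])         # positive definite
b = np.array([1.0, 0.1])
x = bcd(A, b, np.array([0.3, 0.3]))            # converges to (0.35, 0)
```

This is the l1-penalized least-squares structure mentioned in the introduction; since f0 here is strongly convex, each subproblem (2) has a unique minimizer, so the iteration is well defined.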
3. Stationary Points of f

We say that z is a stationary point of f if z ∈ dom f and f′(z; d) ≥ 0 for all d. We say that z is a coordinatewise minimum point of f if z ∈ dom f and

    f(z + (0, ..., dk, ..., 0)) ≥ f(z),  ∀ dk ∈ ℜ^{nk},    (4)

for all k = 1, ..., N. Here and throughout, we denote by (0, ..., dk, ..., 0) the vector in ℜ^{n1+···+nN} whose kth coordinate block is dk and whose other coordinates are zero. We say that f is regular at z ∈ dom f if

    f′(z; d) ≥ 0, for all d = (d1, ..., dN) such that f′(z; (0, ..., dk, ..., 0)) ≥ 0, k = 1, ..., N.    (5)

This notion of regularity is weaker than that used by Auslender (Ref. 4, p. 93), which entails

    f′(z; d) = Σ_{k=1}^N f′(z; (0, ..., dk, ..., 0)),  for all d = (d1, ..., dN).

For example, the function

    f(x1, x2) = φ(x1, x2) + φ(−x1, x2) + φ(x1, −x2) + φ(−x1, −x2),

where φ(a, b) = max{0, a + b − √(a² + b²)}, is regular at z = (0, 0) in the sense of (5), but is not regular in the sense of Ref. 4, p. 93. Since (4) implies f′(z; (0, ..., dk, ..., 0)) ≥ 0 for all dk, it follows that a coordinatewise minimum point z of f is a stationary point of f whenever f is regular at z. To ensure regularity of f at z, we consider one of the following smoothness assumptions on f0:

(A1) dom f0 is open and f0 is Gâteaux-differentiable on dom f0.
(A2) f0 is Gâteaux-differentiable on int(dom f0) and, for every z ∈ dom f ∩ bdry(dom f0), there exist k ∈ {1, ..., N} and dk ∈ ℜ^{nk} such that f(z + (0, ..., dk, ..., 0)) < f(z).
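The example following (5) can be checked numerically. Since φ is nonnegative and positively homogeneous of degree 1, so is f; hence f′((0, 0); d) = f(d) for every d. Along each coordinate direction f vanishes, so both coordinatewise directional derivatives at (0, 0) are zero, and f ≥ 0 gives f′((0, 0); d) ≥ 0 for all d, which is regularity in the sense of (5); but f(1, 1) = 2 − √2 > 0, so the sum formula of Ref. 4, p. 93 fails. A small sketch (function names illustrative):

```python
import math

def phi(a, b):
    # phi(a, b) = max{0, a + b - sqrt(a^2 + b^2)}
    return max(0.0, a + b - math.hypot(a, b))

def f(x1, x2):
    # Example following (5): regular at (0, 0) in the sense of (5),
    # but not in the stronger sense of Ref. 4, p. 93.
    return phi(x1, x2) + phi(-x1, x2) + phi(x1, -x2) + phi(-x1, -x2)

# f is positively homogeneous of degree 1, so f'((0,0); d) = f(d).
print(f(1.0, 0.0), f(0.0, 1.0))   # coordinate directions: both 0.0
print(f(1.0, 1.0))                 # 2 - sqrt(2) > 0: the sum rule fails
```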
Assumption A1 was considered essentially by Auslender (Ref. 4, Example 2 on p. 94). In contrast to Assumption A1, Assumption A2 allows dom f0 to include boundary points. We will see an application (Example 6.2) where A2 holds but not A1.

Lemma 3.1. Under A1, f is regular at each z ∈ dom f. Under A2, f is regular at each coordinatewise minimum point z of f.

Proof. Under A1, if z = (z1, ..., zN) ∈ dom f, then z ∈ dom f0. Under A2, if z = (z1, ..., zN) is a coordinatewise minimum point of f, then z ∉ bdry(dom f0), so z ∈ int(dom f0). Thus, under either A1 or A2, f0 is Gâteaux-differentiable at z. Fix any d = (d1, ..., dN) such that f′(z; (0, ..., dk, ..., 0)) ≥ 0, k = 1, ..., N. Then,

    f′(z; d) = 〈∇f0(z), d〉 + liminf_{λ↓0} Σ_{k=1}^N [fk(zk + λdk) − fk(zk)]/λ
             ≥ 〈∇f0(z), d〉 + Σ_{k=1}^N liminf_{λ↓0} [fk(zk + λdk) − fk(zk)]/λ
             = 〈∇f0(z), d〉 + Σ_{k=1}^N f′k(zk; dk)
             = Σ_{k=1}^N f′(z; (0, ..., dk, ..., 0))
             ≥ 0.  □

4. Convergence Analysis: I

Our first convergence result unifies and extends a result of Auslender (Ref. 4, p. 95) for the nondifferentiable convex case and some results of Grippo and Sciandrone (Ref. 27), Luenberger (Ref. 8, p. 159), and Zadeh (Ref. 26) for the differentiable case. In what follows, r ≡ (N−1) mod N means r = N−1, 2N−1, 3N−1, ....

Theorem 4.1. Assume that the level set X^0 = {x: f(x) ≤ f(x^0)} is compact and that f is continuous on X^0. Then, the sequence {x^r = (x^r_1, ..., x^r_N)}_{r=0,1,...} generated by the BCD method using the essentially cyclic rule is defined and bounded. Moreover, the following statements hold:

(a) If f(x1, ..., xN) is pseudoconvex in (xk, xi) for every i, k ∈ {1, ..., N}, and if f is regular at every x ∈ X^0, then every cluster point of {x^r} is a stationary point of f.

(b) If f(x1, ..., xN) is pseudoconvex in (xk, xi) for every i, k ∈ {1, ..., N−1}, if f is regular at every x ∈ X^0, and if the cyclic rule is used, then every cluster point of {x^r}_{r ≡ (N−1) mod N} is a stationary point of f.

(c) If f(x1, ..., xN) has at most one minimum in xk for k = 2, ..., N−1, and if the cyclic rule is used, then every cluster point z of {x^r}_{r ≡ (N−1) mod N} is a coordinatewise minimum point of f. In addition, if f is regular at z, then z is a stationary point of f.

Proof. Since X^0 is compact, an induction argument on r shows that x^{r+1} is defined, f(x^{r+1}) ≤ f(x^r), and x^{r+1} ∈ X^0 for all r = 0, 1, .... Thus, {x^r} is bounded. Consider any subsequence {x^r}_{r∈R}, with R ⊆ {0, 1, ...}, converging to some z. For each j ∈ {1, ..., T}, {x^{r−T+1+j}}_{r∈R} is bounded, so by passing to a subsequence, if necessary, we can assume that {x^{r−T+1+j}}_{r∈R} converges to some z^j = (z^j_1, ..., z^j_N), j = 1, ..., T. Thus, z^{T−1} = z. Since {f(x^r)} converges monotonically and f is continuous on X^0, we obtain that

    f(x^0) ≥ lim_{r→∞} f(x^r) = f(z^1) = ··· = f(z^T).    (6)

By further passing to a subsequence, if necessary, we can assume that the index s chosen at iteration r−T+1+j, j ∈ {1, ..., T}, is the same for all r ∈ R, which we denote by s^j. For each j ∈ {1, ..., T}, since s^j is chosen at iteration r−T+1+j for r ∈ R, then (2) and (3) yield

    f(x^{r−T+1+j}) ≤ f(x^{r−T+1+j} + (0, ..., d_{s^j}, ..., 0)),  ∀ d_{s^j}, j = 1, ..., T,
    x^{r−T+1+j}_k = x^{r−T+j}_k,  ∀ k ≠ s^j, j = 2, ..., T.

Then, the continuity of f on X^0 yields in the limit that

    f(z^j) ≤ f(z^j + (0, ..., d_{s^j}, ..., 0)),  ∀ d_{s^j}, j = 1, ..., T,    (7)
    z^j_k = z^{j−1}_k,  ∀ k ≠ s^j, j = 2, ..., T.
Then, (6) and (7) yield

    f(z^{j−1}) ≤ f(z^{j−1} + (0, ..., d_{s^j}, ..., 0)),  ∀ d_{s^j}, j = 2, ..., T.    (8)

(a), (b) Suppose that f is regular at every x ∈ X^0 and that f(x1, ..., xN) is pseudoconvex in (xk, xi) for every i, k ∈ {s^1} ∪ ··· ∪ {s^{T−1}}. This holds under the assumption (a) or under the assumption (b), with {x^r}_{r∈R} being any convergent subsequence of {x^r}_{r ≡ (N−1) mod N}. We claim that, for j = 1, ..., T−1,

    f(z^j) ≤ f(z^j + (0, ..., dk, ..., 0)),  ∀ dk, ∀ k = s^1, ..., s^j.    (9)

By (7), (9) holds for j = 1. Suppose that (9) holds for j = 1, ..., l−1 for some l ∈ {2, ..., T−1}. We show that (9) holds for j = l. From (8), we have that

    f(z^{l−1}) ≤ f(z^{l−1} + (0, ..., d_{s^l}, ..., 0)),  ∀ d_{s^l},

implying

    f′(z^{l−1}; (0, ..., z^l_{s^l} − z^{l−1}_{s^l}, ..., 0)) ≥ 0.

Also, since (9) holds for j = l−1, we have that, for each k = s^1, ..., s^{l−1},

    f′(z^{l−1}; (0, ..., dk, ..., 0)) ≥ 0,  ∀ dk.

Since by (6) z^{l−1} ∈ X^0, so f is regular at z^{l−1}, the above two relations imply

    f′(z^{l−1}; (0, ..., dk, ..., 0) + (0, ..., z^l_{s^l} − z^{l−1}_{s^l}, ..., 0)) ≥ 0,  ∀ dk.

Since f is pseudoconvex in (xk, x_{s^l}), this yields [also using z^l = z^{l−1} + (0, ..., z^l_{s^l} − z^{l−1}_{s^l}, ..., 0)] for k = s^1, ..., s^{l−1} that

    f(z^l + (0, ..., dk, ..., 0)) ≥ f(z^{l−1}) = f(z^l),  ∀ dk.

Since we have also that (7) holds with j = l, we see that (9) holds for j = l. By induction, (9) holds for all j = 1, ..., T−1.

Since z^{T−1} = z and (9) holds for j = T−1, then (4) holds for k = s^1, ..., s^{T−1}. Since z^{T−1} = z and (8) holds (in particular, for j = T), then (4) holds for k = s^T also. Since {1, ..., N} = {s^1} ∪ ··· ∪ {s^T}, this implies that z is a coordinatewise minimum point of f. Since f is regular at z, then z is in fact a stationary point of f.

(c) Suppose that f(x1, ..., xN) has at most one minimum in xk for k = s^2, ..., s^{T−1}. This holds under the assumption (c), with {x^r}_{r∈R} being any convergent subsequence of {x^r}_{r ≡ (N−1) mod N}. For each j = 2, ..., T−1, since