Algorithms for large-scale convex optimization — DTU 2010
3. Proximal gradient method
• introduction
• proximal mapping
• proximal gradient method
• convergence analysis
• accelerated proximal gradient method
• forward-backward method
Proximal mapping
the proximal mapping (or proximal operator) of a convex function h is
prox_h(x) = argmin_u ( h(u) + (1/2) ‖u − x‖₂² )
examples
• h(x) = 0: prox_h(x) = x

• h(x) = I_C(x) (indicator function of C): prox_h is projection on C

  prox_h(x) = P_C(x) = argmin_{u∈C} ‖u − x‖₂²

• h(x) = t‖x‖₁: prox_h is shrinkage (soft threshold) operation

  prox_h(x)_i =
    x_i − t,    x_i ≥ t
    0,          |x_i| ≤ t
    x_i + t,    x_i ≤ −t
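these three mappings are easy to compute directly; a small sketch in NumPy (the helper names and the box {x : l ≤ x ≤ u} used for C are illustrative choices, not from the slides):

import numpy as np

def prox_zero(x):
    # h(x) = 0: the proximal mapping is the identity
    return x

def prox_box(x, l, u):
    # h = indicator of the box {x : l <= x <= u}: prox is projection onto the box
    return np.clip(x, l, u)

def prox_l1(x, t):
    # h(x) = t*||x||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

print(prox_l1(np.array([1.5, 0.3, -0.8]), 0.5))   # [ 1.   0.  -0.3]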
Proximal gradient method
unconstrained problem with cost function split in two components
minimize   f(x) = g(x) + h(x)

• g convex, differentiable, with dom g = R^n
• h closed, convex, possibly nondifferentiable; prox_h is inexpensive
proximal gradient algorithm
x(k) = prox_{t_k h}( x(k−1) − t_k ∇g(x(k−1)) )

t_k > 0 is step size, constant or determined by line search
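as a concrete sketch of the iteration with a constant step size (grad_g and prox_h are assumed user-supplied callables, with prox_h(v, t) computing prox_{th}(v), matching the soft-threshold helper above):

def proximal_gradient(grad_g, prox_h, x0, t, iters=100):
    # iterate x(k) = prox_{t h}( x(k-1) - t * grad g(x(k-1)) )
    x = x0
    for _ in range(iters):
        x = prox_h(x - t * grad_g(x), t)
    return x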
Interpretation
x+ = prox_{th}( x − t∇g(x) )
from definition of proximal operator:
x+ = argmin_u ( h(u) + (1/(2t)) ‖u − x + t∇g(x)‖₂² )

   = argmin_u ( h(u) + g(x) + ∇g(x)ᵀ(u − x) + (1/(2t)) ‖u − x‖₂² )
x+ minimizes h(u) plus a simple quadratic local model of g(u) around x
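a one-dimensional check of this equivalence (an illustrative sketch with g(x) = x² and h(u) = |u|, chosen arbitrarily): the proximal gradient step coincides with brute-force minimization of the local model

import numpy as np

g, dg = lambda x: x**2, lambda x: 2*x     # g and its gradient
h = lambda u: np.abs(u)                   # h(u) = |u|, so prox_{th} is soft-thresholding

x, t = 1.0, 0.4
v = x - t * dg(x)
x_plus = np.sign(v) * max(abs(v) - t, 0.0)            # prox_{th}(x - t*dg(x))

u = np.linspace(-3.0, 3.0, 200001)                    # grid search over the local model
model = h(u) + g(x) + dg(x) * (u - x) + (u - x)**2 / (2*t)
print(x_plus, u[np.argmin(model)])                    # both give 0.0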
Examples
minimize   g(x) + h(x)
gradient method: h(x) = 0, i.e., minimize g(x)
x(k) = x(k−1) − t_k ∇g(x(k−1))
gradient projection method: h(x) = I_C(x), i.e., minimize g(x) over C

x(k) = P_C( x(k−1) − t_k ∇g(x(k−1)) )
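for instance, gradient projection for a simple quadratic over the box [0, 1]^n (a toy instance made up for illustration):

import numpy as np

# minimize g(x) = (1/2)||x - c||^2 over C = [0,1]^n; grad g(x) = x - c, P_C = clip
c = np.array([1.8, -0.4, 0.5])
x = np.zeros(3)
t = 0.5
for _ in range(50):
    x = np.clip(x - t * (x - c), 0.0, 1.0)   # x(k) = P_C( x(k-1) - t*grad g )
print(x)   # converges to [1.0, 0.0, 0.5], the projection of c onto the box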
iterative soft-thresholding: h(x) = ‖x‖₁, i.e., minimize g(x) + ‖x‖₁

x(k) = prox_{t_k h}( x(k−1) − t_k ∇g(x(k−1)) )

where

prox_{th}(u)_i =
  u_i − t,    u_i ≥ t
  0,          −t ≤ u_i ≤ t
  u_i + t,    u_i ≤ −t
[figure: graph of prox_{th}(u)_i versus u_i; zero on the interval [−t, t], slope 1 outside]
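a minimal ISTA sketch for the common case g(x) = (1/2)‖Ax − b‖₂², so ∇g(x) = Aᵀ(Ax − b), with h(x) = λ‖x‖₁ (the weight λ and the random data are assumptions for illustration; the constant step 1/‖A‖₂² is one standard safe choice):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 1.0

t = 1.0 / np.linalg.norm(A, 2) ** 2              # 1/L with L = lambda_max(A^T A)
x = np.zeros(10)
for _ in range(500):
    v = x - t * A.T @ (A @ x - b)                # gradient step on g
    x = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)   # soft-threshold at t*lam
print(x)   # typically sparse: many components exactly zero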
Definition
proximal mapping associated with closed convex h
prox_h(x) = argmin_u ( h(u) + (1/2) ‖u − x‖₂² )

it can be shown that prox_h(x) exists and is unique for all x
subgradient characterization
from optimality conditions of minimization in the definition:
u = prox_h(x) ⇐⇒ x − u ∈ ∂h(u)
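this characterization can be checked numerically for h(u) = ‖u‖₁, where ∂h(u)_i is {sign(u_i)} for u_i ≠ 0 and the interval [−1, 1] for u_i = 0 (a small sketch reusing the soft-thresholding formula with t = 1):

import numpy as np

x = np.array([2.0, 0.4, -1.5])
u = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)   # u = prox_h(x) for h = ||.||_1
r = x - u                                           # should lie in the subdifferential at u
ok = np.all(np.where(u != 0, r == np.sign(u), np.abs(r) <= 1.0))
print(u, r, bool(ok))   # [ 1.   0.  -0.5] [ 1.   0.4 -1. ] True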