A SINGULAR VALUE THRESHOLDING ALGORITHM FOR
MATRIX COMPLETION
JIAN-FENG CAI∗, EMMANUEL J. CANDÈS†, AND ZUOWEI SHEN‡
∗Department of Mathematics, University of California, Los Angeles CA 90095
†Applied and Computational Mathematics, Caltech, Pasadena CA 91125
‡Department of Mathematics, National University of Singapore, Singapore 117543
Abstract. This paper introduces a novel algorithm to approximate the matrix with minimum
nuclear norm among all matrices obeying a set of convex constraints. This problem may be un-
derstood as the convex relaxation of a rank minimization problem, and arises in many important
applications as in the task of recovering a large matrix from a small subset of its entries (the famous
Netflix problem). Off-the-shelf algorithms such as interior point methods are not directly amenable
to large problems of this kind with over a million unknown entries.
This paper develops a simple first-order and easy-to-implement algorithm that is extremely effi-
cient at addressing problems in which the optimal solution has low rank. The algorithm is iterative
and produces a sequence of matrices {X^k, Y^k} and at each step, mainly performs a soft-thresholding
operation on the singular values of the matrix Y^k. There are two remarkable features making this
attractive for low-rank matrix completion problems. The first is that the soft-thresholding
operation is applied to a sparse matrix; the second is that the rank of the iterates {X^k} is empirically
nondecreasing. Both these facts allow the algorithm to make use of very minimal storage space and
keep the computational cost of each iteration low. On the theoretical side, we provide a convergence
analysis showing that the sequence of iterates converges. On the practical side, we provide numerical
examples in which 1,000 × 1,000 matrices are recovered in less than a minute on a modest desktop
computer. We also demonstrate that our approach is amenable to very large scale problems by recov-
ering matrices of rank about 10 with nearly a billion unknowns from just about 0.4% of their sampled
entries. Our methods are connected with the recent literature on linearized Bregman iterations for
ℓ₁ minimization, and we develop a framework in which one can understand these algorithms in terms
of well-known Lagrange multiplier algorithms.
Keywords. Nuclear norm minimization, matrix completion, singular value thresh-
olding, Lagrange dual function, Uzawa’s algorithm, and linearized Bregman iteration.
1. Introduction.
1.1. Motivation. There is a rapidly growing interest in the recovery of an un-
known low-rank or approximately low-rank matrix from very limited information.
This problem occurs in many areas of engineering and applied science such as ma-
chine learning [2–4], control [54] and computer vision, see [62]. As a motivating
example, consider the problem of recovering a data matrix from a sampling of its
entries. This routinely comes up whenever one collects partially filled out surveys,
and one would like to infer the many missing entries. In the area of recommender
systems, users submit ratings on a subset of entries in a database, and the vendor
provides recommendations based on the user’s preferences. Because users only rate a
few items, one would like to infer their preference for unrated items; this is the famous
Netflix problem [1]. Recovering a rectangular matrix from a sampling of its entries
is known as the matrix completion problem. The issue is of course that this problem
is extraordinarily ill posed since with fewer samples than entries, we have infinitely
many completions. Therefore, it is apparently impossible to identify which of these
candidate solutions is indeed the “correct” one without some additional information.
In many instances, however, the matrix we wish to recover has low rank or ap-
proximately low rank. For instance, the Netflix data matrix of all user-ratings may
be approximately low-rank because it is commonly believed that only a few factors
contribute to anyone’s taste or preference. In computer vision, inferring scene geom-
etry and camera motion from a sequence of images is a well-studied problem known
as the structure-from-motion problem. This is an ill-conditioned problem, since objects
may be distant with respect to their size and, especially, because of “missing data” which
occur due to occlusion or tracking failures. However, when properly stacked and
indexed, these images form a matrix which has very low rank (e.g. rank 3 under or-
thography) [24,62]. Other examples of low-rank matrix fitting abound; e.g. in control
(system identification), machine learning (multi-class learning) and so on. Having said
this, the premise that the unknown has (approximately) low rank radically changes
the problem, making the search for solutions feasible since the lowest-rank solution
now tends to be the right one.
In a recent paper [16], Candès and Recht showed that matrix completion is not
as ill-posed as people thought. Indeed, they proved that most low-rank matrices can
be recovered exactly from most sets of sampled entries even though these sets have
surprisingly small cardinality, and more importantly, they proved that this can be
done by solving a simple convex optimization problem. To state their results, suppose
for simplicity that the unknown matrix M ∈ R^{n×n} is square, and that one has available
m sampled entries {M_ij : (i, j) ∈ Ω}, where Ω is a random subset of cardinality m.
Then [16] proves that most matrices M of rank r can be perfectly recovered by solving
the optimization problem
    minimize    ‖X‖_*
    subject to  X_ij = M_ij,   (i, j) ∈ Ω,                               (1.1)

provided that the number of samples obeys m ≥ C n^{6/5} r log n for some positive numerical
constant C (note that an n × n matrix of rank r depends upon r(2n − r) degrees of freedom).
In (1.1), the functional ‖X‖_* is the nuclear norm of the matrix X, which is the sum of
its singular values. The optimization problem (1.1) is convex
and can be recast as a semidefinite program [34,35]. In some sense, this is the tightest
convex relaxation of the NP-hard rank minimization problem
    minimize    rank(X)
    subject to  X_ij = M_ij,   (i, j) ∈ Ω,                               (1.2)
since the nuclear ball {X : ‖X‖_* ≤ 1} is the convex hull of the set of rank-one
matrices with spectral norm bounded by one. Another interpretation of Candès and
Recht’s result is that under suitable conditions, the rank minimization program (1.2)
and the convex program (1.1) are formally equivalent in the sense that they have
exactly the same unique solution.
1.2. Algorithm outline. Because minimizing the nuclear norm both provably
recovers the lowest-rank matrix subject to constraints (see [57] for related results) and
gives generally good empirical results in a variety of situations, it is understandably of
great interest to develop numerical methods for solving (1.1). In [16], this optimization
problem was solved using one of the most advanced semidefinite programming solvers,
namely, SDPT3 [60]. This solver and others like SeDuMi are based on interior-point
methods, and are problematic when the size of the matrix is large because they need
to solve huge systems of linear equations to compute the Newton direction. In fact,
SDPT3 can only handle n × n matrices with n ≤ 100. Presumably, one could resort
to iterative solvers such as the method of conjugate gradients to solve for the Newton
step but this is problematic as well since it is well known that the condition number
of the Newton system increases rapidly as one gets closer to the solution. In addition,
none of these general purpose solvers use the fact that the solution may have low
rank. We refer the reader to [50] for some recent progress on interior-point methods
concerning some special nuclear norm-minimization problems.
This paper develops the singular value thresholding algorithm for approximately
solving the nuclear norm minimization problem (1.1) and by extension, problems of
the form
    minimize    ‖X‖_*
    subject to  A(X) = b,                                                (1.3)

where A is a linear operator acting on the space of n₁ × n₂ matrices and b ∈ R^m. This
algorithm is a simple first-order method, and is especially well suited for problems of
very large sizes in which the solution has low rank. We sketch this algorithm in the
special matrix completion setting and let PΩ be the orthogonal projector onto the
span of matrices vanishing outside of Ω so that the (i, j)th component of PΩ(X) is
equal to Xij if (i, j) ∈ Ω and zero otherwise. Our problem may be expressed as
    minimize    ‖X‖_*
    subject to  PΩ(X) = PΩ(M),                                           (1.4)
with optimization variable X ∈ R^{n₁×n₂}. Fix τ > 0 and a sequence {δ_k}_{k≥1} of scalar
step sizes. Then starting with Y^0 = 0 ∈ R^{n₁×n₂}, the algorithm inductively defines

    X^k = shrink(Y^{k−1}, τ),
    Y^k = Y^{k−1} + δ_k PΩ(M − X^k)                                      (1.5)
until a stopping criterion is reached. In (1.5), shrink(Y, τ) is a nonlinear function
which applies a soft-thresholding rule at level τ to the singular values of the input
matrix, see Section 2 for details. The key property here is that for large values of τ,
the sequence {X^k} converges to a solution which very nearly minimizes (1.4). Hence,
at each step, one only needs to compute at most one singular value decomposition
and perform a few elementary matrix additions. Two important remarks are in order:
1. Sparsity. For each k ≥ 0, Y^k vanishes outside of Ω and is, therefore, sparse,
a fact which can be used to evaluate the shrink function rapidly.
2. Low-rank property. The matrices X^k turn out to have low rank, and hence
the algorithm has a minimal storage requirement since we only need to keep
the principal factors in memory.
Our numerical experiments demonstrate that the proposed algorithm can solve
problems, in Matlab, involving matrices of size 30,000 × 30,000 having close to a
billion unknowns in 17 minutes on a standard desktop computer with a 1.86 GHz
CPU (dual core with Matlab’s multithreading option enabled) and 3 GB of memory.
As a consequence, the singular value thresholding algorithm may become a rather
powerful computational tool for large scale matrix completion.
1.3. General formulation. The singular value thresholding algorithm can be
adapted to deal with other types of convex constraints. For instance, it may address
problems of the form
    minimize    ‖X‖_*
    subject to  f_i(X) ≤ 0,   i = 1, . . . , m,                          (1.6)
where each fi is a Lipschitz convex function (note that one can handle linear equality
constraints by considering pairs of affine functionals). In the simpler case where the
fi’s are affine functionals, the general algorithm goes through a sequence of iterations
which greatly resemble (1.5). This is useful because it enables the development of
numerical algorithms which are effective for recovering matrices from a small subset
of sampled entries possibly contaminated with noise.
1.4. Contents and notations. The rest of the paper is organized as follows.
In Section 2, we derive the singular value thresholding (SVT) algorithm for the ma-
trix completion problem, and recast it in terms of a well-known Lagrange multiplier
algorithm. In Section 3, we extend the SVT algorithm and formulate a general itera-
tion which is applicable to general convex constraints. In Section 4, we establish the
convergence results for the iterations given in Sections 2 and 3. We demonstrate the
performance and effectiveness of the algorithm through numerical examples in Section
5, and review additional implementation details. Finally, we conclude the paper with
a short discussion in Section 6.
Before continuing, we provide here a brief summary of the notations used through-
out the paper. Matrices are bold capital, vectors are bold lowercase and scalars or
entries are not bold. For instance, X is a matrix and Xij its (i, j)th entry. Likewise,
x is a vector and x_i its ith component. The nuclear norm of a matrix is denoted by
‖X‖_*, the Frobenius norm by ‖X‖_F and the spectral norm by ‖X‖_2; note that these
are respectively the ℓ₁-norm, the ℓ₂-norm and the sup-norm of the vector of singular
values. The adjoint of a matrix X is X^* and similarly for vectors. The notation diag(x),
where x is a vector, stands for the diagonal matrix with {x_i} as diagonal elements.
We denote by ⟨X, Y⟩ = trace(X^* Y) the standard inner product between two matrices
(so that ‖X‖²_F = ⟨X, X⟩). The Cauchy–Schwarz inequality gives ⟨X, Y⟩ ≤ ‖X‖_F ‖Y‖_F,
and it is well known that we also have ⟨X, Y⟩ ≤ ‖X‖_* ‖Y‖_2 (the spectral and nuclear
norms are dual to one another); see e.g. [16, 57].
2. The Singular Value Thresholding Algorithm. This section introduces
the singular value thresholding algorithm and discusses some of its basic properties.
We begin with the definition of a key building block, namely, the singular value
thresholding operator.
2.1. The singular value shrinkage operator. Consider the singular value
decomposition (SVD) of a matrix X ∈ R^{n₁×n₂} of rank r

    X = UΣV^*,   Σ = diag({σ_i}_{1≤i≤r}),                                (2.1)

where U and V are respectively n₁ × r and n₂ × r matrices with orthonormal columns,
and the singular values σi are positive (unless specified otherwise, we will always
assume that the SVD of a matrix is given in the reduced form above). For each τ ≥ 0,
we introduce the soft-thresholding operator D_τ defined as follows:

    D_τ(X) := U D_τ(Σ) V^*,   D_τ(Σ) = diag({(σ_i − τ)_+}),              (2.2)

where t_+ is the positive part of t, namely, t_+ = max(0, t). In words, this operator
simply applies a soft-thresholding rule to the singular values of X, effectively shrinking
these towards zero. This is the reason why we will also refer to this transformation
as the singular value shrinkage operator. Even though the SVD may not be unique,
it is easy to see that the singular value shrinkage operator is well defined and we
do not elaborate further on this issue. In some sense, this shrinkage operator is a
straightforward extension of the soft-thresholding rule for scalars and vectors. In
particular, note that if many of the singular values of X are below the threshold
τ, the rank of Dτ (X) may be considerably lower than that of X, just like the soft-
thresholding rule applied to vectors leads to sparser outputs whenever some entries
of the input are below threshold.
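To make the shrinkage operator concrete, here is a minimal illustrative sketch in
Python/NumPy (our own code, not the authors' implementation; the function name
svt_shrink is ours):

    import numpy as np

    def svt_shrink(X, tau):
        """Singular value shrinkage operator D_tau of (2.2) (dense, illustrative)."""
        # Reduced SVD of X, then soft-threshold the singular values at level tau.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt
        s_shrunk = np.maximum(s - tau, 0.0)                # (sigma_i - tau)_+
        return (U * s_shrunk) @ Vt                         # U diag((s - tau)_+) Vt

For instance, if the singular values of X are 5, 2 and 0.5 and τ = 1, the output has
singular values 4, 1 and 0, and hence rank two.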
The singular value thresholding operator is the proximity operator associated with
the nuclear norm. Details about the proximity operator can be found in e.g. [42].
Theorem 2.1. For each τ ≥ 0 and Y ∈ R^{n₁×n₂}, the singular value shrinkage
operator (2.2) obeys

    D_τ(Y) = arg min_X  { ½ ‖X − Y‖²_F + τ ‖X‖_* }.                      (2.3)

(One reviewer pointed out that a similar result had been mentioned in a talk given by
Donald Goldfarb at the Foundations of Computational Mathematics conference which
took place in Hong Kong in June 2008.)
Proof. Since the function h_0(X) := τ‖X‖_* + ½‖X − Y‖²_F is strictly convex, it is
easy to see that there exists a unique minimizer, and we thus need to prove that it is
equal to D_τ(Y). To do this, recall the definition of a subgradient of a convex function
f : R^{n₁×n₂} → R. We say that Z is a subgradient of f at X_0, denoted Z ∈ ∂f(X_0), if

    f(X) ≥ f(X_0) + ⟨Z, X − X_0⟩                                         (2.4)

for all X. Now X̂ minimizes h_0 if and only if 0 is a subgradient of the functional h_0
at the point X̂, i.e.

    0 ∈ X̂ − Y + τ ∂‖X̂‖_*,                                               (2.5)

where ∂‖X̂‖_* is the set of subgradients of the nuclear norm. Let X ∈ R^{n₁×n₂} be an
arbitrary matrix and UΣV^* be its SVD. It is known [16, 46, 65] that

    ∂‖X‖_* = { UV^* + W : W ∈ R^{n₁×n₂}, U^*W = 0, W V = 0, ‖W‖_2 ≤ 1 }.  (2.6)

Set X̂ := D_τ(Y) for short. In order to show that X̂ obeys (2.5), decompose the
SVD of Y as

    Y = U_0 Σ_0 V_0^* + U_1 Σ_1 V_1^*,

where U_0, V_0 (resp. U_1, V_1) are the singular vectors associated with singular values
greater than τ (resp. smaller than or equal to τ). With these notations, we have
X̂ = U_0 (Σ_0 − τI) V_0^* and, therefore,

    Y − X̂ = τ(U_0 V_0^* + W),   W = τ^{−1} U_1 Σ_1 V_1^*.

By definition, U_0^* W = 0, W V_0 = 0 and since the diagonal elements of Σ_1 have
magnitudes bounded by τ, we also have ‖W‖_2 ≤ 1. Hence Y − X̂ ∈ τ ∂‖X̂‖_*, which
concludes the proof.
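As a quick numerical sanity check of Theorem 2.1 (not part of the paper), one can
compare the closed-form shrinkage against a generic convex solver; the sketch below
assumes the third-party CVXPY package is available:

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    Y = rng.standard_normal((8, 6))
    tau = 1.0

    # Closed form D_tau(Y) from (2.2): soft-threshold the singular values of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    D_tau_Y = (U * np.maximum(s - tau, 0.0)) @ Vt

    # Direct minimization of (2.3): 0.5*||X - Y||_F^2 + tau*||X||_*.
    X = cp.Variable(Y.shape)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(X - Y) + tau * cp.normNuc(X))).solve()

    # The two answers should agree up to the solver tolerance.
    print(np.linalg.norm(X.value - D_tau_Y))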
2.2. Shrinkage iterations. We are now in a position to introduce the singular
value thresholding algorithm. Fix τ > 0 and a sequence {δ_k} of positive step sizes.
Starting with Y^0, inductively define for k = 1, 2, . . .,

    X^k = D_τ(Y^{k−1}),
    Y^k = Y^{k−1} + δ_k PΩ(M − X^k)                                      (2.7)

until a stopping criterion is reached (we postpone the discussion of this stopping
criterion and of the choice of step sizes). This shrinkage iteration is very simple to implement.
At each step, we only need to compute an SVD and perform elementary matrix
operations. With the help of a standard numerical linear algebra package, the whole
algorithm can be coded in just a few lines. As we will see later, the iteration (2.7) is
the linearized Bregman iteration, which is a special instance of Uzawa’s algorithm.
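To illustrate just how short the implementation can be, here is a minimal dense Python
sketch of iteration (2.7) (our own illustrative code rather than the authors' solver; it
reuses the svt_shrink helper sketched in Section 2.1 and omits the stopping rule as well
as the sparse and partial-SVD refinements discussed below):

    import numpy as np

    def svt_complete(M_obs, mask, tau, delta, n_iter=500):
        """Minimal sketch of the SVT iteration (2.7) for matrix completion.

        M_obs : array holding the observed entries (values outside Omega are ignored)
        mask  : boolean array, True on the sampled set Omega
        tau   : threshold level (large in practice); delta : constant step size
        """
        Y = np.zeros_like(M_obs, dtype=float)
        X = np.zeros_like(Y)
        for _ in range(n_iter):
            X = svt_shrink(Y, tau)                            # X^k = D_tau(Y^{k-1})
            Y = Y + delta * np.where(mask, M_obs - X, 0.0)    # add P_Omega(M - X^k)
        return X

The loop body is exactly (2.7); in practice, the efficiency comes from exploiting the
sparsity of Y^k and the low rank of X^k, as described next.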
Before addressing further computational issues, we would like to make explicit the
relationship between this iteration and the original problem (1.1). In Section 4, we
will show that the sequence {X k} converges to the unique solution of an optimization
problem closely related to (1.1), namely,
    minimize    τ‖X‖_* + ½‖X‖²_F
    subject to  PΩ(X) = PΩ(M).                                           (2.8)
Furthermore, it is intuitive that the solution to this modified problem converges to
that of (1.4) as τ → ∞ as shown in Section 3. Thus by selecting a large value of the
parameter τ, the sequence of iterates converges to a matrix which nearly minimizes
(1.1).
As mentioned earlier, there are two crucial properties which make this algorithm
ideally suited for matrix completion:
• Low-rank property. A remarkable empirical fact is that the matrices in the
sequence {X^k} have low rank (provided, of course, that the solution to (2.8)
has low rank). We use the word “empirical” because all of our numerical ex-
periments have produced low-rank sequences but we cannot rigorously prove
that this is true in general. The reason for this phenomenon is, however,
simple: because we are interested in large values of τ (so as to better approxi-
mate the solution to (1.1)), the thresholding step happens to ‘kill’ most of the
small singular values and produces a low-rank output. In fact, our numerical
results show that the rank of X^k is nondecreasing with k, and the maximum
rank is reached in the last steps of the algorithm, see Section 5.
Thus, when the rank of the solution is substantially smaller than either di-
mension of the matrix, the storage requirement is low since we could store
each X^k in its SVD form (note that we only need to keep the current iterate
and may discard earlier values).
• Sparsity. Another important property of the SVT algorithm is that the it-
eration matrix Y^k is sparse. Since Y^0 = 0, we have by induction that Y^k
vanishes outside of Ω. The fewer entries available, the sparser Y^k. Because
the sparsity pattern Ω is fixed throughout, one can then apply sparse matrix
techniques to save storage. Also, if |Ω| = m, the computational cost of up-
dating Y^k is of order m. Moreover, we can call subroutines supporting sparse
matrix computations, which can further reduce computational costs.
One such subroutine is the SVD. However, note that we do not need to com-
pute the entire SVD of Y^k to apply the singular value thresholding operator.
Only the part corresponding to singular values greater than τ is needed.
Hence, a good strategy is to apply the iterative Lanczos algorithm to com-
pute the first few singular values and singular vectors. Because Y^k is sparse,
Y^k can be applied to arbitrary vectors rapidly, and this procedure offers a
considerable speedup over naive methods.
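A minimal sketch of this strategy, using SciPy's sparse partial-SVD routine (a
Lanczos-type solver) as a stand-in for the procedure described above; the doubling
heuristic for choosing how many singular triplets to request is our own illustration:

    import numpy as np
    from scipy.sparse.linalg import svds

    def svt_shrink_partial(Y_sparse, tau, k_guess=10):
        """Apply D_tau to a sparse matrix using only a partial SVD."""
        # Request the k largest singular triplets and enlarge k until the smallest
        # computed singular value drops below tau, so that every singular value
        # exceeding tau (the only ones surviving the shrinkage) has been found.
        k_max = min(Y_sparse.shape) - 1
        k = min(k_guess, k_max)
        while True:
            U, s, Vt = svds(Y_sparse, k=k)       # k largest singular triplets of Y
            if s.min() <= tau or k == k_max:
                break
            k = min(2 * k, k_max)
        s_shrunk = np.maximum(s - tau, 0.0)
        keep = s_shrunk > 0
        return (U[:, keep] * s_shrunk[keep]) @ Vt[keep, :]

In the iteration (2.7), Y^k would be stored in a sparse format (e.g. a compressed
sparse row matrix) so that each matrix-vector product inside the partial SVD costs
on the order of m operations.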
2.3. Relation with other works. Our algorithm is inspired by recent work in
the area of ℓ₁ minimization, and especially by the work on linearized Bregman itera-
tions for compressed sensing, see [11–13, 27, 56, 67] for linearized Bregman iterations
and [17,19–21,30] for some information about the field of compressed sensing. In this
line of work, linearized Bregman iterations are used to find the solution to an under-
determined system of linear equations with minimum ℓ₁ norm. In fact, Theorem 2.1
asserts that the singular value thresholding algorithm can be formulated as a linearized
Bregman iteration. Bregman iterations were first introduced in [55] as a convenient
tool for solving computational problems in the imaging sciences, and a later paper [67]
showed that they were useful for solving ℓ₁-norm minimization problems in the area of
compressed sensing. Linearized Bregman iterations were proposed in [27] to improve
performance of plain Bregman iterations, see also [67]. Additional details together
with a technique for improving the speed of convergence called kicking are described
in [56]. On the practical side, the paper [13] applied Bregman iterations to solve a
deblurring problem while on the theoretical side, the references [11, 12] gave a rigor-
ous analysis of the convergence of such iterations. New developments keep on coming
out at a rapid pace and recently, [39] introduced a new iteration, the split Bregman
iteration, to extend Bregman-type iterations (such as linearized Bregman iterations)
to problems involving the minimization of ℓ₁-like functionals such as total-variation
norms, Besov norms, and so forth.
When applied to ℓ₁-minimization problems, linearized Bregman iterations are se-
quences of soft-thresholding rules operating on vectors. Iterative soft-thresholding
algorithms in connection with ℓ₁ or total-variation minimization have quite a bit of
history in signal and image processing and we would like to mention the works [14,48]
for total-variation minimization, [28,29,36] for ℓ₁ minimization, and [5,9,10,22,23,32,
33,59] for some recent applications in the area of image inpainting and image restora-
tion. Just as iterative soft-thresholding methods are designed to find sparse solutions,
our iterative singular value thresholding scheme is designed to find a sparse vector of
singular values. In classical problems arising in the areas of compressed sensing, and
signal or image processing, the sparsity is expressed in a known transformed domain
and soft-thresholding is applied to transformed coefficients. In contrast, the shrinkage
operator Dτ is adaptive. The SVT not only discovers a sparse vector of singular values but also
the bases in which we have a sparse representation. In this sense, the SVT algorithm
is an extension of earlier iterative soft-thresholding schemes.
Finally, we would like to contrast the SVT iteration (2.7) with the popular it-
erative soft-thresholding algorithm used in many papers in image processing and
perhaps best known under the name of Proximal Forward-Backward Splitting method
(PFBS), see [10,26,28,36,40,63,64] for example. The constrained minimization prob-
lem (1.4) may be relaxed into
    minimize    λ‖X‖_* + ½ ‖PΩ(X) − PΩ(M)‖²_F                            (2.9)
for some λ > 0. Theorem 2.1 asserts that D_λ is the proximity operator of λ‖X‖_*,
and Proposition 3.1(iii) in [26] gives that the solution to this unconstrained problem is
characterized by the fixed-point equation X = D_{λδ}(X + δ PΩ(M − X)) for each δ > 0.
One can then apply a simplified version of the PFBS method (see (3.6) in [26]) to
obtain iterations of the form X^k = D_{λδ_{k−1}}(X^{k−1} + δ_{k−1} PΩ(M − X^{k−1})).
Introducing an intermediate matrix Y^k, this algorithm may be expressed as

    X^k = D_{λδ_{k−1}}(Y^{k−1}),
    Y^k = X^k + δ_k PΩ(M − X^k).                                         (2.10)
The difference with (2.7) may seem subtle at first (replacing X^k in (2.10) with Y^{k−1}
and setting δ_k = δ gives (2.7) with τ = λδ) but has enormous consequences as
this gives entirely different algorithms. First, they have different limits: while (2.7)
converges to the solution of the constrained minimization (2.8), (2.10) converges to
the solution of (2.9) provided that the sequence of step sizes is appropriately selected.
Second, selecting a large λ (or a large value of τ = λδ) in (2.10) gives a low-rank
sequence of iterates and a limit with small nuclear norm. The limit, however, does
not fit the data and this is why one has to choose a small or moderate value of λ (or
of τ = λδ). However, when λ is not sufficiently large, the X^k's may not have low
rank even though the solution has low rank (and one may need to compute many
singular vectors), and thus applying the shrinkage operation accurately to Y^k may be
computationally very expensive. Moreover, the limit does not necessarily have a small
nuclear norm. These are some of the reasons why (2.10) does not seem to be very
suitable for very large-scale matrix completion problems (in cases where computing
the SVD is prohibitive). Since the original submission of this paper, however, we note
that several papers proposed some working implementations [51, 61].
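To make the contrast concrete, here is a small self-contained sketch (our own
illustration, reusing the svt_shrink helper from Section 2.1 and a constant step size)
placing the two update rules side by side:

    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 50, 2
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # low-rank target
    mask = rng.random((n, n)) < 0.5                                 # sampled set Omega
    M_obs = np.where(mask, M, 0.0)

    delta, lam = 1.0, 2.0
    tau = lam * delta     # for SVT one would pick tau much larger in practice

    Y_svt = np.zeros((n, n))    # SVT auxiliary matrix: supported on Omega
    X_pfbs = np.zeros((n, n))   # PFBS iterate

    for _ in range(100):
        # SVT / linearized Bregman (2.7): shrink Y, then add the residual on Omega to Y.
        X_svt = svt_shrink(Y_svt, tau)
        Y_svt += delta * np.where(mask, M_obs - X_svt, 0.0)

        # PFBS (2.10): shrink the previous X plus a gradient step on the data term;
        # this targets the relaxed problem (2.9) rather than the constrained (2.8).
        grad_step = X_pfbs + delta * np.where(mask, M_obs - X_pfbs, 0.0)
        X_pfbs = svt_shrink(grad_step, lam * delta)

The only structural differences are which matrix is shrunk and what the auxiliary
matrix accumulates; as discussed above, this changes both the limit and the rank
behavior of the iterates.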
2.4. Interpretation as a Lagrange multiplier method. In this section, we
recast the SVT algorithm as a type of Lagrange multiplier algorithm known as Uzawa’s
algorithm. An important consequence is that this will allow us to extend the SVT
algorithm to other problems involving the minimization of the nuclear norm under
convex constraints, see Section 3. Further, another contribution of this paper is that
this framework actually recasts linearized Bregman iterations as a very special form of
Uzawa’s algorithm, hence providing fresh and clear insights about these iterations.
In what follows, we set f_τ(X) = τ‖X‖_* + ½‖X‖²_F for some fixed τ > 0 and recall
that we wish to solve (2.8):
    minimize    f_τ(X)
    subject to  PΩ(X) = PΩ(M).
The Lagrangian for this problem is given by L(X, Y) = f_τ(X) + ⟨Y, PΩ(M − X)⟩,
where Y ∈ R^{n₁×n₂}. Strong duality holds, and X^⋆ and Y^⋆ are primal-dual optimal if
(X^⋆, Y^⋆) is a saddlepoint of the Lagrangian L(X, Y), i.e. a pair obeying

    sup_Y inf_X L(X, Y) = L(X^⋆, Y^⋆) = inf_X sup_Y L(X, Y).             (2.11)
The function g_0(Y) = inf_X L(X, Y) is called the dual function. Here, g_0 is continu-
ously differentiable and has a gradient which is Lipschitz with Lipschitz constant at
most one, as this is a consequence of well-known results concerning conjugate func-
tions. Uzawa’s algorithm approaches the problem of finding a saddlepoint with an
iterative procedure. From Y^0 = 0, say, inductively define

    L(X^k, Y^{k−1}) = min_X L(X, Y^{k−1}),
    Y^k = Y^{k−1} + δ_k PΩ(M − X^k),                                     (2.12)
where {δ_k}_{k≥1} is a sequence of positive step sizes. Uzawa’s algorithm is, in fact, a
subgradient method applied to the dual problem, where each step moves the current
iterate in the direction of the gradient or of a subgradient. Indeed, observe that the
gradient of g_0(Y) is given by

    ∂_Y g_0(Y) = ∂_Y L(X̃, Y) = PΩ(M − X̃),                               (2.13)

where X̃ is the minimizer of the Lagrangian for that value of Y, so that a gradient
ascent update for Y is of the form

    Y^k = Y^{k−1} + δ_k ∂_Y g_0(Y^{k−1}) = Y^{k−1} + δ_k PΩ(M − X^k).
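For completeness, here is a short worked step (our own summary, relying on Theorem
2.1 and not taken verbatim from the paper) showing why the X-update in (2.12) is
exactly the shrinkage step of (2.7). Since PΩ is self-adjoint,

    arg min_X L(X, Y) = arg min_X  τ‖X‖_* + ½‖X‖²_F − ⟨PΩ(Y), X⟩
                      = arg min_X  τ‖X‖_* + ½‖X − PΩ(Y)‖²_F
                      = D_τ(PΩ(Y)),

and because every iterate Y^{k−1} vanishes outside of Ω, we have PΩ(Y^{k−1}) = Y^{k−1};
the first line of (2.12) therefore reads X^k = D_τ(Y^{k−1}), which is precisely the
shrinkage step in (2.7).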