LETTER  Communicated by Joshua B. Tenenbaum

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation

Mikhail Belkin
misha@math.uchicago.edu
Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.

Partha Niyogi
niyogi@cs.uchicago.edu
Department of Computer Science and Statistics, University of Chicago, Chicago, IL 60637, U.S.A.

Neural Computation 15, 1373–1396 (2003). © 2003 Massachusetts Institute of Technology

One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.

1 Introduction

In many areas of artificial intelligence, information retrieval, and data mining, one is often confronted with intrinsically low-dimensional data lying in a very high-dimensional space. Consider, for example, gray-scale images of an object taken under fixed lighting conditions with a moving camera. Each such image would typically be represented by a brightness value at each pixel. If there were n² pixels in all (corresponding to an n × n image), then each image yields a data point in R^{n²}. However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera. In this case, the space under consideration has the natural structure of a low-dimensional manifold embedded in R^{n²}.

Recently, there has been some renewed interest (Tenenbaum, de Silva, & Langford, 2000; Roweis & Saul, 2000) in the problem of developing low-dimensional representations when data arise from sampling a probability distribution on a manifold. In this letter, we present a geometrically
motivated algorithm and an accompanying framework of analysis for this problem.

The general problem of dimensionality reduction has a long history. Classical approaches include principal components analysis (PCA) and multidimensional scaling. Various methods that generate nonlinear maps have also been considered. Most of them, such as self-organizing maps and other neural network–based approaches (e.g., Haykin, 1999), set up a nonlinear optimization problem whose solution is typically obtained by gradient descent that is guaranteed only to produce a local optimum; global optima are difficult to attain by efficient means. Note, however, that the recent approach of generalizing the PCA through kernel-based techniques (Schölkopf, Smola, & Müller, 1998) does not have this shortcoming. Most of these methods do not explicitly consider the structure of the manifold on which the data may possibly reside.

In this letter, we explore an approach that builds a graph incorporating neighborhood information of the data set. Using the notion of the Laplacian of the graph, we then compute a low-dimensional representation of the data set that optimally preserves local neighborhood information in a certain sense. The representation map generated by the algorithm may be viewed as a discrete approximation to a continuous map that naturally arises from the geometry of the manifold.

It is worthwhile to highlight several aspects of the algorithm and the framework of analysis presented here:

• The core algorithm is very simple. It has a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. It does, however, require a search for neighboring points in a high-dimensional space. We note that there are several efficient approximate techniques for finding nearest neighbors (e.g., Indyk, 2000).

• The justification for the algorithm comes from the role of the Laplace Beltrami operator in providing an optimal embedding for the manifold. The manifold is approximated by the adjacency graph computed from the data points. The Laplace Beltrami operator is approximated by the weighted Laplacian of the adjacency graph with weights chosen appropriately. The key role of the Laplace Beltrami operator in the heat equation enables us to use the heat kernel to choose the weight decay function in a principled manner. Thus, the embedding maps for the data approximate the eigenmaps of the Laplace Beltrami operator, which are maps intrinsically defined on the entire manifold.

• The framework of analysis presented here makes explicit use of these connections to interpret dimensionality-reduction algorithms in a geometric fashion. In addition to the algorithms presented in this letter, we are also able to reinterpret the recently proposed locally linear
embedding (LLE) algorithm of Roweis and Saul (2000) within this framework. The graph Laplacian has been widely used for different clustering and partition problems (Shi & Malik, 1997; Simon, 1991; Ng, Jordan, & Weiss, 2002). Although the connections between the Laplace Beltrami operator and the graph Laplacian are well known to geometers and specialists in spectral graph theory (Chung, 1997; Chung, Grigor'yan, & Yau, 2000), so far we are not aware of any application to dimensionality reduction or data representation. We note, however, recent work on using diffusion kernels on graphs and other discrete structures (Kondor & Lafferty, 2002).

• The locality-preserving character of the Laplacian eigenmap algorithm makes it relatively insensitive to outliers and noise. It is also not prone to short circuiting, as only the local distances are used. We show that by trying to preserve local information in the embedding, the algorithm implicitly emphasizes the natural clusters in the data. Close connections to spectral clustering algorithms developed in learning and computer vision (in particular, the approach of Shi & Malik, 1997) then become very clear. In this sense, dimensionality reduction and clustering are two sides of the same coin, and we explore this connection in some detail. In contrast, global methods like that in Tenenbaum et al. (2000) do not show any tendency to cluster, as an attempt is made to preserve all pairwise geodesic distances between points. However, not all data sets necessarily have meaningful clusters. Other methods such as PCA or Isomap might be more appropriate in that case. We will demonstrate, however, that at least in one example of such a data set (the "swiss roll"), our method produces reasonable results.

• Since much of the discussion of Seung and Lee (2000), Roweis and Saul (2000), and Tenenbaum et al. (2000) is motivated by the role that nonlinear dimensionality reduction may play in human perception and learning, it is worthwhile to consider the implication of the previous remark in this context. The biological perceptual apparatus is confronted with high-dimensional stimuli from which it must recover low-dimensional structure. If the approach to recovering such low-dimensional structure is inherently local (as in the algorithm proposed here), then a natural clustering will emerge and may serve as the basis for the emergence of categories in biological perception.

• Since our approach is based on the intrinsic geometric structure of the manifold, it exhibits stability with respect to the embedding. As long as the embedding is isometric, the representation will not change. In the example with the moving camera, different resolutions of the camera (i.e., different choices of n in the n × n image grid) should lead to embeddings of the same underlying manifold into spaces of very
different dimension. Our algorithm will produce similar representations independent of the resolution.

The generic problem of dimensionality reduction is the following. Given a set x_1, . . . , x_k of k points in R^l, find a set of points y_1, . . . , y_k in R^m (m ≪ l) such that y_i "represents" x_i. In this letter, we consider the special case where x_1, . . . , x_k ∈ M and M is a manifold embedded in R^l. We now consider an algorithm to construct representative y_i's for this special case. The sense in which such a representation is optimal will become clear later in this letter.

2 The Algorithm

Given k points x_1, . . . , x_k in R^l, we construct a weighted graph with k nodes, one for each point, and a set of edges connecting neighboring points. The embedding map is now provided by computing the eigenvectors of the graph Laplacian. The algorithmic procedure is formally stated below.

1. Step 1 (constructing the adjacency graph). We put an edge between nodes i and j if x_i and x_j are "close." There are two variations:

   (a) ε-neighborhoods (parameter ε ∈ R). Nodes i and j are connected by an edge if ‖x_i − x_j‖² < ε, where the norm is the usual Euclidean norm in R^l. Advantages: Geometrically motivated, the relationship is naturally symmetric. Disadvantages: Often leads to graphs with several connected components, difficult to choose ε.

   (b) n nearest neighbors (parameter n ∈ N). Nodes i and j are connected by an edge if i is among the n nearest neighbors of j or j is among the n nearest neighbors of i. Note that this relation is symmetric. Advantages: Easier to choose; does not tend to lead to disconnected graphs. Disadvantages: Less geometrically intuitive.

2. Step 2 (choosing the weights). (In a computer implementation of the algorithm, steps 1 and 2 are executed simultaneously.) Here as well, we have two variations for weighting the edges:

   (a) Heat kernel (parameter t ∈ R). If nodes i and j are connected, put

       W_ij = e^{−‖x_i − x_j‖² / t};

       otherwise, put W_ij = 0. The justification for this choice of weights will be provided later.
   (b) Simple-minded (no parameters, t = ∞). W_ij = 1 if vertices i and j are connected by an edge, and W_ij = 0 if vertices i and j are not connected by an edge. This simplification avoids the need to choose t.

3. Step 3 (eigenmaps). Assume the graph G, constructed above, is connected. Otherwise, proceed with step 3 for each connected component. Compute eigenvalues and eigenvectors for the generalized eigenvector problem,

       Lf = λDf,    (2.1)

   where D is the diagonal weight matrix whose entries are column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix. The Laplacian is a symmetric, positive semidefinite matrix that can be thought of as an operator on functions defined on the vertices of G.

   Let f_0, . . . , f_{k−1} be the solutions of equation 2.1, ordered according to their eigenvalues:

       Lf_0 = λ_0 Df_0
       Lf_1 = λ_1 Df_1
       · · ·
       Lf_{k−1} = λ_{k−1} Df_{k−1}
       0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_{k−1}.

   We leave out the eigenvector f_0 corresponding to eigenvalue 0 and use the next m eigenvectors for embedding in m-dimensional Euclidean space:

       x_i → (f_1(i), . . . , f_m(i)).
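To make the three steps concrete, the following is a minimal NumPy/SciPy sketch of the procedure. It is not the authors' implementation: the function name, parameter names, and toy data are ours, and it assumes a dense symmetric n-nearest-neighbor graph with heat-kernel weights and a connected graph, as in step 3.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_neighbors=5, t=1.0, m=2):
    """Sketch of steps 1-3: n-nearest-neighbor graph, heat-kernel weights,
    and the generalized eigenvector problem L f = lambda D f.

    X : (k, l) array of k points in R^l.
    Returns a (k, m) array whose i-th row is (f_1(i), ..., f_m(i)).
    Assumes the resulting graph is connected.
    """
    k = X.shape[0]

    # Step 1: symmetric n-nearest-neighbor adjacency graph.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]           # skip self (column 0)
    A = np.zeros((k, k), dtype=bool)
    A[np.repeat(np.arange(k), n_neighbors), nn.ravel()] = True
    A = A | A.T              # connect i and j if either is a neighbor of the other

    # Step 2: heat-kernel weights W_ij = exp(-||x_i - x_j||^2 / t) on edges.
    W = np.where(A, np.exp(-d2 / t), 0.0)

    # Step 3: solve L f = lambda D f with D_ii = sum_j W_ij and L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, vecs = eigh(L, D)     # eigenvalues in ascending order
    return vecs[:, 1:m + 1]  # drop f_0 (constant eigenvector, lambda_0 = 0)

# Toy usage: a one-dimensional curve embedded in R^10, mapped to the plane.
rng = np.random.default_rng(0)
s = np.sort(rng.uniform(0, 1, 200))
X = np.c_[np.cos(4 * s), np.sin(4 * s), 0.01 * rng.normal(size=(200, 8))]
Y = laplacian_eigenmaps(X, n_neighbors=10, t=0.1, m=2)
print(Y.shape)               # (200, 2)
```

For large data sets one would replace the dense distance matrix and dense eigensolver above with a sparse neighbor search and a sparse solver such as scipy.sparse.linalg.eigsh, which is what makes the eigenvalue problem mentioned earlier effectively sparse.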
3 Justification

3.1 Optimal Embeddings. Let us first show that the embedding provided by the Laplacian eigenmap algorithm preserves local information optimally in a certain sense. The following section is based on standard spectral graph theory. (See Chung, 1997, for a comprehensive reference.)

Recall that given a data set, we construct a weighted graph G = (V, E) with edges connecting nearby points to each other. For the purposes of this discussion, assume the graph is connected. Consider the problem of mapping the weighted graph G to a line so that connected points stay as close together as possible. Let y = (y_1, y_2, . . . , y_n)ᵀ be such a map. A reasonable criterion for choosing a "good" map is to minimize the following objective function,

    (1/2) Σ_{i,j} (y_i − y_j)² W_ij,

under appropriate constraints. The objective function with our choice of weights W_ij incurs a heavy penalty if neighboring points x_i and x_j are mapped far apart. Therefore, minimizing it is an attempt to ensure that if x_i and x_j are "close," then y_i and y_j are close as well.

It turns out that for any y, we have

    (1/2) Σ_{i,j} (y_i − y_j)² W_ij = yᵀLy,

where as before, L = D − W. To see this, notice that W_ij is symmetric and D_ii = Σ_j W_ij. Thus,

    Σ_{i,j} (y_i − y_j)² W_ij = Σ_{i,j} (y_i² + y_j² − 2 y_i y_j) W_ij
        = Σ_i y_i² D_ii + Σ_j y_j² D_jj − 2 Σ_{i,j} y_i y_j W_ij = 2 yᵀLy.    (3.1)

Note that this calculation also shows that L is positive semidefinite.

Therefore, the minimization problem reduces to finding

    argmin_{y : yᵀDy = 1} yᵀLy.

The constraint yᵀDy = 1 removes an arbitrary scaling factor in the embedding. Matrix D provides a natural measure on the vertices of the graph. The bigger the value D_ii (corresponding to the ith vertex) is, the more "important" is that vertex. It follows from equation 3.1 that L is a positive semidefinite matrix, and the vector y that minimizes the objective function is given by the minimum eigenvalue solution to the generalized eigenvalue problem

    Ly = λDy.

Let 1 be the constant function taking 1 at each vertex. It is easy to see that 1 is an eigenvector with eigenvalue 0. If the graph is connected, 1 is the only eigenvector for λ = 0. To eliminate this trivial solution, which collapses all vertices of G onto the real number 1, we put an additional constraint of orthogonality and look for

    argmin_{yᵀDy = 1, yᵀD1 = 0} yᵀLy.

Thus, the solution is now given by the eigenvector with the smallest nonzero eigenvalue. The condition yᵀD1 = 0 can be interpreted as removing a translation invariance in y.
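As a quick numerical sanity check of equation 3.1 and of this variational characterization, the following sketch (a toy example of ours, not from the paper) builds a small random symmetric weight matrix and verifies both claims:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)

# Small random symmetric weight matrix with nonnegative entries, zero diagonal.
k = 6
W = rng.uniform(0, 1, (k, k))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))
L = D - W

# Equation 3.1: sum_{i,j} (y_i - y_j)^2 W_ij = 2 y^T L y for any y.
y = rng.normal(size=k)
lhs = np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = 2 * y @ L @ y
print(np.isclose(lhs, rhs))                   # True

# The minimizer of y^T L y subject to y^T D y = 1 and y^T D 1 = 0 is the
# generalized eigenvector with the smallest nonzero eigenvalue.
vals, vecs = eigh(L, D)                       # ascending; vals[0] ~ 0, vecs[:, 0] ~ constant
f1 = vecs[:, 1]
print(np.isclose(f1 @ D @ f1, 1.0))           # normalization y^T D y = 1
print(np.isclose(f1 @ D @ np.ones(k), 0.0))   # orthogonality to the constant vector
print(np.isclose(f1 @ L @ f1, vals[1]))       # attained objective value is lambda_1
```

The check works because scipy.linalg.eigh with a second matrix argument solves the generalized problem and returns eigenvectors that are D-orthonormal, exactly the constraint used above.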
Now consider the more general problem of embedding the graph into m-dimensional Euclidean space. The embedding is given by the k × m matrix Y = [y_1 y_2 · · · y_m], where the ith row provides the embedding coordinates of the ith vertex. Similarly, we need to minimize

    (1/2) Σ_{i,j} ‖y^(i) − y^(j)‖² W_ij = tr(YᵀLY),

where y^(i) = [y_1(i), . . . , y_m(i)]ᵀ is the m-dimensional representation of the ith vertex. This reduces to finding

    argmin_{YᵀDY = I} tr(YᵀLY).

For the one-dimensional embedding problem, the constraint prevents collapse onto a point. For the m-dimensional embedding problem, the constraint presented above prevents collapse onto a subspace of dimension less than m − 1 (m if, as in the one-dimensional case, we require orthogonality to the constant vector). Standard methods show that the solution is provided by the matrix of eigenvectors corresponding to the lowest eigenvalues of the generalized eigenvalue problem Ly = λDy.
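The m-dimensional statement can be checked numerically in the same spirit. The snippet below (again a toy example of ours) verifies the constraint YᵀDY = I and the trace identity for the generalized eigenvector solution, and that the attained objective value is the sum of the corresponding eigenvalues:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)

# Toy symmetric weight matrix, as before.
k, m = 8, 2
W = rng.uniform(0, 1, (k, k))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W

vals, vecs = eigh(L, D)
Y = vecs[:, 1:m + 1]                          # skip the constant eigenvector f_0

# Constraint Y^T D Y = I holds for the generalized eigenvectors.
print(np.allclose(Y.T @ D @ Y, np.eye(m)))

# (1/2) sum_{i,j} ||y(i) - y(j)||^2 W_ij = tr(Y^T L Y) = lambda_1 + ... + lambda_m.
diff2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
print(np.isclose(0.5 * np.sum(W * diff2), np.trace(Y.T @ L @ Y)))
print(np.isclose(np.trace(Y.T @ L @ Y), vals[1:m + 1].sum()))
```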
3.2 The Laplace Beltrami Operator. The Laplacian of a graph is analogous to the Laplace Beltrami operator on manifolds. In this section, we provide a justification for why the eigenfunctions of the Laplace Beltrami operator have properties desirable for embedding.

Let M be a smooth, compact, m-dimensional Riemannian manifold. If the manifold is embedded in R^l, the Riemannian structure (metric tensor) on the manifold is induced by the standard Riemannian structure on R^l. As we did with the graph, we are looking here for a map from the manifold to the real line such that points close together on the manifold are mapped close together on the line. Let f be such a map. Assume that f : M → R is twice differentiable.

Consider two neighboring points x, z ∈ M. They are mapped to f(x) and f(z), respectively. We first show that

    |f(z) − f(x)| ≤ dist_M(x, z) ‖∇f(x)‖ + o(dist_M(x, z)).    (3.2)

The gradient ∇f(x) is a vector in the tangent space TM_x, such that given another vector v ∈ TM_x, df(v) = ⟨∇f(x), v⟩_M.

Let l = dist_M(x, z). Let c(t) be the geodesic curve parameterized by length connecting x = c(0) and z = c(l). Then

    f(z) = f(x) + ∫_0^l df(c′(t)) dt = f(x) + ∫_0^l ⟨∇f(c(t)), c′(t)⟩ dt.

Now by the Schwarz inequality,

    ⟨∇f(c(t)), c′(t)⟩ ≤ ‖∇f(c(t))‖ ‖c′(t)‖ = ‖∇f(c(t))‖.

Since c(t) is parameterized by length, we have ‖c′(t)‖ = 1. We also have ‖∇f(c(t))‖ = ‖∇f(x)‖ + O(t) (by Taylor's approximation). Finally, by integrating, we have

    |f(z) − f(x)| ≤ l ‖∇f(x)‖ + o(l),

where both O and o are used in the infinitesimal sense. If M is isometrically embedded in R^l, then dist_M(x, z) = ‖x − z‖_{R^l} + o(‖x − z‖_{R^l}) and

    |f(z) − f(x)| ≤ ‖∇f(x)‖ ‖z − x‖ + o(‖z − x‖).

Thus, we see that ‖∇f‖ provides us with an estimate of how far apart f maps nearby points. We therefore look for a map that best preserves locality on average by trying to find

    argmin_{‖f‖_{L²(M)} = 1} ∫_M ‖∇f(x)‖²,    (3.3)

where the integral is taken with respect to the standard measure on a Riemannian manifold. Note that minimizing ∫_M ‖∇f(x)‖² corresponds to minimizing fᵀLf = (1/2) Σ_{i,j} (f_i − f_j)² W_ij on a graph. Here, f is a function on vertices, and f_i is the value of f on the ith node of the graph.

It turns out that minimizing the objective function of equation 3.3 reduces to finding eigenfunctions of the Laplace Beltrami operator L. Recall that

    L f := −div ∇(f),

where div is the divergence of the vector field. It follows from Stokes' theorem that −div and ∇ are formally adjoint operators, that is, if f is a function and X is a vector field, then ∫_M ⟨X, ∇f⟩ = −∫_M div(X) f. Thus,

    ∫_M ‖∇f‖² = ∫_M L(f) f.

We see that L is positive semidefinite. The f that minimizes ∫_M ‖∇f‖² has to be an eigenfunction of L. The spectrum of L on a compact manifold M is known to be discrete (Rosenberg, 1997). Let the eigenvalues (in increasing order) be 0 = λ_0 ≤ λ_1 ≤ λ_2 ≤ · · ·, and let f_i be the eigenfunction corresponding to eigenvalue λ_i. It is easily seen that f_0 is the constant function that maps the entire manifold to a single point. To avoid this eventuality, we