Contents

1. Introduction
   Background
   Linear Algebra Concepts
   Graph Concepts
   Paper Outline
2. Random Walk Graph Kernels
   Direct Product Graphs
   Kernel Definition
   Special Cases
3. Efficient Computation
   Sylvester Equation Methods
   Conjugate Gradient Methods
   Fixed-Point Iterations
   Geometric Kernel
4. Experiments
   Randomly Generated Graphs
   Real-World Datasets
      Unlabeled Graphs
      Labeled Graphs
   Protein Interaction Networks
   Co-Integration of Gene Expression and PPI Data
      Composite Graph Kernel
      Datasets
      Results
5. Rational Kernels
   Semiring
   Weighted Transducers
   Kernel Definition
   Kernels on Weighted Transducers
   Recovering Random Walk Graph Kernels
6. R-convolution Kernels
   Graph Kernels as R-Convolutions
   R-Convolutions in Abstract Semirings
7. Diffusion-Based Graph Kernels?
   Diffusion Kernels on Graph Vertices
   Cartesian Product Graph Kernels
   Efficient Computation of Cartesian Product Kernels
   A Deficiency of Diffusion-Based Graph Kernels
8. Outlook and Discussion
Appendix A. Extending Linear Algebra to RKHS
   Matrix Product
   Kronecker Product
   Heterogeneous Kronecker Product
   Kronecker Sum
   Hadamard Product
Appendix B. Cartesian Product Kernels: Proof of Lemma 9
arXiv:0807.0093v1 [cs.LG] 1 Jul 2008

Journal of Machine Learning Research 9 (2008) 1–41        Submitted 05/08; Published xx/08

Graph Kernels

S. V. N. Vishwanathan                                      SVN.Vishwanathan@nicta.com.au
College of Engineering and Computer Science
Australian National University and NICTA
Locked Bag 8001, Canberra ACT 2601, Australia

Karsten M. Borgwardt                                       kmb51@cam.ac.uk
Machine Learning Group, Department of Engineering
University of Cambridge
Trumpington Street, CB2 1PZ Cambridge, United Kingdom

Imre Risi Kondor                                           risi@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit
University College London
17 Queen Square, WC1N 3AR London, United Kingdom

Nicol N. Schraudolph                                       jmlr@schraudolph.org
College of Engineering and Computer Science
Australian National University and NICTA
Locked Bag 8001, Canberra ACT 2601, Australia

Editor: U. N. Known

Abstract

We present a unified framework to study graph kernels, special cases of which include the random walk graph kernel (Gärtner et al., 2003; Borgwardt et al., 2005), the marginalized graph kernel (Kashima et al., 2003, 2004; Mahé et al., 2004), and the geometric kernel on graphs (Gärtner, 2002). Through extensions of linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) and reduction to a Sylvester equation, we construct an algorithm that improves the time complexity of kernel computation from O(n^6) to O(n^3). When the graphs are sparse, conjugate gradient solvers or fixed-point iterations bring our algorithm into the sub-cubic domain. Experiments on graphs from bioinformatics and other application domains show that it is often more than a thousand times faster than previous approaches. We then explore connections between diffusion kernels (Kondor and Lafferty, 2002), regularization on graphs (Smola and Kondor, 2003), and graph kernels, and use these connections to propose new graph kernels. Finally, we show that rational kernels (Cortes et al., 2002, 2003, 2004), when specialized to graphs, reduce to the random walk graph kernel.

Keywords: Graph kernels, Linear Algebra in RKHS, Sylvester Equations, Bioinformatics, Rational Kernels, Transducers, Semirings, Diffusion, Random Walks, Regularization on Graphs.

© 2008 S. V. N. Vishwanathan, Karsten M. Borgwardt, Imre Risi Kondor, and Nicol N. Schraudolph.
1. Introduction

We begin by providing some background, establishing some basic notation, and giving an outline of the paper.

1.1 Background

Machine learning in domains such as bioinformatics (Sharan and Ideker, 2006), chemoinformatics (Bonchev and Rouvray, 1991), drug discovery (Kubinyi, 2003), web data mining (Washio and Motoda, 2003), and social networks (Kumar et al., 2006) involves the study of relationships between structured objects. Graphs are natural data structures to model such structures, with nodes representing objects and edges the relations between them. In this context, one often encounters two questions: "How similar are two nodes in a given graph?" and "How similar are two graphs to each other?"

Kernel methods (Schölkopf and Smola, 2002) offer a natural framework to study these questions. Roughly speaking, a kernel k(x, x') is a measure of similarity between objects x and x'. It must satisfy two mathematical requirements: it must be symmetric, that is, k(x, x') = k(x', x), and positive semi-definite (p.s.d.). Comparing nodes in a graph involves constructing a kernel between nodes, while comparing graphs involves constructing a kernel between graphs. In both cases, the challenge is to define a kernel that captures the semantics inherent in the graph structure and at the same time is reasonably efficient to evaluate.

Until now, the above two types of kernels have largely been studied separately. The idea of constructing kernels on graphs (i.e., between the nodes of a single graph) was first proposed by Kondor and Lafferty (2002), and extended by Smola and Kondor (2003). Kernels between graphs were proposed by Gärtner (2002) (geometric kernels on graphs) and Gärtner et al. (2003) (random walk graph kernels), and later extended by Borgwardt et al. (2005). Much at the same time, the idea of marginalized kernels (Tsuda et al., 2002) was extended to graphs by Kashima et al. (2003, 2004), and further refined by Mahé et al. (2004). A seemingly independent line of research investigates the so-called rational kernels, which are kernels between finite state automata based on the algebra of abstract semirings (Cortes et al., 2002, 2003, 2004).

The aim of this paper is twofold: on the one hand we present theoretical results showing that these four strands of research are in fact closely related; on the other we present new algorithms for efficiently computing kernels between graphs. Towards this end we first establish some notation and review pertinent concepts from linear algebra and graph theory.

1.2 Linear Algebra Concepts

We use e_i to denote the i-th standard basis vector (that is, a vector of all zeros with the i-th entry set to one), e to denote a vector with all entries set to one, 0 to denote the vector of all zeros, and I to denote the identity matrix. When it is clear from context we will not mention the dimensions of these vectors and matrices.
Definition 1  Given real matrices A ∈ R^{n×m} and B ∈ R^{p×q}, the Kronecker product A ⊗ B ∈ R^{np×mq} and the column-stacking operator vec(A) ∈ R^{nm} are defined as

\[
A \otimes B := \begin{bmatrix}
A_{11}B & A_{12}B & \cdots & A_{1m}B \\
\vdots  & \vdots  & \ddots & \vdots  \\
A_{n1}B & A_{n2}B & \cdots & A_{nm}B
\end{bmatrix},
\qquad
\operatorname{vec}(A) := \begin{bmatrix} A_{*1} \\ \vdots \\ A_{*m} \end{bmatrix},
\]

where A_{*j} denotes the j-th column of A. The Kronecker product and the vec operator are linked by the well-known property (e.g., Bernstein, 2005, Proposition 7.1.9):

\[
\operatorname{vec}(ABC) = (C^\top \otimes A)\operatorname{vec}(B).  \tag{1}
\]

Another well-known property of the Kronecker product, which we use in Section 5, is (Bernstein, 2005, Proposition 7.1.6):

\[
(A \otimes B)(C \otimes D) = AC \otimes BD.  \tag{2}
\]

A closely related concept is the Kronecker sum, defined for real matrices A ∈ R^{n×m} and B ∈ R^{p×q} as

\[
A \oplus B := A \otimes I_{p \times q} + I_{n \times m} \otimes B,  \tag{3}
\]

with I_{n×m} (resp. I_{p×q}) denoting the n×m (resp. p×q) identity matrix. Many of its properties can be derived from those of the Kronecker product.

Finally, the Hadamard product of two real matrices A, B ∈ R^{n×m}, denoted by A ⊙ B ∈ R^{n×m}, is obtained by element-wise multiplication. It interacts with the Kronecker product via

\[
(A \otimes B) \odot (C \otimes D) = (A \odot C) \otimes (B \odot D).  \tag{4}
\]

All of the above concepts can be extended to a Reproducing Kernel Hilbert Space (RKHS); see Appendix A for details.
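The Kronecker and Hadamard identities above are easy to sanity-check numerically. The following NumPy snippet (an illustration added here, not part of the original paper) verifies properties (1), (2), and (4) on random matrices; all shapes are chosen arbitrarily so that the products are defined.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # n x m
B = rng.standard_normal((4, 2))   # m x r
C = rng.standard_normal((2, 5))   # r x s

def vec(M):
    # Column-stacking operator: stack the columns of M into one long vector.
    return M.reshape(-1, order="F")

# Property (1): vec(A B C) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))

# Property (2): (A kron D)(C2 kron D2) = (A C2) kron (D D2)
D  = rng.standard_normal((2, 3))
C2 = rng.standard_normal((4, 5))
D2 = rng.standard_normal((3, 6))
assert np.allclose(np.kron(A, D) @ np.kron(C2, D2), np.kron(A @ C2, D @ D2))

# Property (4): (A2 kron B2) .* (C3 kron D3) = (A2 .* C3) kron (B2 .* D3)
A2, C3 = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
B2, D3 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
assert np.allclose(np.kron(A2, B2) * np.kron(C3, D3), np.kron(A2 * C3, B2 * D3))

print("Kronecker/vec identities (1), (2), (4) verified numerically.")
```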
1.3 Graph Concepts

A graph G consists of an ordered set of n vertices V = {v_1, v_2, ..., v_n} and a set of edges E ⊂ V × V. A vertex v_i is said to be a neighbor of another vertex v_j if they are connected by an edge, i.e., if (v_i, v_j) ∈ E; this is also denoted v_i ~ v_j. A walk of length t on G is a sequence of indices i_1, i_2, ..., i_{t+1} such that v_{i_k} ~ v_{i_{k+1}} for all 1 ≤ k ≤ t. A graph is said to be connected if any two vertices can be joined by a walk. In this paper we always work with connected graphs. A graph is said to be undirected if (v_i, v_j) ∈ E ⟺ (v_j, v_i) ∈ E.

In much of the following we will be dealing with weighted graphs, which are a slight generalization of the above. In a weighted graph, each edge (v_i, v_j) has an associated weight w_{ij} > 0 signifying its "strength". If v_i and v_j are not neighbors, then w_{ij} = 0. In an undirected weighted graph, w_{ij} = w_{ji}.

The adjacency matrix of an unweighted graph is an n × n matrix A with A_{ij} = 1 if v_i ~ v_j, and 0 otherwise. The adjacency matrix of a weighted graph is simply the matrix of weights, A_{ij} = w_{ij}. In both cases, if G is undirected, the adjacency matrix is symmetric. The diagonal entries of A are always zero.

The adjacency matrix has a normalized cousin, Ã := D^{-1} A, which has the property that each of its rows sums to one; it can therefore serve as the transition matrix of a stochastic process. Here, D is a diagonal matrix of node degrees, D_{ii} = d_i = Σ_j A_{ij}. A random walk on G is a process generating sequences of vertices v_{i_1}, v_{i_2}, v_{i_3}, ... according to P(i_{k+1} | i_1, ..., i_k) = Ã_{i_k, i_{k+1}}; that is, the probability at v_{i_k} of picking v_{i_{k+1}} next is proportional to the weight of the edge (v_{i_k}, v_{i_{k+1}}). The t-th power of Ã thus describes walks of length t: [Ã^t]_{ij} is the probability of a transition from vertex v_i to vertex v_j via a walk of length t. If p_0 is an initial probability distribution over vertices, the probability distribution p_t describing the location of our random walker at time t is p_t = Ã^t p_0. The j-th component of p_t denotes the probability of finishing a t-length walk at vertex v_j. We will use this intuition to define generalized random walk graph kernels.

Let X be a set of labels which includes the special label ζ. Every edge-labeled graph G is associated with a label matrix X ∈ X^{n×n} such that X_{ij} = ζ iff (v_i, v_j) ∉ E; in other words, only those edges which are present in the graph get a non-ζ label. Let H be the RKHS endowed with the kernel κ : X × X → R, and let φ : X → H denote the corresponding feature map, which maps ζ to the zero element of H. We use Φ(X) to denote the feature matrix of G (see Appendix A for details). For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well. Henceforth we use the term labeled graph to denote an edge-labeled graph.
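As a small illustration of these definitions (again not from the paper; the graph is made up), the snippet below builds the adjacency matrix of a weighted undirected graph, forms its normalized cousin Ã = D⁻¹A, and checks that Ã and its powers have rows summing to one, as required of a transition matrix.

```python
import numpy as np

# Adjacency matrix of a small, made-up weighted undirected graph (zero diagonal).
A = np.array([[0.0, 2.0, 1.0, 0.0],
              [2.0, 0.0, 0.0, 3.0],
              [1.0, 0.0, 0.0, 1.0],
              [0.0, 3.0, 1.0, 0.0]])

# D is the diagonal matrix of node degrees d_i = sum_j A_ij.
d = A.sum(axis=1)
A_tilde = A / d[:, None]          # normalized adjacency, A_tilde = D^{-1} A

# Each row of A_tilde sums to one, so it is a valid transition matrix;
# the same holds for its powers, which describe walks of length t.
for t in range(1, 4):
    assert np.allclose(np.linalg.matrix_power(A_tilde, t).sum(axis=1), 1.0)
```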
1.4 Paper Outline

In the first part of this paper (Sections 2–4) we present a unifying framework for graph kernels, encompassing many known kernels as special cases and connecting to others. We describe our framework in Section 2, prove that it leads to p.s.d. kernels, and discuss random walk graph kernels, geometric kernels on graphs, and marginalized graph kernels as special cases. For ease of exposition we work with real matrices in the main body of the paper and relegate the RKHS extensions to Appendix A. In Section 3 we present three efficient ways to compute random walk graph kernels, namely (1) via reduction to a Sylvester equation, (2) using a conjugate gradient (CG) solver, and (3) using a fixed-point iteration. Experiments on a variety of real and synthetic datasets in Section 4 illustrate the computational advantages of our approach, which reduces the time complexity of kernel computation from O(n^6) to O(n^3).

In the second part (Sections 5–7) we draw further connections to existing kernels on structured objects. In Section 5 we present a simple proof that rational kernels are p.s.d., and show that specializing them to graphs yields random walk graph kernels. In Section 6 we discuss the relation between R-convolution kernels and various incarnations of graph kernels; in fact, all known graph kernels can be shown to be instances of R-convolution kernels. We also show that extending the framework through the use of semirings does not always result in a p.s.d. kernel; a case in point is the optimal assignment kernel of Fröhlich et al. (2006). In Section 7 we shift our attention to diffusion processes on graphs and associated kernels; this leads us to propose new kernels on graphs, based on the Cartesian graph product. We show that the efficient computation techniques introduced in Section 3 are also applicable here, but are ultimately forced to conclude that diffusion-based graph kernels are not useful in a general context. We conclude in Section 8 with an outlook and discussion.
2. Random Walk Graph Kernels

Our generalized random walk graph kernels are based on a simple idea: given a pair of graphs, perform random walks on both, and count the number of matching walks. We show that this simple concept underlies random walk graph kernels, marginalized graph kernels, and geometric kernels on graphs. In order to do this, we first need to introduce direct product graphs.

2.1 Direct Product Graphs

Given two graphs G(V, E) and G'(V', E') (with |V| = n and |V'| = n'), their direct product G_× is a graph with vertex set

\[
V_\times = \{(v_i, v'_{i'}) : v_i \in V, \; v'_{i'} \in V'\},  \tag{5}
\]

and edge set

\[
E_\times = \{\big((v_i, v'_{i'}), (v_j, v'_{j'})\big) : (v_i, v_j) \in E \,\wedge\, (v'_{i'}, v'_{j'}) \in E'\}.  \tag{6}
\]

In other words, G_× is a graph over pairs of vertices from G and G', and two vertices in G_× are neighbors if and only if the corresponding vertices in G and G' are both neighbors (see Figure 1 for an illustration). If A and A' are the respective adjacency matrices of G and G', then the adjacency matrix of G_× is A_× = A ⊗ A'. Similarly, Ã_× = Ã ⊗ Ã'.

[Figure 1: Two graphs (top left & right) and their direct product (bottom). Each node of the direct product graph is labeled with a pair of nodes; an edge exists in the direct product if and only if the corresponding nodes are adjacent in both original graphs. For instance, nodes 11' and 32' are adjacent because there is an edge between nodes 1 and 3 in the first graph, and 1' and 2' in the second.]

Performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on G and G' (Imrich and Klavžar, 2000). If p and p' denote initial probability distributions over the vertices of G and G', then the corresponding initial probability distribution on the direct product graph is p_× := p ⊗ p'. Likewise, if q and q' are stopping probabilities (that is, the probability that a random walk ends at a given vertex), then the stopping probability on the direct product graph is q_× := q ⊗ q'.

If G and G' are edge-labeled, we can associate a weight matrix W_× ∈ R^{nn'×nn'} with G_× using our Kronecker product in RKHS (Definition 12): W_× = Φ(X) ⊗ Φ(X'). As a consequence of the definition of Φ(X) and Φ(X'), the entries of W_× are non-zero only if the corresponding edge exists in the direct product graph. The weight matrix is closely related to the normalized adjacency matrix: assume that H = R endowed with the usual inner product, and φ(X_{ij}) = 1/d_i if (v_i, v_j) ∈ E and zero otherwise. Then Φ(X) = Ã and Φ(X') = Ã', and consequently W_× = Ã_×; that is, the weight matrix is identical to the normalized adjacency matrix of the direct product graph.

To extend the above discussion, assume that H = R^d endowed with the usual inner product, and that there are d distinct edge labels {1, 2, ..., d}. For each edge (v_i, v_j) ∈ E we have φ(X_{ij}) = e_l / d_i if the edge (v_i, v_j) is labeled l; all other entries of Φ(X) are set to 0. κ is therefore a delta kernel, that is, its value between any two edges is one iff the labels on the edges match, and zero otherwise. The weight matrix W_× has a non-zero entry iff an edge exists in the direct product graph and the corresponding edges in G and G' have the same label. Let ^lÃ denote the normalized adjacency matrix of the graph filtered by the label l, that is, ^lÃ_{ij} = Ã_{ij} if X_{ij} = l and zero otherwise. Some simple algebra (omitted for the sake of brevity) shows that the weight matrix of the direct product graph can then be written as

\[
W_\times = \sum_{l=1}^{d} {}^{l}\tilde{A} \otimes {}^{l}\tilde{A}'.  \tag{7}
\]

We will show in the sequel that kernels defined using the above weight matrix can be computed efficiently.
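To make the construction concrete, here is an illustrative NumPy sketch (not from the paper; the graphs and edge labels are invented for the example) that builds the direct product adjacency matrix A_× = A ⊗ A' for two unlabeled graphs, and the weight matrix W_× of equation (7) for edge-labeled graphs from label-filtered normalized adjacency matrices.

```python
import numpy as np

def normalized_adjacency(A):
    """Row-normalize an adjacency matrix: A_tilde = D^{-1} A."""
    return A / A.sum(axis=1, keepdims=True)

# Two small unlabeled graphs (symmetric adjacency matrices, made up for illustration).
A  = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)

# Direct product graph: A_x = A kron A'; a pair-vertex is connected to another
# pair-vertex iff both coordinate pairs are connected in their own graphs.
A_x = np.kron(A, A2)

# Edge-labeled case, equation (7): W_x = sum_l (lA kron lA'),
# where lA keeps only the normalized entries whose edge carries label l.
def weight_matrix(A, X, A2, X2, labels):
    At, At2 = normalized_adjacency(A), normalized_adjacency(A2)
    W = np.zeros((A.shape[0] * A2.shape[0],) * 2)
    for l in labels:
        lA  = np.where(X == l,  At,  0.0)
        lA2 = np.where(X2 == l, At2, 0.0)
        W += np.kron(lA, lA2)
    return W

# Hypothetical edge-label matrices; label 0 plays the role of the special label zeta (no edge).
X  = np.where(A  > 0, np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]]), 0)
X2 = np.where(A2 > 0, np.array([[0, 1, 0, 2], [1, 0, 2, 0], [0, 2, 0, 1], [2, 0, 1, 0]]), 0)
W_x = weight_matrix(A, X, A2, X2, labels=[1, 2])
print(A_x.shape, W_x.shape)   # (12, 12) (12, 12)
```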
2.2 Kernel Definition

As stated above, performing a random walk on the direct product graph G_× is equivalent to performing a simultaneous random walk on the graphs G and G' (Imrich and Klavžar, 2000).
Therefore, the ((i−1)n' + j, (i'−1)n' + j')-th entry of Ã_×^k represents the probability of simultaneous length-k random walks on G (starting from vertex v_i and ending in vertex v_{i'}) and G' (starting from vertex v'_j and ending in vertex v'_{j'}). The entries of W_× represent similarity between edges: the ((i−1)n' + j, (i'−1)n' + j')-th entry of W_×^k represents the similarity between simultaneous length-k random walks on G and G', measured via the kernel function κ. Given the weight matrix W_×, initial and stopping probability distributions p_× and q_×, and an appropriately chosen discrete measure µ, we can define a kernel on G and G' as

\[
k(G, G') := \sum_{k=1}^{\infty} \mu(k)\, q_\times^\top W_\times^{k}\, p_\times.  \tag{8}
\]

In order to show that (8) is a valid p.s.d. kernel we need the following technical lemma:

Lemma 2  ∀ k ∈ N :  W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤].

Proof  By induction over k. Base case: k = 1. Observe that

\[
p_\times = (p \otimes p')\,\operatorname{vec}(1) = \operatorname{vec}(p'\, p^\top).  \tag{9}
\]

By using Lemma 13, W_× p_× can be written as

\[
\big[\Phi(X) \otimes \Phi(X')\big] \operatorname{vec}(p'\, p^\top)
  = \operatorname{vec}\!\big[\Phi(X')\, p'\, p^\top \Phi(X)^\top\big]
  = \operatorname{vec}\!\big[\Phi(X')\, p'\, (\Phi(X)\, p)^\top\big].  \tag{10}
\]

Induction from k to k+1: using the induction assumption W_×^k p_× = vec[Φ(X')^k p' (Φ(X)^k p)^⊤] and Lemma 13 we obtain

\[
W_\times^{k+1} p_\times = W_\times W_\times^{k} p_\times
  = \big[\Phi(X) \otimes \Phi(X')\big] \operatorname{vec}\!\big[\Phi(X')^{k} p'\, (\Phi(X)^{k} p)^\top\big]
  = \operatorname{vec}\!\big[\Phi(X')\,\Phi(X')^{k} p'\, (\Phi(X)^{k} p)^\top \Phi(X)^\top\big]
  = \operatorname{vec}\!\big[\Phi(X')^{k+1} p'\, (\Phi(X)^{k+1} p)^\top\big].  \tag{11}
\]

Lemma 3  If the measure µ(k) is such that (8) converges, then it defines a valid p.s.d. kernel.

Proof  Using Lemmas 13 and 2 we can write

\[
q_\times^\top W_\times^{k} p_\times
  = (q \otimes q')^\top \operatorname{vec}\!\big[\Phi(X')^{k} p'\, (\Phi(X)^{k} p)^\top\big]
  = \operatorname{vec}\!\big[q'^\top \Phi(X')^{k} p'\, (\Phi(X)^{k} p)^\top q\big]
  = \underbrace{\big(q'^\top \Phi(X')^{k} p'\big)}_{\rho(G')}\;
    \underbrace{\big(q^\top \Phi(X)^{k} p\big)}_{\rho(G)}.  \tag{12}
\]

Each individual term of (8) thus equals ρ(G) ρ(G') for some function ρ, and is therefore a valid p.s.d. kernel. The lemma follows because the class of p.s.d. kernels is closed under convex combinations (Berg et al., 1984).
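Lemma 2 and the kernel definition (8) can be checked numerically in the simple unlabeled special case where Φ(X) = Ã and Φ(X') = Ã' (see Section 2.1). The sketch below (illustrative only; the toy graphs and the choice µ(k) = λ^k are arbitrary) verifies the lemma for a few values of k and evaluates a truncated version of the sum (8).

```python
import numpy as np

def normalized_adjacency(A):
    return A / A.sum(axis=1, keepdims=True)

def vec(M):
    return M.reshape(-1, order="F")

# Two toy unlabeled graphs; in the unlabeled case Phi(X) = A_tilde and Phi(X') = A_tilde'.
A  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
At, At2 = normalized_adjacency(A), normalized_adjacency(A2)

n, n2 = len(A), len(A2)
p  = q  = np.full(n,  1.0 / n)    # uniform start/stop distributions on G
p2 = q2 = np.full(n2, 1.0 / n2)   # uniform start/stop distributions on G'

W_x = np.kron(At, At2)            # weight matrix of the direct product graph
p_x, q_x = np.kron(p, p2), np.kron(q, q2)

# Lemma 2: W_x^k p_x = vec[ At'^k p' (At^k p)^T ]
for k in range(1, 5):
    lhs = np.linalg.matrix_power(W_x, k) @ p_x
    rhs = vec(np.outer(np.linalg.matrix_power(At2, k) @ p2,
                       np.linalg.matrix_power(At,  k) @ p))
    assert np.allclose(lhs, rhs)

# Kernel (8) with mu(k) = lambda^k, truncated at k_max terms for illustration.
lam, k_max = 0.1, 20
k_val = sum(lam**k * (q_x @ np.linalg.matrix_power(W_x, k) @ p_x)
            for k in range(1, k_max + 1))
print("truncated kernel value:", k_val)
```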
2.3 Special Cases

A popular choice to ensure convergence of (8) is to assume µ(k) = λ^k for some λ > 0. If λ is sufficiently small¹ then (8) is well-defined, and we can write

\[
k(G, G') = \sum_{k} \lambda^{k} q_\times^\top W_\times^{k} p_\times
         = q_\times^\top (\mathrm{I} - \lambda W_\times)^{-1} p_\times.  \tag{13}
\]

[¹ The values of λ which ensure convergence depend on the spectrum of W_×; see Chapter 6 of Vishwanathan (2002) for a discussion of this issue.]

Kashima et al. (2004) use marginalization and probabilities of random walks to define kernels on graphs. Given transition probability matrices P and P' associated with graphs G and G' respectively, their kernel can be written as (Kashima et al., 2004, Eq. 1.19)

\[
k(G, G') = q_\times^\top (\mathrm{I} - T_\times)^{-1} p_\times,  \tag{14}
\]

where T_× := [vec(P) vec(P')^⊤] ⊙ [Φ(X) ⊗ Φ(X')]. The edge kernel κ̂(X_{ij}, X'_{i'j'}) := P_{ij} P'_{i'j'} κ(X_{ij}, X'_{i'j'}) with λ = 1 recovers (13).

Gärtner et al. (2003), on the other hand, use the adjacency matrix of the direct product graph to define the so-called random walk graph kernel

\[
k(G, G') = \sum_{i=1}^{nn'} \sum_{j=1}^{nn'} \sum_{k=1}^{\infty} \lambda^{k} \big[A_\times^{k}\big]_{ij}.  \tag{15}
\]

To recover their kernel in our framework, assume a uniform distribution over the vertices of G and G', that is, set p = q = 1/n and p' = q' = 1/n'. The initial as well as final probability distribution over vertices of G_× is then given by p_× = q_× = e/(nn'). Setting Φ(X) := A, Φ(X') := A', and hence W_× = A_×, we can rewrite (8) to obtain

\[
k(G, G') = \sum_{k=1}^{\infty} \lambda^{k} q_\times^\top A_\times^{k} p_\times
         = \frac{1}{n^{2} n'^{2}} \sum_{i=1}^{nn'} \sum_{j=1}^{nn'} \sum_{k=1}^{\infty} \lambda^{k} \big[A_\times^{k}\big]_{ij},
\]

which recovers (15) to within a constant factor.

Finally, the so-called geometric kernel is defined as (Gärtner, 2002)

\[
k(G, G') = \sum_{i=1}^{nn'} \sum_{j=1}^{nn'} \big[e^{\lambda A_\times}\big]_{ij} = e^\top e^{\lambda A_\times} e,  \tag{16}
\]

which can be recovered in our setting by setting p = q = 1/n, p' = q' = 1/n', Φ(X) := A, Φ(X') := A', and µ(k) = λ^k / k!.

3. Efficient Computation

Computing the kernels of Gärtner et al. (2003) and Kashima et al. (2004) essentially boils down to inverting the matrix (I − λW_×). If both G and G' have n vertices, then (I − λW_×) is an n² × n² matrix. Given that the complexity of inverting a matrix is cubic in its dimensions, a direct computation of (13) would require O(n^6) time. In the first part of this section we show that iterative methods, including those based on Sylvester equations, conjugate gradients, and fixed-point iterations, can be used to speed up this computation. Later, in Section 3.4, we show that the geometric kernel can be computed in O(n^3) time.
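As a preview of the savings discussed next, the following sketch (an illustration under the assumptions of the unlabeled case, not the paper's own code) evaluates the closed form (13) without ever building the n'n × n'n matrix (I − λW_×): since W_× = Ã ⊗ Ã' here, the vec/Kronecker identity gives W_× vec(R) = vec(Ã' R Ã^⊤), so each sweep of the fixed-point iteration x ← p_× + λ W_× x costs only two ordinary matrix products.

```python
import numpy as np

def geometric_kernel_fixed_point(At, At2, p, p2, q, q2, lam=0.1, tol=1e-12, max_iter=1000):
    """
    Evaluate q_x^T (I - lam * W_x)^{-1} p_x for W_x = At kron At2 (unlabeled case)
    without forming the (n*n') x (n*n') matrix W_x explicitly.

    The iterate x of x = p_x + lam * W_x x is kept in matrix form R (shape n' x n),
    with x = vec(R); by the vec/Kronecker identity, W_x vec(R) = vec(At2 @ R @ At.T),
    so each fixed-point sweep costs two ordinary matrix products.
    """
    R = np.outer(p2, p)            # vec(R) = p_x = p kron p2
    P = R.copy()
    for _ in range(max_iter):
        R_new = P + lam * (At2 @ R @ At.T)
        if np.abs(R_new - R).max() < tol:
            R = R_new
            break
        R = R_new
    return q2 @ R @ q              # equals q_x^T x with q_x = q kron q2

# Toy usage with the normalized adjacency matrices from the earlier sketches.
def normalized_adjacency(A):
    return A / A.sum(axis=1, keepdims=True)

A  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
A2 = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
At, At2 = normalized_adjacency(A), normalized_adjacency(A2)
p = q = np.full(3, 1 / 3); p2 = q2 = np.full(4, 1 / 4)

k_fast = geometric_kernel_fixed_point(At, At2, p, p2, q, q2, lam=0.1)

# Naive check: explicitly solve with the (n n') x (n n') matrix -- cubic in n n'.
W_x = np.kron(At, At2)
p_x, q_x = np.kron(p, p2), np.kron(q, q2)
k_naive = q_x @ np.linalg.solve(np.eye(12) - 0.1 * W_x, p_x)
assert np.isclose(k_fast, k_naive)
```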