logo资料库

Constrained Local Graph Clustering by Colored Random Walk.pdf

第1页 / 共10页
第2页 / 共10页
第3页 / 共10页
第4页 / 共10页
第5页 / 共10页
第6页 / 共10页
第7页 / 共10页
第8页 / 共10页
资料共10页,剩余部分请下载后查看
Abstract
1 Introduction
2 Related Work
3 Our Proposal
3.1 Preliminaries
3.2 The Colored Random Walk
3.3 Transition Matrix Reinforcement
3.4 Convergence
3.5 The Algorithm
3.6 Speedup Strategy
4 Experimental Validation
4.1 Non-overlapping Clusters
4.2 Overlapping Clusters
4.3 Impact of Color Attraction and Repulsion
4.4 Evaluation of Speedup Strategy
5 Case Study
5.1 Karate Club Social Network
5.2 Brain Activity Network
6 Conclusion
References
Constrained Local Graph Clustering by Colored Random Walk Yaowei Yan yxy230@ist.psu.edu Pennsylvania State University Yuchen Bian yub31@ist.psu.edu Pennsylvania State University Dongsheng Luo dul262@ist.psu.edu Pennsylvania State University Dongwon Lee dlee@ist.psu.edu Pennsylvania State University ABSTRACT Detecting local graph clusters is an important problem in big graph analysis. Given seed nodes in a graph, local clustering aims at find- ing subgraphs around the seed nodes, which consist of nodes highly relevant to the seed nodes. However, existing local clustering meth- ods either allow only a single seed node, or assume all seed nodes are from the same cluster, which is not true in many real applica- tions. Moreover, the assumption that all seed nodes are in a single cluster fails to use the crucial information of relations between seed nodes. In this paper, we propose a method to take advantage of such relationship. With prior knowledge of the community membership of the seed nodes, the method labels seed nodes in the same (differ- ent) community by the same (different) color. To further use this information, we introduce a color-based random walk mechanism, where colors are propagated from the seed nodes to every node in the graph. By the interaction of identical and distinct colors, we can enclose the supervision of seed nodes into the random walk process. We also propose a heuristic strategy to speed up the algorithm by more than 2 orders of magnitude. Experimental evaluations reveal that our clustering method outperforms state-of-the-art approaches by a large margin. CCS CONCEPTS • Mathematics of computing → Graph algorithms; • Infor- mation systems → Clustering; • Theory of computation → Random walks and Markov chains. KEYWORDS community detection; local graph clustering; random walk; non- Markovian ACM Reference Format: Yaowei Yan, Yuchen Bian, Dongsheng Luo, Dongwon Lee, and Xiang Zhang. 2019. Constrained Local Graph Clustering by Colored Random Walk. In Proceedings of the 2019 World Wide Web Conference (WWW ’19), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 10 pages. https: //doi.org/10.1145/3308558.3313719 This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW ’19, May 13–17, 2019, San Francisco, CA, USA © 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-6674-8/19/05. https://doi.org/10.1145/3308558.3313719 Xiang Zhang xzhang@ist.psu.edu Pennsylvania State University Community returned by seed 1 Hong Cheng Xiaofei He Xifeng Yan Ke Wang Seed 2 H. Kitagawa I. Kamel Jiawei Han Wei Wang Papadimitriou C. Faloutsos Xiaoxin YinH. Gonzalez Jimeng Sun C.Traina Jr. Flip Korn H. Jagadish Seed 1 Community returned by both seed 1 and 2 Community returned by seed 2 Figure 1: Example of an academic social network. All nodes are data science researchers and form two subgroups by their collaborations. Query from each side returns a single subgroup (in dotted boxes). The joint query returns two sub- groups together (in the solid box). 1 INTRODUCTION The graph/network is a natural representation of real-world rela- tions. Graph clustering (or community detection) is a fundamental problem and is the base of many applications, such as online rec- ommendation, medical diagnosis, social network and biological network analysis [11, 14, 15, 18]. The clustering of a graph aims to find groups of nodes that are closely related to each other in each group. Unlike global graph clustering which retrieves all clus- ters/communities from a graph, local graph clustering focuses on the subgraph within the neighborhood of the seed/query nodes and is applicable to very large data [17]. With the increasing size of networks, recently local clustering has attracted lots of research interest [6, 8, 9, 22]. The seed nodes of local graph clustering can be single or multiple. Existing local graph clustering methods do not accept multiple seed nodes (e.g., Random Walk with Restart [19]), or simply assume that all these seeds are from the same cluster (e.g., Personalized PageRank [1]). We argue that knowing the relations of seed nodes (in the same community or not) are important, as the network data may be noisy or lossy to be clustered accurately. Furthermore, it is a relatively easy task to label the seed nodes by users or domain experts, since the number of seed nodes is small. Knowing seeds nodes are in the same community can help to merge divided subgroups to a whole community. Fig. 1 is part of a collaboration network used by a conference program committee (PC) member recommendation system. Given that Dr. Ke Wang and Dr. Kitagawa are prospectively favorable PC members, we want the system to recommend more researchers as candidates.
Clustering if seeds 1 and 2 are different Seed 2 Polyps Benign diseases Seed 1 Cancers Clustering if seed 1 and 2 are similar Figure 2: Example of a stomach disease similarity network. Vertices are diseases and edges are similarities between dis- eases. Red nodes are malignant diseases and blue nodes are benign ones. The dotted red circle contains diseases detected by a single querying seed 1. The solid blue circle encloses clustering result by knowing seed 1 and seed 2 are from dif- ferent communities. However, local clustering from either Dr. Wang or Dr. Kitagawa returns their own collaboration subgroup, which is not appropriate due to diversity concerns. On the other hand, assuming Dr. Wang and Dr. Kitagawa are in the same research community and querying jointly from both of them may return a larger community with diverse researchers relevant to them. The final recommendation can be decided by the proximity score of each researcher with respect to the two query seeds. More importantly, knowing that seed nodes are different can sep- arate a dense subgraph into semantically meaningful communities. Fig. 2 is a network of stomach diseases where red nodes are malig- nant tumors and blue nodes denote benign stomach diseases such as gastritis and ulcer. By conventional local clustering method, either query from seed 1 or from both seed 1 and 2 will enclose polyps into group of malignant diseases, which is not correct. On the con- trary, with domain knowledge that seed 2 is benign and different from seed 1, stomach polyps can be well divided from malignant tumors by our proposed method. This helps the automatic medical diagnosis system to correctly identify a specific disease [13]. Generally, existing local graph clustering methods cannot take advantage of seed node labels, and prefer well-structured networks [8]. For networks without clear community structures, the performance of such methods is limited. In this paper, we introduce a method to leverage supervision of labeled seed node. If a clustering task has multiple seed nodes, we assume that the relationships of seed nodes are already known: we know that which seeds should be in the same community, and which should not. We name this local graph clustering problem with labeled seed nodes as constrained local graph clustering. To solve the constrained local graph clustering problem, we pro- pose a colored random walk (CRW) algorithm, which is motivated by random walk based local graph clustering [1]. In addition to us- ing random walkers to explore a network, we label seed nodes and random walkers by colors, where seed nodes believed to be in the same community are marked with identical color. The algorithm launches colored random walkers for each color starting from the seed nodes with the corresponding color. Unlike previous random walk methods [3, 20] where walkers change nothing to the graph, the colored random walker will leave colors to every node it visited. The colors on each node will affect the transition probability of subsequent walkers. In general, a walker is more likely to visit nodes with the same color. In other words, walkers are attracted by nodes with the same color and repulsed by nodes with different colors. The attraction and repulsion are affected by the amount of colors accumulated on each node. By this means, communities with different color of seed nodes are divided, even in unclear network structures. For local graph clustering with single seed node, the proposed colored random method still outperforms the traditional approaches. This is due to the attraction of walkers by the community members which are assigned with the same color, especially by the central influential nodes with large degrees in each community. These nodes cumulate a higher amount of color by being visited more frequently (Fig. 9), and the cumulative color, in turn, prevents the random walkers from fleeing their community. Due to the same reason, our method is robust to the position of the seed nodes. No matter where is the seed node, the influential nodes remain the same, and so do the corresponding communities. This is different with the previous approaches, in which the performance degrades if seed nodes are close to cluster boundaries. Please refer to Section 5 for more details. We formulate the colored random walk algorithm by power iterations, and introduce a heuristic speedup strategy to reduce the running time. By the nature of local graph clustering, nodes far away from seeds are not important. The strategy can speed up the searching process by more than 100 times while increasing the local clustering accuracy at the same time. Experiments show that the proposed algorithm outperforms the state-of-the-art baseline methods on both accuracy and speed. The rest of the paper is organized as follows. Related works are discussed in Section 2. Section 3 explains the detail of our method. In Section 4, extensive experimental results are presented and analyzed. Section 5 provides two case studies for deeper insight of the proposed method, and finally Section 6 draws the conclusion. 2 RELATED WORK The current local graph clustering methods have no restriction on seed nodes. Random Walk with Restart [20] uses Markovian random walkers to explore the networks. With certain probability, the walkers jump back to seed nodes. The local clusters are cut from nodes frequently visited by the walkers. Heat Kernel Diffusion [8] instead adopts values diffused from seed nodes to evaluate how close the nodes are to the seeds. Query-biased Densest Connect Subgraph method [22] weights nodes by random walk. Then the local community is selected to minimize a predefined goodness function. All these methods assume seed nodes are from a single cluster. -4-3-2-101234-4-3-2-101234
Semi-supervised learning, including semi-supervised cluster- ing with constrains, gained success as they do not require a large amount of human annotations [4]. In semi-supervised learning, the labels are propagated from labeled seed instances by random-walk- like mechanisms through the similarity graph, and then a classifier is trained to categorize all other instances [24, 25]. However, tradi- tional semi-supervised learning focuses more on classification than clustering, and is a heavy framework since it considers both local and global consistency during the learning process. The colored random walk is also relevant to the Vertex-reinforced Random Walk (VRW), which views graph nodes as urns [16]. Each time the random walker visits a node, it puts a ball with its color into the corresponding urn. The further visiting probability of the random walker is reinforced by the numbers and colors of balls in each node. Though it is proved that single-color VRW is guaranteed to converge to a distribution [2], the converged distribution is not unique due to the randomness of the walker, which diminishes the performance of VRW in local graph clustering. Moreover, the theories of multi-color VRW are still underdeveloped. 3 OUR PROPOSAL Local graph clustering requires proximity measurement to decide which nodes should be in a local cluster. Random walk based ap- proaches such as Random Walk with Restart (a.k.a. Personalized PageRank) are commonly used to evaluate the node proximity in graphs [20]. Compared with other methods, random walk takes advantage of local network structure and reports better perfor- mance [9]. In this work, we extend the random walk method to colored random walk to solve the problem of local graph clustering with constrains. transition matrix with entries Pji = Aij / 3.1 Preliminaries The local graph clustering targets on a graph G = (V , E), where V is the node set and E is the edge set. The adjacency matrix of graph G is denoted by A, whose element Aij is the edge weight between node Vi and node Vj. P is the transposed column-stochastic j Aij. The main symbols and their definitions can be found in Table 1. The core function of local graph clustering is to measure the proximity of each node with the seed node(s). In Random Walk with Restart (RWR), a random walker starts from the seed node q, and randomly walks to a neighbor node according to edge weights. At each time point (t + 1), the walker continues walking with probability α or returns to the query node q with probability (1−α ). The proximity score of node p with respect to node q is defined as the converged probability that the walker visits node p: c(t +1) = α · P · c(t ) + (1 − α ) · s (1) th where s is the column vector of the initial distribution with q entry as 1 and all other entries 0, and c is the current distribution vector whose p th entry is the visiting probability of node p. RWR can be extended to accommodate multiple seed nodes. For n seed nodes, the initial vector s consists of entries with value 1/n for each seed node. However, if seed nodes are from multiple clusters, RWR cannot use such information. One may run RWR multiple times, each for a single seed node, yet this approach fails to take full Table 1: Main symbols Symbol G = (V , E) A, P c, s α, θ ψ (t ) λ1, λ2 Definition graph G with node set V and edge set E adjacency matrix and transition matrix current and initial color distributions forward probability and color threshold decaying factor at time t attraction and repulsion coefficient advantage the prior knowledge of cluster membership of the seed nodes. For example, if two seed nodes are known from different clusters, walkers of each seed node should avoid walking into the other seed’s cluster. We give the formal definition of constrained local graph clustering as follows. Definition 3.1 (Constrained local graph clustering). Given graph G = (V , E) and seed node set with K different colors, the con- strained local graph clustering aims to find K local communities containing seed nodes of each color. 3.2 The Colored Random Walk To solve the constrained local graph clustering problem, we propose a Colored Random Walk (CRW) method in this section, which takes advantage of seed node labels. Seed nodes of the same cluster are assigned with identical color, and seed nodes from different clusters have different colors. For each color, the algorithm sends out a colored walker from the corresponding seed nodes to explore the network. During the exploration, the walker leaves a certain amount of colored values to the visited nodes, similar to the RWR method in Eq. 1. The main difference between CRW and conventional RWR is that the movement of walkers is affected by existing colors of nodes. Generally, a walker is more likely to visit nodes with identical colors, and is less likely to visit nodes with distinct colors. Since color values are changed at each time point t, the transition matrix P should be updated accordingly. For a constrained local graph clustering task with K colors, the CRW algorithm maintains K transition matrix P(k,t ) for color k at time t. Then the Eq. 1 expands to a new power iteration formula (2) where c(k,t ) is the distribution of color k at time t, and s(k ) is the initial vector of color k, respectively. c(k,t +1) = α · P(k,t ) · c(k,t ) + (1 − α ) · s(k ) 3.3 Transition Matrix Reinforcement The transition matrix P(k,t ) in Eq. 2 is based on the original transi- tion matrix P, and is reinforced by all current color distributions c(ℓ,t ), 1 ≤ ℓ ≤ K. The reinforcement is done by P(k,t ) = P ◦ (1 + R(k,t ) ) (3) where operator ◦ is the Hadamard (entry-wise) product of two matrices with the same size [7]. By Hadamard product, we avoid adding new edges to graph G, and the reinforcement is done by weight adjustment of existing edges. In the Eq. 3, ‘1’ is used to keep the original transition matrix P as part of the reinforced transition
λ1 −λ2 ℓ matrix P(k,t ). R is a coefficient matrix defined by its entry R(k,t ) ij = c(ℓ,t ) i · λk ℓ, λk ℓ = k = ℓ k ℓ (4) In Eq. 4, if the factor λk ℓ is defined as positive when color k equals color ℓ, and negative otherwise. This definition increases the likelihood of a walker going to the node with identical color, and decreases the likelihood of walking to the node with different colors. The mechanism can be understood as that, a walker is attracted by nodes with the same color, and repulsed by nodes with different colors. Here we denote by λ1 the attraction coefficient and by λ2 the repulsion coefficient. ij ← 0 if P(k,t ) / ij ← P(k,t ) By Eq. 3 and 4, the entries of P(k,t ) are possibly negative. To make P(k,t ) as a transition matrix meaningful, it is normalized before used in Eq. 2. The normalization includes two parts. First, the negative entries of P(k,t ) are set to be zero: P(k,t ) < 0, and then each entry of the matrix is divided by the sum of its column to make the matrix column stochastic: P(k,t ) . i P(k,t ) 3.4 Convergence Unlike the traditional random walk methods which have stationary transition matrix P, the transition matrix P(k,t ) of colored random walk varies at each time t. To guarantee the convergence of Eq. 2, we introduce a decaying mechanism changing P(k,t ) to P(k,t ) (Eq. 5), where the value of function ψ (t ) approaches to 0 when t increases. Then the colored random walk is adapted to Eq. 6. ij ij ij P(k,t ) = ψ (t ) · P(k,t ) + (1 − ψ (t )) · P(k,t−1) t = 0 t ≥ 1 (5) c(k,t +1) = α · P(k,t ) · c(k,t ) + (1 − α ) · s(k ) (6) Here the definition of P remains the same with Eq. 3 and Eq. 4. Intuitively, when ψ (t ) → 0, Pk,t → Pk,t−1, meaning that Pk,t converges. With a converged Pk,t , the proximity score vector c(k,t ) converges, similar to the conventional random walk. The formal proof of convergence of c(k,t ) is given by Theorem 3.2. Theorem 3.2 (Convergence of colored random walk). Both P(k,t ) and c(k,t ) in Eq. 5 and Eq. 6 converge if ψ (t ) = Ω(αt ). Proof. To ease the description, we combine distributions, rein- forced transition matrices, and initial vectors of all color at time t by c(K,t ) c(t ) = P(2,t ) P(1,t ) c(1,t ) c(2,t ) ... s(1) s(2) ... s(K ) Evidently, Eq. 7 equals to Eq. 6 for all color 1 ≤ k ≤ K. Similarly we define P(t ) as diagonal combination of all P(t,k ). c(t +1) = α · P(t ) · c(t ) + (1 − α ) · s (7) For convergence of decayed transition matrix P(t ), it is easy to P(K,t ) . . . 0  , P(t ) =   0  , s =   know that for t ≥ 1, ∥P(t ) − P(t−1)∥1 = ψ (t )∥P(t ) − P(t−1)∥1 ≤ 2ψ (t ) P :P, S, α, λ1, λ2, T , ψ (t ) Algorithm 1: The Colored Random Walk Input Output:c 1 K ← |S|; 2 for k ← 1 to K do s(k ) ← 0; for each x ∈ S[k] do s(k ) 3 4 x ← 1/|S[k]| ; 5 s ← [s(1); s(2); . . . ; s(K )]; 6 c ← s; 7 P(0) ← IK ⊗ P; 8 P ← P(0); 9 Λ ← IN · λ1; 10 Set all non-diagonal entries of Λ as −λ2; 11 Λ ← Λ ⊗ IK ; 12 for t ← 0 to T do ; N ·K c ← α · P · c + (1 − α )s; r ← Λ · c; R ← r ⊗ 1T ˆP ← P(0) ◦ (1 + R); for each i and j do ˆPij ← 0 for ˆPij < 0; ˆPx j; P ← ψ (t ) · ˆP + (1 − ψ (t )) · P; for each i and j do ˆPij ← ˆPij / x 13 14 15 16 17 18 19 For convergence of the color distribution c(t ), we define ∆(t ) = ∥c(t ) − c(t−1)∥1. Then ∆(t +1) = ∥c(t +1) − c(t )∥1 = ∥α P(t +1)c(t ) − α P(t )c(t−1)∥1 = α∥P(t +1)c(t ) − P(t )c(t−1)∥1 = α∥P(t +1)c(t ) − P(t +1)c(t−1) + P(t +1)c(t−1) − P(t )c(t−1)∥1 = α∥P(t +1) (c(t ) − c(t−1) ) + (Pt +1 − Pt )ct−1∥1 ≤ α∥P(t +1) (c(t ) − c(t−1) )∥1 + α∥(P(t +1) − P(t ) )ct−1∥1 ≤ α∥P(t +1)∥1∥c(t ) − c(t−1)∥1 + α∥P(t +1) − P(t )∥1∥ct−1∥1 ≤ α∥c(t ) − c(t−1)∥1 + α∥P(t +1) − P(t )∥1 ∆(t−1) + 2α ≤ α ∆(t ) + 2αψ (t + 1) ≤ α 2 ψ (t ) + 2αψ (t + 1) ≤ αt ∆(1) + 2 ατ ψ (t − τ + 2) t 2 τ =1 To make ∆(t +1) converge to 0, a sufficient condition is thatψ (t ) = Ω(αt ), which means 0 ≤ cψ (t ) ≤ αt for any t > t0 and c > 0. Given ψ (t ) ≤ 2αt−1 for t > t0, ∆(t +1) ≤ 2αt + 2 ατ + 2 αt +1 t τ =t +2−t0 t +1−t0 τ =1 < 2αt + 2(t0 − 1)αt−t0+2 + 2(t − t0 + 1)αt +1 < 2t0αt−t0+2 + 2(t − t0 + 1)αt +1 → 0 □
5 6 7 8 9 10 11 12 13 14 15 16 17 18 3.5 The Algorithm By Eq. 7, we can formulate the colored random walk as shown in Algorithm 1. The input of the algorithm includes original transition matrix P, seed node sets S of all colors, penalty factor α, attraction coefficient λ1 and repulsion coefficient λ2, together with the number of iterations T and the decaying faction ψ (t ). The output of the algorithm is the color distribution vector c for all colors. The details of the algorithm are as follows. Lines 1 – 11 initialize the algorithm. Specifically, in line 1 we get the number of colors K by the number of seed node sets in S. Lines 3 and 4 prepare for the initial color vectors s(k ) for each color k. From line 5 to line 8 we combine initial distributions and transition matrix of all colors together by s and P(0), as explained in Section 3.4, and initialize the color distribution vector c and the modified transition matrix P. To calculate the color reinforcement for each node, we create a N × N factor matrix Λ (lines 9 to 11), which originally has the value of λ1 on all diagonal entries and−λ2 on all non-diagonal entries. The size of matrix Λ is then amplified (line 11) by Kronecker Product [7] with an identity matrix IK , to multiply with vector c (line 14), which is a matrix-form equivalence to the operation in Eq. 4. The for loop from line 12 to 19 is the power iteration process. We first generate the color distribution for next iteration (line 13), and then generate the reinforcement vector r (line 14). Vector r is repeated to form a matrix R (line 15), which means that the rein- forcement is fair to every node. After generating new reinforced matrix ˆP (line 16), we remove negative entries (line 17) and nor- malize it to column stochastic (line 18), and update the decayed transition matrix for the next iteration (line 19). The power iteration part of the algorithm is critical in analyzing the time complexity, where all operations including calculations of color distribution (line 13), reinforcement (line 14) and decayed transition matrix (line 19) take O (K|E|), if we base the operations on sparse matrices. Therefore, the overall time complexity is O (T K|E|). As empirically reported in [9], T = 10 is sufficient for the conver- gence of distribution c. With the distribution c for each color, we rank nodes by their colors and cut local communities by sweeping from the top-rank node and find a cluster with minimum conductance [1]. 3.6 Speedup Strategy Time complexity of the proposed colored random walk algorithm is proportional to the number of edges |E|, implying that the algorithm searches the entire graph for local clusters. This is undesirable as the task of local clustering focuses only on the nodes close to the seeds. In this section, we propose a localized variant of Algorithm 1, with the basic idea that random walkers only search the influential nodes around the seed nodes. Unlike Algorithm 1 where the walkers go as far as T -hops from the seeds, we only consider nodes with colors above a given tolerance threshold θ. As the random process in Eq. 2 can be viewed as the diffusion of colors from each node to its neighbors, the key point of Algorithm 2 is that only colors with values above a threshold θ will diffuse to next iteration. The intuition behind this is that small color amount will not affect colors of next iteration very much. When a walker Algorithm 2: The Localized Colored Random Walk Input Output:c :P, S, α, λ1, λ2, T , θ 1 for each color k do 2 c(k ) ← 0; for each node i ∈ S[k] do c(k ) 3 i ← 1/|S[k]| ; 4 for t ← 1 to T do ˆc(k ) ← 0; for each color k do i > θ do for each c(k ) p ← 0; for each neighbor j of node i do pj ← P(k,t ) for each color ℓ do ij ; if ℓ = k then pj ← pj + λ1 · c(ℓ) else pj ← pj − λ2 · c(k ) ; j ; j · pj / p ; + α · c(k ) i if pj < 0 then pj ← 0 ; for each pj > 0 do ˆc(k ) j ← ˆc(k ) for each seed node i ∈ S[k] do + (1 − α )/|S[k]|; i ← ˆc(k ) ˆc(k ) i j for each color k do c(k ) ← ˆc(k ) ; reaches a node with color value less than θ, it stops there and does not go to the node’s neighbors, which prevents the process from searching too many further nodes. We also neglect the decaying function ψ (t ) and reinforce the node on demand (with color value greater than θ) to accelerate the algorithm. In Algorithm 2, lines 1 – 3 initialize the color distribution accord- ing to seed nodes, and lines 4 – 18 simulate the power iteration in a heuristic way. In this heuristic algorithm, we remove the process of decaying, as it is shown that the result of experiments without decaying is similar. Specifically, at each iteration (line 4) the current color scores are in c(k ) for color k, and the new scores for next iter- ation are calculated and stored in ˆc(k ). Note that, c(k ) is changed to ˆc(k ) after the scores of all colors are updated (line 18). Note that the color scores are updated simultaneously. By this strategy, though color values diminish at each iteration, the summation of colors remains its majority value. The difference turns out to be less than 1%. 4 EXPERIMENTAL VALIDATION Extensive experiments are conducted on both synthetic and real- world data to evaluate the effectiveness and efficiency of the pro- posed algorithms. All experiments are performed on a server with Redhat OS, Xeon 2.6GHz CPU, and 64G memory. The core functions are implemented by C++. Each experimental result is gathered on average of 10,000 trials. The querying seeds are randomly selected. The accuracy is mea- sured by F1-score, which is defined as 2pr /(p + r ), where p and r
are precision and recall, respectively. Table 2: 1-query accuracy for non-overlapping clusters 4.1 Non-overlapping Clusters We use the following networks with non-overlapping ground truth clusters to test the effectiveness of our proposed method. The clus- ters are large enough to test the searching ability of the colored random algorithm algorithm. Disease networks. The networks contain a disease similarity net- work [13] with 9,721 diseases. For this network, there are two sets of clustering ground truth for the disease network, with 17 classes and 47 classes, respectively. We use both sets of ground truth, and denote the experiments by Disease 1 and Disease 2, accordingly. Disease 3 is a phenotype similarity network with 5,080 diseases in 20 categories [21]. Gene network. The gene network represents the functional rela- tionship between 8,503 genes [21] with 200 pathway labels. Author network. This network comprises 20,111 authors from 4 research areas: Data Mining, Data Base, Information Retrieval and Machine Learning. The edge of this network is the number of papers co-published by the authors. Conference network. This is a full graph comprises 20 main con- ferences from the same 4 areas with the author network. The edges are cosine similarity of keywords from the publication of the con- ferences. We use research areas as the ground truth for both author network and conference network. Since no previous method is designed for the constrained local graph clustering problem, we compare the proposed colored ran- dom walk (CRW) method with the following five state-of-the-art local clustering methods, which support multiple seed node query- ing: Random Walk with Restart (RWR) [20] uses Eq. 1 to calculate the proximity score of each node; Heat Kernel Diffusion (HKD) [8] measures node proximity by heat kernel network diffusion; Query- biased Densest Connect subgraph (QDC) [22] assigns weights to nodes using RWR scores and finds the query biased densest sub- graph; Vertex-reinforced Random Walk (VRW) [12] reinforces the next-hop visiting probability by previous visiting counts. For our method CRW, we use the localized colored random walk (Algo- rithm 2) with α = 0.9, λ1 = 1000, λ2 = 10, θ = 10−5. Effectiveness evaluations are conducted on the selected networks. Since the accuracy difference between our two algorithms is small, we present the results of Algorithm 2 only. Table 2 shows the result of single seed node querying. By column “CRW” we only use one seed node with a single color, so there is no repulsion of different colors, which serves as a baseline of our method. From the results in Table 2, it can be seen that even the basic CRW can outperform the state-of-the-art methods in terms of clustering accuracy. In Table 2 we add an extra column of CRW1 1, which randomly picks up two seed nodes of different colors to examine the impact of color repulsion. Comparison of CRW and CRW1 1 shows that resistance of distinct colors helps to improve the clustering accuracy significantly. On all datasets CRW1 1 achieves the best performance. Table 3 shows the result of 2-seed-node query on the same 6 datasets. For each method except CWR2 2 we use two seed nodes from the same cluster. CWR2 1 by choosing 2 colors, each having two seeds. Though initially VRW is not designed 2 is similar to CWR1 F1-score Disease 1 Disease 2 Disease 3 Gene Author Conference RWR HKD QDC VRW CRW CRW1 1 0.33 0.37 0.42 0.53 0.24 0.38 0.05 0.22 0.27 0.40 0.42 0.71 0.34 0.49 0.30 0.16 0.30 0.66 0.27 0.48 0.30 0.06 0.03 0.38 0.11 0.32 0.14 0.09 0.01 0.40 0.08 0.25 0.13 0.15 0.01 0.57 Table 3: 2-query accuracy for non-overlapping clusters F1-score Disease 1 Disease 2 Disease 3 Gene Author Conference RWR HKD QDC VRW CRW CRW2 2 0.40 0.44 0.43 0.61 0.26 0.39 0.05 0.35 0.32 0.41 0.43 0.91 0.41 0.57 0.38 0.21 0.35 0.89 0.30 0.54 0.33 0.07 0.02 0.45 0.11 0.38 0.19 0.08 0.01 0.40 0.10 0.29 0.10 0.17 0.01 0.57 for multiple seed nodes, we make it work by adding extra random walkers that are reinforced by rules of the original VRW. Comparing Table 3 with Table 2, the conclusion is that almost all methods improve clustering accuracy by having more seeds. Furthermore, CWR gains more improvement on average. Again CRW1 1 and CRW2 2 have the best performance among all chosen methods. 4.2 Overlapping Clusters Intuitively colored random walk works with non-overlapping clus- ters. However, due to its color-agglomerating nature, it also works for detecting overlapping clusters. Table 4 lists the statistics 4 of networks with overlapping ground truth communities (Amazon, DBLP, YouTube, LiveJournal). These datasets are publicly available at http://snap.stanford.edu. For the quality of clusters, we select clusters with size larger than 10, and within ±3σ range of the av- erage cluster size. The last column of Table 4 shows the mean and standard deviation (σ) of the chosen clusters. Table 4: Networks with overlapping communities Name Amazon DBLP YouTube LiveJournal |V | 334,863 317,080 1,134,890 3,997,962 |E| 925,872 1,049,866 2,987,624 34,681,189 # clusters 75,149 13,477 8,385 287,512 avg (σ) 13.23 (11.24) 10.87 (20.41) 10.22 (18.81) 23.67 (24.75) Table 5: 1-query accuracy on overlapping communities F1-score Amazon DBLP YouTube LiveJournal RWR HKD QDC VRW CRW CRW1 1 0.92 0.93 0.41 0.76 0.05 0.30 0.69 0.84 0.92 0.72 0.27 0.81 0.93 0.59 0.06 0.74 0.90 0.65 0.19 0.74 0.70 0.57 0.13 0.42
Table 6: 2-query accuracy on overlapping communities F1-score Amazon DBLP YouTube LiveJournal RWR HKD QDC VRW CRW CRW2 2 0.93 0.96 0.42 0.81 0.05 0.42 0.69 0.86 0.95 0.78 0.38 0.84 0.94 0.61 0.06 0.75 0.91 0.73 0.21 0.79 0.79 0.68 0.16 0.45 The clustering results are listed in both Table 5 and Table 6. It shows that 2-seed quires are generally better than single seed node. Among all methods HKD is relatively good since it is refined for the small overlapping communities. However, more seed nodes do not improve its performance very much (only 1% on average). The probability-reinforced methods VRW and CRW gain the biggest improvement by adding an extra seed. Once again our proposed method leads the accuracy measurements. For seed nodes more than 2, Fig. 3 shows that most methods perform better with more seeds. However, CRW, VRW and QDC take better advantage of more seed nodes, as they all have mechanisms of reinforcement on node weights. Fig. 3 gives the results where seed nodes are from a single commu- nity. However, when seed nodes are from distinct communities, the other methods cannot use this relationship, so their performances are very low (Fig. 4). On the contrary, our method takes advantage of this cannot-link information, resulting in the increased accuracy with more communities (colors). We also evaluate the efficiency of the selected methods on datasets provided in Table 4, since networks are relatively large. To be fair, the running time is measured on single color and single node on every method. Fig. 5 shows that CRW is one of the fastest methods of all, comparable to the HKD method, and is generally 100 times faster than the other approaches. Generally, CRW takes no more than 200 milliseconds in finding a target local cluster. For the scala- bility of colors and seed nodes, the running time of CRW does not increase with number of seed nodes, and is proportional with the number of colors. 4.3 Impact of Color Attraction and Repulsion To analyze the sensitivity of the main parameters of our method (λ1, λ2, θ), we generate synthetic networks by the standard LFR benchmark [10]. We first test the impact of attraction and repulsion of Algorithm 1, by varying factor λ1 and λ2 on the synthetic graph of Fig. 6. The results are presented in Fig. 7, in terms of accuracy vs. attraction λ1 and repulsion λ2. We set ψ (t ) = 0.9t is this experiment. First we try our algorithm on the traditional problem of local graph clustering, where there is only one target community and accordingly only one color (Fig. 7a). We randomly choose 1 seed node (1×1) or 2 seed nodes (2×1) from a community, or 4 seed nodes from 2 different communities (2×2). Since there is no different color, we use attraction coefficient λ1 only and set λ2 = 0. In Fig. 7a the x- axis is the logarithmic value of λ1 and y-axis is the F1-score. When λ1 is close to 0, our algorithm degenerates into the conventional random walk based local clustering. By this means we can compare our methods with the basic random walk, as well as measuring the sensitivity of parameters λ1 and λ2. (a) Amazon (b) DBLP (c) YouTube (d) LiveJournal Figure 3: Impact of adding more seed nodes (a) Amazon (b) DBLP (c) YouTube (d) LiveJournal Figure 4: Impact of adding more colors From curve of 1×1 we can see that the clustering accuracy in- creases obviously with the parameter λ1. This indicates that even on traditional graph clustering problem, our algorithm still outper- forms the basic random walk method. For curve 2×1, initially its accuracy is higher than 1×1 because of the information introduced by more seed nodes. However, the accuracy drops slightly after λ1 > 103, because excessively strong attraction around the two in- dividual seed nodes separates the target community. The attraction even works when seeds nodes are not from the same community, but labeled as the same color (curve 2×2). 246810Number of seeds00.20.40.60.81F1-scoreCWRHKDQDCRWRVRW246810Number of seeds00.20.40.60.81F1-score246810Number of seeds00.20.40.60.81F1-score246810Number of seeds00.20.40.60.81F1-score246810Number of color00.20.40.60.81F1-scoreCWRHKDQDCRWRVRW246810Number of color00.20.40.60.81F1-score246810Number of color00.20.40.60.81F1-score246810Number of color00.20.40.60.81F1-score
Figure 5: Running time comparison (a) Attraction only (b) Repulsion only (c) 1×2 attraction and repulsion (d) 2×2 attraction and repulsion Community 3 Figure 7: Impact of color attraction and repulsion. λ1 is the attraction coefficient and λ2 is the repulsion coefficient. Community 2 Community 1 Figure 6: A 100-node graph generated by the LFR bench- mark. The red, blue, green nodes are from 3 distinct ground- truth communities. Similar results are observed by using repulsion force only and set λ1 = 0. In experiments shown in Fig. 7b, we select 1 or 2 nodes from two adjacent communities and label them with different colors. The figures illustrate that the repulsion enhances clustering accuracy, as walkers from the two different communities are divided by distinct colors. However, overdose of repulsion (λ2 > 10) drives walkers from the border of communities and decreases the accuracy. Finally, we combine both attraction and repulsion forces together (shown in Fig. 7c and 7d). The best accuracy results come with λ1 = 104 and λ2 = 102, and the performance are maintained when these two parameters keep on increasing. The favorable λ1 and λ2 are large because of the stabilizer “1” in color reinforcement (Eq. 3), which requires large λ1 and λ2 to change the behavior of random walkers. 4.4 Evaluation of Speedup Strategy We generate a network with 1 million nodes and 15 million edges by LFR benchmark to test the impact of threshold θ on Algorithm 2. The results are presented in Fig. 8. When the threshold θ rises from 10−15 to 10−5, the average running time drops dramatically (Fig. 8a). Note that when θ → 0, Algorithm 2 reduces to a non-decaying version of Algorithm 1. The total speedup is more than 2 orders of magnitudes, when threshold increases to above 10−5. (a) Running time (b) Local clustering accuracy (c) Vector difference (d) Spearman correlation Figure 8: Impact of threshold θ on Algorithm 2 In the meantime, the clustering accuracy remains intact and even increases from θ = 10−5. Looking closely, the recall of clus- tering remains stable while the precision rises after θ > 10−5. This is because the nodes of the target local clustering (true positive) are not faraway from seed nodes. Larger threshold θ makes the search more local: thus irrelevant nodes will not be enclosed, which reduces the false positive. We also examine the difference between the output vectors from Algorithm 1 and Algorithm 2. From Fig. 8d, the summation of absolute difference value is only 1.03% at θ = 10−5 (Fig. 8c). Does AmazonDBLPYoutubeLiveJournal100105Running Time (MS)RWRHKDQDCVRWCRW-5-4-3-2-1012345-5-4-3-2-1012345 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100-20246log(61)0.50.60.70.80.91F1-Score1#12#12#2-20246log(62)0.50.60.70.80.91F1-Score1#22#260.646F1-Score0.84log(62)2log(61)1200-2-260.646F1-Score0.84log(62)2log(61)1200-2-210-1510-1010-5Threshold 3102104106108Microseconds10-1510-1010-5Threshold 30.80.850.90.951F1-score10-1510-1010-5Threshold 300.050.10.150.2Difference10-1510-1010-5Threshold 30.80.850.90.951Spearman Correlation
分享到:
收藏