Efficient Core Decomposition in Massive Networks

James Cheng (1), Yiping Ke (2), Shumo Chu (3), M. Tamer Özsu (4)

(1,3) School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore — j.cheng@acm.org, shumo@pmail.ntu.edu.sg
(2) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China — yp.ke@siat.ac.cn
(4) David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada — tamer.ozsu@uwaterloo.ca

Abstract—The k-core of a graph is the largest subgraph in which every vertex is connected to at least k other vertices within the subgraph. Core decomposition finds the k-core of the graph for every possible k. Past studies have shown important applications of core decomposition, such as studying the properties of large networks (e.g., sustainability, connectivity, centrality), solving NP-hard problems efficiently in real networks (e.g., maximum clique finding, densest subgraph approximation), and large-scale network fingerprinting and visualization. The k-core is a well-accepted concept partly because there exists a simple and efficient algorithm for core decomposition: recursively remove the lowest-degree vertices and their incident edges. However, this algorithm requires random access to the graph and hence assumes the entire graph can be kept in main memory. Real-world networks such as online social networks, however, have become exceedingly large in recent years and keep growing at a steady rate. In this paper, we propose the first external-memory algorithm for core decomposition in massive graphs. When memory is large enough to hold the graph, our algorithm achieves performance comparable to the in-memory algorithm. When the graph is too large to be kept in memory, our algorithm requires only O(kmax) scans of the graph, where kmax is the largest core number of the graph. We demonstrate the efficiency of our algorithm on real networks with up to 52.9 million vertices and 1.65 billion edges.

I. INTRODUCTION

Given a graph G, the k-core of G is the largest subgraph of G in which every vertex has degree at least k within the subgraph [1]. The problem of core decomposition in G is to find the k-core of G for all k.

Core decomposition has been shown to be an important concept in the study of graph properties and has many significant applications in network analysis. It was first introduced to simplify graph topology to aid in the analysis and visualization of networks [1], [2]. It was then recognized as an important tool for visualizing complex networks and interpreting cooperative processes in them [3], [4]. It was used to analyze complex networks [5], in particular their hierarchies, self-similarity, centrality, and connectivity. It was employed to find structural fingerprints of large-scale networks and to design effective visualization tools [6]. It was used to describe the architecture of randomly damaged uncorrelated networks [7]. It was applied to predict protein functions based on the k-cores of protein-protein interaction networks and amino acid sequences [8]. It was used to analyze the static structure of large-scale software systems [9]. In addition, the hierarchical degree core tree [10] was proposed for summarizing the structure of massive graphs and performing model validation. In graph theory, the concept of k-core has been extensively studied in random graphs to understand various graph properties [11], [12], [13], [14], [15].
The k-cores can also be used as a heuristic for maximum clique finding, since a clique of size k is guaranteed to be in a (k−1)-core, which can be significantly smaller than the original graph. Moreover, core decomposition can be applied to give a (1/2)-approximation algorithm for the densest subgraph problem [16] and a (1/3)-approximation algorithm for the densest at-least-k-subgraph problem [17] in linear time. It can also be used to approximate the betweenness score [10] and to discover dense clusters in noisy spatial data [18].

Compared with the computation of other similar concepts of cohesive groups in a network [19], such as cliques, n-cliques, n-clans, k-plexes, f-groups, n-clubs, and lambda sets, most of which are algorithmically difficult (NP-hard or at least quadratic), there exists a simple and efficient algorithm for computing k-cores.

Given a graph G, we can compute the k-core of G by recursively deleting all the vertices (together with their incident edges) with degree less than k. Batagelj and Zaversnik [20] propose a linear algorithm for core decomposition, which uses bin-sort to order the vertices and recursively deletes the vertex with the lowest degree. This algorithm, however, requires random accesses to the graph and thus assumes that the whole graph can be kept in main memory.
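To make the peeling step concrete, the following is a minimal sketch of extracting a single k-core by repeated deletion, assuming an in-memory adjacency-set representation; the function and variable names are ours, not the paper's.

```python
from collections import deque

def k_core_vertices(adj, k):
    """Vertex set of the k-core of an undirected graph, by peeling.

    `adj` maps each vertex to the set of its neighbors. Vertices of
    degree < k are deleted repeatedly (with their incident edges);
    the survivors form the k-core, which may be empty.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    queue = deque(v for v, d in degree.items() if d < k)
    removed = set(queue)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1          # edge (v, u) is deleted with v
                if degree[u] < k:       # u can no longer stay in the k-core
                    removed.add(u)
                    queue.append(u)
    return set(adj) - removed
```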
Unfortunately, many real-world networks have grown exceedingly large in recent years and are continuing to grow at a steady rate. For example, the Web graph has over 1 trillion webpages (Google), most social networks (e.g., Facebook, MSN) have hundreds of millions to billions of participants, many citation networks (e.g., DBLP, Citeseer) have millions of publications, and other networks such as phone-call networks, email networks, and stock-market networks are also massively large.

In this paper, we develop the first external-memory algorithm for core decomposition in a massive network. The result of this research can be readily applied to design efficient solutions or approximations for many important but high-complexity problems in massive real networks that cannot fit in main memory. Quite a few of these problems are described above, and all of them assume the existence of a linear in-memory algorithm to find the k-cores.

All existing in-memory algorithms are bottom-up approaches that compute a k-core before a (k+1)-core. However, a k-core is a supergraph of a (k+1)-core, and the 0-core is essentially the entire graph. Therefore, the bottom-up approach cannot be adopted to design an efficient external-memory algorithm.

We devise a novel top-down approach that recursively computes the k-cores from larger values of k to smaller ones, and progressively reduces the search space and disk I/O cost by removing the vertices in each computed k-core. Our algorithm, EMcore, consists of three key components: an efficient strategy for graph partitioning, an effective mechanism for estimating the upper bound on the core number of the vertices, and a recursive top-down core decomposition procedure.

Our graph partitioning cuts the original graph into small blocks with only one scan of the graph. This allows core decomposition to load only the relevant blocks into main memory at each recursive step. The estimation of the upper bound on the core number of vertices ensures the correctness of our top-down approach as well as reducing the overall search space. We develop an effective mechanism for estimating the upper bound and progressively refining it at each recursive step. Based on the upper bound, our top-down strategy identifies a range of k-cores to be computed in main memory by loading only the relevant blocks.

We prove that our algorithm uses only O(kmax) scans of the graph in the worst case, where kmax is the largest possible value of k for any k-core in the graph. In practice, however, our experimental results show that the actual amount of I/O required is significantly less than that of kmax scans of the entire graph.

Our experimental results on various massive real networks verify that our algorithm is both CPU-efficient and I/O-efficient. When main memory is sufficient to hold the entire network, our top-down algorithm achieves comparable performance to the state-of-the-art in-memory algorithm for core decomposition [20]. When main memory is not sufficient to keep the entire network, our algorithm is able to perform core decomposition efficiently for networks with up to 52.9 million vertices and 1.65 billion edges, while the in-memory algorithm fails on these large graphs.

TABLE I
NOTATIONS

  Symbol              Description
  M                   Size of main memory (internal memory)
  B                   Size of disk (external memory) block
  n                   Number of vertices in graph G = (V, E)
  m                   Number of edges in graph G = (V, E)
  |G|                 Size of G, defined as |G| = |E| = m
  nb(v); nb(v, G')    The set of neighbors of a vertex v in G / G'
  d(v); d(v, G')      The degree of v in G / G'
  Ck = (VCk, ECk)     The k-core of G
  kmax                The maximum core number of any vertex in G
  ψ(v)                Core number of v in G, the largest k s.t. v ∈ VCk
  ψ̄(v)                The upper bound on the core number of v in G
  Ψk                  The k-class of G, Ψk = {v : v ∈ V, ψ(v) = k}
  ψmax(X)             The maximum ψ̄ value among all vertices in X
  deposit(v, ku)      The number of v's neighbors with ψ > ku
Organization. Section II formally defines the problem and gives the basic notations. Section III describes the in-memory algorithm. Section IV presents the overall framework of our solution. Section V details the EMcore algorithm. Section VI reports the experimental results. Section VII discusses the related work. Section VIII gives the conclusion.

II. NOTATIONS AND PROBLEM DEFINITION

In this paper, we focus on large networks that are modeled as graphs. For simplicity of presentation, we discuss our algorithm for undirected graphs only. The algorithm can be extended to handle directed graphs in a way similar to the adaptation of the in-memory algorithm for directed graphs [20].

Table I shows the notations used throughout the paper. For external-memory algorithm analysis, we use the standard I/O model [21] with the following parameters: M is the main memory size and B is the disk block size, where 1 ≤ B ≤ M/2. In most practical cases, we consider the internal memory as the main memory and the external memory as the disk, though other memory hierarchies are also possible.

Let G = (V, E) be an undirected and unlabeled graph. We define n = |V| and m = |E|. We define the size of G, denoted as |G|, as the number of edges in G, i.e., |G| = m. We define the set of neighbors of a vertex v in G as nb(v) = {u : (u, v) ∈ E}, and the degree of v in G as d(v) = |nb(v)|. Similarly, for a subgraph G' = (VG', EG') of G, we define nb(v, G') = {u : (u, v) ∈ EG'} and d(v, G') = |nb(v, G')|.

The k-core [1] of G is the largest subgraph Ck = (VCk, ECk) of G such that ∀v ∈ VCk, d(v, Ck) ≥ k. The core number of a vertex v ∈ V, denoted as ψ(v), is defined as the largest k such that v is in Ck, i.e., "ψ(v) = k" means that v ∈ VCk and v ∉ VCk+1. We further define the notion of k-class as follows.

Definition 1 (k-CLASS): Given a graph G = (V, E), the k-class, Ψk, of G is defined as Ψk = {v : v ∈ V, ψ(v) = k}. In particular, the 0-class, Ψ0, is the set of vertices in G with degree 0.
The problem of core decomposition is: given a graph G, find the k-core of G for k = 0, 1, ..., kmax, where kmax is the maximum core number of any vertex in G. Equivalently, the problem is to find the k-class of G for k = 0, 1, ..., kmax. In this paper, we propose an external-memory algorithm that finds the k-classes of G when main memory is not large enough to hold the whole graph G. From the k-classes, we can easily obtain any k-core as the subgraph of G induced by the vertex set Vk = ∪_{k≤i≤kmax} Ψi.

The following example illustrates the concept of core decomposition.

Example 1: Figure 1 shows a graph G that contains 14 vertices, a, ..., n. The 0-class, Ψ0, of G is an empty set since there is no isolated vertex in G. The 1-class Ψ1 = {a, b, c, j}, the 2-class Ψ2 = {d, e, f, g, m}, and the 3-class Ψ3 = {h, i, k, l, n}. In this example, we have kmax = 3. The 0-core is the subgraph of G induced by the vertex set ∪_{0≤i≤3} Ψi. Since Ψ0 = ∅ in this example, the 0-core is the same as the 1-core, which is simply the entire graph. The 2-core is the subgraph induced by (Ψ2 ∪ Ψ3) and the 3-core is the subgraph induced by Ψ3. One can easily verify that in the k-core, every vertex is connected to at least k other vertices, where 0 ≤ k ≤ 3.

[Fig. 1. A Graph and Its k-classes]

III. IN-MEMORY CORE DECOMPOSITION

In this section, we describe the existing in-memory algorithm for core decomposition and explain why it is not suitable to be extended to an external-memory algorithm. We also discuss the advantage of the top-down approach over the bottom-up approach for core decomposition in large graphs.

Algorithm 1 describes the skeleton of the existing in-memory algorithm for core decomposition [20]. It is a bottom-up approach, as the algorithm starts the computation of the k-class from smaller values of k and moves to larger values of k.

Algorithm 1 BottomUp
Input: G = (V, E).
Output: The k-class, Ψk, of G for all k.
1. order the vertices in G in ascending order of their degree;
2. while (G is not empty)
3.    let d be the minimum vertex degree in G;
4.    Ψd ← ∅;
5.    while (there exists a vertex v with degree of at most d)
6.       Ψd ← Ψd ∪ {v};
7.       remove v and all edges incident to v from G;
8.       re-order the remaining vertices in G in ascending order of their degree;
9. output all Ψk;

The algorithm first sorts the vertices in ascending order of their degree (among vertices with the same degree, the ordering can be arbitrary). Then, it starts to compute the d-class Ψd, where d is the minimum vertex degree in G. It removes from G all vertices with degree d, together with all the edges incident to them, and puts these vertices into Ψd. After removing these vertices and their edges, the degree of some vertices that were previously connected to the removed vertices decreases. If any vertices remaining in G now have degree d or less, they cannot be in a k-class for k > d, and thus must be in Ψd. Therefore, they are added to Ψd and removed from G. This process continues until all vertices remaining in G have degree greater than d. Then, the algorithm moves on to the next iteration to compute the next k-class where k > d. The algorithm terminates when all vertices, and their edges, are removed from G.
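As a concrete reference, here is a compact in-memory rendering of Algorithm 1. Instead of literally re-sorting after every removal (Line 8), it keeps vertices in buckets indexed by their current degree, which realizes the same ascending order; the bin-sort bookkeeping of [20] additionally avoids rescanning for the smallest non-empty bucket and thus reaches O(m + n) time. Names are ours; this is a sketch, not the authors' implementation.

```python
def bottom_up_core_decomposition(adj):
    """Algorithm 1 (BottomUp), with degree buckets in place of re-sorting.

    `adj` maps each vertex to the set of its neighbors. Returns a dict
    psi mapping each vertex to its core number.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    max_deg = max(degree.values(), default=0)
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)
    psi, k = {}, 0
    for _ in range(len(adj)):
        # smallest current degree; for brevity we rescan the buckets,
        # which [20] avoids with extra bookkeeping
        d = min(i for i, bucket in enumerate(buckets) if bucket)
        k = max(k, d)            # vertices peeled at degree <= k join Psi_k
        v = buckets[d].pop()
        psi[v] = k
        for u in adj[v]:
            if u not in psi:     # u is still in the graph
                buckets[degree[u]].remove(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return psi

def k_class(psi, k):
    """The k-class: vertices whose core number is exactly k."""
    return {v for v, c in psi.items() if c == k}
```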
The following example further explains the bottom-up computation.

Example 2: Suppose that we compute the k-classes of the graph in Figure 1 by Algorithm 1. Line 1 sorts the vertices as follows: {a(1), c(1), j(1), f(2), m(2), b(3), d(3), g(3), e(4), k(4), n(4), h(5), l(5), i(6)}, where the number in parentheses indicates the degree of the vertex in the current G. Then, in the first iteration (Lines 3-8), which computes the 1-class, the vertices a, c and j are first removed from G together with their incident edges, since their degree is 1. After that, the vertices are re-ordered as {b(1), f(2), m(2), d(3), g(3), e(4), k(4), n(4), h(5), l(5), i(5)}. Then vertex b is also removed because, after removing a and c, the degree of b becomes 1. At this point, the ordered vertex list becomes {f(2), m(2), d(2), g(3), e(4), k(4), n(4), h(5), l(5), i(5)}. The first iteration completes and obtains Ψ1 = {a, c, j, b}. In the second iteration, which computes the 2-class, the vertices removed are (in order): f, m, d, g, and e. In the third iteration, which computes the 3-class, the vertices removed are (in order): n, k, h, l, and i.
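A quick sanity check of the sketch above, on a toy graph rather than the paper's Figure 1 (whose full edge list is not reproduced here):

```python
# Hypothetical toy graph: triangle {x, y, z} plus a pendant vertex w on x.
adj = {
    "x": {"y", "z", "w"},
    "y": {"x", "z"},
    "z": {"x", "y"},
    "w": {"x"},
}
psi = bottom_up_core_decomposition(adj)
assert psi == {"w": 1, "x": 2, "y": 2, "z": 2}
assert k_class(psi, 2) == {"x", "y", "z"}  # the 2-core is the triangle
```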
When main memory is sufficient to hold the entire input graph, the most costly step of the in-memory algorithm is sorting the vertices according to their degree at each iteration (Line 8). Batagelj and Zaversnik [20] use bin-sort to order the vertices and achieve O(m + n) time complexity.

When the graph is too large to be placed in main memory, the bottom-up approach fails because it requires memory space of Ω(m + n). The bottom-up approach is not suitable for designing an efficient external-memory algorithm, since it starts core decomposition over the entire graph to compute the k-class with the smallest k. Although sorting can be done in O((n/B) log_{M/B} (n/B)) I/Os (assuming that we store the vertices and their degrees separately on consecutive disk blocks), it is non-trivial to address the problem of random accesses to vertices and edges on disk during the computation.

In this paper, we propose a top-down approach, which first computes the k-classes for larger k and then moves down to smaller values of k. The top-down approach is more suitable for designing an external-memory algorithm. Another advantage of the top-down approach over the bottom-up approach is that most applications and algorithm designs using k-cores only need the k-cores with larger values of k, and sometimes only the kmax-core (e.g., for defining the nucleus of a communication network [4], for predicting protein functions [8], for approximating the densest subgraph [16], etc.). The top-down approach can obtain the k-cores with larger k without computing those with smaller k. This can result in huge savings of computational resources when handling massive networks, because the smaller the value of k, the larger is the graph from which the k-core is computed.

IV. OVERALL FRAMEWORK

In this section, we first give the overall framework of our external-memory algorithm, called EMcore, for core decomposition, and then discuss the details of each part in the next section.

1) Graph partitioning: The first step of EMcore is to partition the vertex set V of G, so that core decomposition by EMcore in later steps can be processed by loading only the corresponding subgraphs of the relevant subsets of vertices (instead of the whole graph) into main memory. We describe an efficient partitioning strategy that uses only one scan of G.

2) Estimation of the upper bound on the core number: We develop heuristics to estimate the upper bound on the core number of the vertices in each subset in the partition. The upper bound is refined progressively at each step of core decomposition.

3) Recursive top-down core decomposition based on the upper-bound core number: The main step of EMcore is a recursive procedure of core decomposition. Based on the upper bound on the core number of the vertices, EMcore recursively computes the k-classes on a set of vertices whose core number falls within a range such that the corresponding relevant subgraph can be kept in main memory. Our algorithm is top-down, meaning that we compute the k-classes with ranges of larger core numbers before those of smaller core numbers. At the end of each recursive step, we remove from G the k-classes already computed, as well as their incident edges, to reduce the search space and disk I/O cost for the next recursion of core decomposition.

Algorithm 2 PartitionGraph
Input: G = (V, E), memory size M, disk block size B.
Output: A partition of disjoint vertex sets, U = {U1, ..., Ul}.
1.  create an empty partition U;
2.  Umem ← U;   // the part of U that is currently in memory
3.  for each vertex, v ∈ V, do
4.     if (v is not connected to a vertex in any vertex set in Umem)
5.        create a new set U = {v} and add U to Umem;
6.     else
7.        find the vertex set U ∈ Umem to which v has the most connections;
8.        add v to U;
9.     if (Σ_{u∈U} d(u) = B)   // memory used for U reaches B
10.       WriteBlock(U);
11.    if (Σ_{U∈Umem} Σ_{u∈U} d(u) = M)   // memory used reaches M
12.       let Umax be the vertex set in Umem s.t. ∀U ∈ Umem, Σ_{u∈Umax} d(u) ≥ Σ_{u∈U} d(u);
13.       WriteBlock(Umax);

Procedure 3 WriteBlock(U)
1. let H be the subgraph of G that consists of all edges connected to the vertices in U;
2. estimate ψ̄(v) for each vertex v in H;
3. write H to the disk and remove U from Umem;
V. EXTERNAL-MEMORY CORE DECOMPOSITION

In this section, we discuss in detail the three parts of the overall framework, followed by a detailed analysis of the algorithm.

A. Graph Partitioning

The purpose of graph partitioning is to decompose the original large graph into a set of smaller subgraphs, so that core decomposition can be processed by loading only the relevant subgraphs into main memory. There are many existing graph partitioning algorithms [22], [23], but they are in-memory algorithms. For core decomposition in massive networks that cannot fit into main memory, we need an efficient algorithm that partitions a large graph with limited memory consumption. To this end, we devise a simple graph partitioning algorithm that requires only one scan of the original graph and has linear CPU time complexity.

The outline of the partitioning algorithm is given in Algorithm 2. The algorithm reads the input graph G from external memory once and partitions V into a set of disjoint vertex sets, U = {U1, ..., Ul}, where ∪_{1≤i≤l} Ui = V and Ui ∩ Uj = ∅ for all i ≠ j. For each vertex v, there are two cases.

The first case (Lines 4-5 of Algorithm 2) is that v is not connected to any vertex in any vertex set in Umem (the part of U that is currently in main memory). In this case, we simply create a new vertex set U with the single vertex v and add U to Umem.

The second case (Lines 6-10 of Algorithm 2) is that v is connected to some vertex(es) in some existing vertex set(s) in Umem. The algorithm finds the vertex set U ∈ Umem to which v has the most connections; that is, ∀U' ∈ Umem, |nb(v) ∩ U| ≥ |nb(v) ∩ U'|. This vertex set U can be found in O(d(v)) expected time using a hashtable. Note that we do not keep all vertices of G in the hashtable, but only those in a vertex set in Umem. After finding U, we add v to U. If the memory used for U is as large as the block size B, we write U to disk by Procedure 3.

After processing each vertex v, we also check (Lines 11-13 of Algorithm 2) whether the memory used by the current partition reaches the available memory space assigned. When this is the case, we reclaim memory by writing the vertex set with the largest current memory usage to disk.

The writing of a vertex set U to disk, as shown in Procedure 3, proceeds as follows. When we read the vertices in U from the disk (during the scanning of G), we read their edges as well. Let H be the corresponding subgraph; that is, H = (VH, EH), where VH = U ∪ {v : u ∈ U, (u, v) ∈ E} and EH = {(u, v) : u ∈ U, (u, v) ∈ E}. We write H to disk and, at the same time, release the memory occupied by H (and U). We delay the discussion of the estimation of the upper-bound core number, ψ̄(v), of each v in H to Section V-B.
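The following is a simplified sketch of the one-pass partitioning, assuming the graph arrives as a stream of (vertex, neighbor-list) records and that a caller-supplied write_block callback plays the role of Procedure 3 (persisting a set's subgraph and estimating its ψ̄ values). As in Lines 9 and 11 of Algorithm 2, memory usage is measured by the sum of vertex degrees; all names here are ours.

```python
def partition_graph(stream, M, B, write_block):
    """One-pass vertex partitioning in the spirit of Algorithm 2.

    `stream` yields each (v, neighbors) record exactly once.
    `write_block(vertex_set)` stands in for Procedure 3: it writes the
    set's subgraph to disk and frees the memory it occupied.
    """
    sets = {}       # set id -> vertices of an in-memory vertex set
    size = {}       # set id -> sum of degrees (memory used by the set)
    owner = {}      # vertex -> id of the in-memory set containing it
    next_id = 0
    for v, nbrs in stream:
        # count v's connections into each in-memory set (hashtable lookups)
        votes = {}
        for u in nbrs:
            if u in owner:
                votes[owner[u]] = votes.get(owner[u], 0) + 1
        if not votes:                      # case 1: no in-memory neighbor
            uid, next_id = next_id, next_id + 1
            sets[uid], size[uid] = {v}, 0
        else:                              # case 2: set with most connections
            uid = max(votes, key=votes.get)
            sets[uid].add(v)
        size[uid] += len(nbrs)
        owner[v] = uid
        if size[uid] >= B:                 # Line 9: the block is full
            write_block(sets[uid])
            for u in sets.pop(uid):
                del owner[u]
            del size[uid]
        if sum(size.values()) >= M:        # Line 11: memory is exhausted
            umax = max(size, key=size.get)
            write_block(sets[umax])
            for u in sets.pop(umax):
                del owner[u]
            del size[umax]
    for uid in list(sets):                 # flush whatever is left in memory
        write_block(sets.pop(uid))
```

A full implementation would also carry each set's edges so that write_block can write the subgraph H and estimate ψ̄ (Lines 1-2 of Procedure 3); this sketch tracks only vertex membership.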
Before we move on to present our main algorithm for core decomposition, we ask whether we can devise a divide-and-conquer solution that computes core decomposition on each subgraph obtained by the partition, and then merges the results to find the k-cores in the original graph. The divide-and-conquer method does not work because, at each conquer phase, this approach still needs to perform search in the merged problem space to determine the core number of each vertex, and hence it demands as much memory as an in-memory algorithm does.

B. Upper Bound on Core Number

Our top-down core decomposition algorithm requires the selection of a relevant set of vertices based on the upper bound of their core number. In this subsection, we develop heuristics to estimate this upper bound. We use ψ̄(v) to denote the upper bound on ψ(v) of a vertex v. The following lemma gives a coarse estimation of ψ̄(v).

LEMMA 1: ψ̄(v) = d(v).

Proof: It is trivial to show that ψ̄(v) = d(v) is a valid upper bound, because ψ(v) ≤ d(v).

Lemma 1 serves as an initialization of ψ̄(v) for each v. After the initialization, we can further refine ψ̄(v). The basic idea is to refine ψ̄(v) based on the ψ̄ values of v's neighbors. Intuitively, v's neighbors that have ψ̄ values lower than ψ̄(v) definitely have true core numbers lower than ψ̄(v). These neighbors cannot contribute to the degree of v in the ψ̄(v)-core, and thus if v has many such neighbors, ψ̄(v) can be further tightened.

We first define the notation ψmax(X) for a non-empty vertex set X to denote the maximum ψ̄ value among all the vertices in X. That is, ψmax(X) = max{ψ̄(u) : u ∈ X}, where X ≠ ∅. The following lemma describes how to refine ψ̄(v).

LEMMA 2: Given a vertex v ∈ V, where the current ψ̄(v) > 0, let Z contain all neighbors of v with ψ̄ values lower than ψ̄(v), that is, Z = {u : u ∈ nb(v), ψ̄(u) < ψ̄(v)}. If (d(v) − |Z|) < ψ̄(v), then ψ̄(v) can be refined as max{d(v) − |Z|, ψmax(Z)}.

Proof: Based on the definition of ψmax(Z), we know that all vertices in Z can only exist in the k-cores for k ≤ ψmax(Z), which means that the edges connecting v and its neighbors in Z cannot exist in the (ψmax(Z) + 1)-core. Therefore, if v exists in the (ψmax(Z) + 1)-core, the degree of v in it is at most (d(v) − |Z|).
If (d(v) − |Z|) > ψmax(Z), we have ψ̄(v) ≤ (d(v) − |Z|), because it is possible that in the (d(v) − |Z|)-core, v has a degree of (d(v) − |Z|). Otherwise, we have (d(v) − |Z|) ≤ ψmax(Z), which means that v cannot exist in the (ψmax(Z) + 1)-core, and thus we have ψ̄(v) ≤ ψmax(Z). In both cases, ψ̄(v) ≤ max{d(v) − |Z|, ψmax(Z)}. Finally, for the current value of ψ̄(v), we have (d(v) − |Z|) < ψ̄(v) (given) and ψmax(Z) < ψ̄(v) (by the definition of Z). Therefore, max{d(v) − |Z|, ψmax(Z)} is a tighter value of ψ̄(v), which completes the proof.

Lemma 2 can be applied iteratively to replace the current value of ψ̄(v) whenever a tighter value is possible. However, simply applying Lemma 2 might not be good enough, since it is likely that ψmax(Z) > (d(v) − |Z|), especially if Z is not small. As a result, we often choose ψmax(Z) to refine ψ̄(v), which weakens the effectiveness of Lemma 2 in obtaining a good estimation of ψ̄(v). To get a better estimation, we apply a finer refinement based on the following corollary of Lemma 2.

COROLLARY 1: Given a vertex v ∈ V, where the current ψ̄(v) > 0, let Z = {u : u ∈ nb(v), ψ̄(u) < ψ̄(v)}. For any Z' ⊆ Z, Z' ≠ ∅, if (d(v) − |Z'|) < ψ̄(v), then ψ̄(v) can be refined as max{d(v) − |Z'|, ψmax(Z')}.

Proof: Similar to the proof of Lemma 2.

Based on Corollary 1, we can select the subset Z' of Z that attains the tightest estimation of ψ̄(v), rather than restricting the estimation to the whole set Z. Procedure 4 describes our algorithm to estimate ψ̄(v).
Procedure 4 Estimate-ψ̄
1. for each v ∈ V do
2.    initialize ψ̄(v) as d(v);
3. for each v ∈ V do
4.    let Z = {u : u ∈ nb(v), ψ̄(u) < ψ̄(v)};
5.    if (d(v) − |Z| < ψ̄(v))
6.       let f(X) = max{d(v) − |X|, ψmax(X)};
7.       ψ̄(v) ← min{f(Z') : Z' ⊆ Z, Z' ≠ ∅};
8. repeat Steps 3-7;

Lines 1-2 first initialize ψ̄(v) to d(v) by Lemma 1. Then, a refinement is applied to obtain a tighter bound on the core number of each vertex in V (Lines 3-7), as explained in Corollary 1. Note that Line 5 of Procedure 4 uses the condition "d(v) − |Z| < ψ̄(v)", while Corollary 1 uses "d(v) − |Z'| < ψ̄(v)". An incorrect value of ψ̄(v) might be assigned if Lines 6-7 of Procedure 4 chose some Z' with (d(v) − |Z'|) ≥ ψ̄(v). However, such a Z' can never be selected to refine ψ̄(v), since Line 7 selects the subset of Z with the lowest f value, and Z itself, with (d(v) − |Z|) < ψ̄(v), is obviously a better refinement. Therefore, Lines 5-7 of Procedure 4 refine ψ̄(v) as stated in Corollary 1, and the correctness of the refinement is ensured.

In each refinement pass (Lines 3-7), we always refine the vertices with smaller ψ̄ values before those with higher ones. This strategy allows the refinement to tighten the bounds more quickly, since only the neighbors with low ψ̄ values are useful in tightening the bounds (by Corollary 1). The smaller ψ̄ values refined early are thus particularly effective in tightening the ψ̄ values of other vertices in the remaining process.

To tighten the value of ψ̄(v), we do not need to compute f(Z') for all subsets of Z in Line 7 of Procedure 4. The definition of f(Z') in Line 6 actually implies a dilemma: the larger the size of Z', the smaller is the value of (d(v) − |Z'|), but the larger may be the value of ψmax(Z'); a smaller Z' implies the opposite. Moreover, for subsets of the same size, clearly the vertices in Z that have a smaller ψ̄ help more in obtaining a smaller ψmax(Z') than other vertices. Therefore, we can sort the vertices in Z in ascending order of their ψ̄ values. Then, we select Z' of size from 1 to |Z|, where the size-i Z' simply contains the first i vertices in the ordered Z, until we find that f(Z') reaches its minimum. In this way, we process at most |Z| subsets of Z instead of 2^|Z| subsets.

We apply Procedure 4 to estimate ψ̄(v) for each v in H in Line 2 of Procedure 3. Since H is only a small subgraph of G, the refinement process (Lines 3-7 of Procedure 4) converges quickly. To handle the worst-case scenario, our algorithm sets a limit on the number of iterations of the refinement process: we bound the cost of the refinement by the cost of a disk I/O, i.e., we terminate the refinement once the accumulated refinement time reaches the time for a disk I/O. Therefore, the total cost of our algorithm is always bounded by the number of disk I/Os.

In Procedure 3, the subgraph H is associated with a vertex set U in the partition U. However, there may be some vertices in H that are not in U but are connected to some vertices in U. Since we may not have the degree of these vertices to initialize their ψ̄ values, we consider their ψ̄ values to be ∞ when we refine ψ̄(v) for v ∈ U.

The following example illustrates how we refine ψ̄ by Procedure 4.

Example 3: Assume that the vertex set of the graph in Figure 1 is partitioned into three sets: U1 = {a, b, c, d, e, f}, U2 = {g, h, i, j}, and U3 = {k, l, m, n}. Figure 2 shows the three corresponding subgraphs. (We use an adjacency list to store nb(v) of a vertex v. In each subgraph, we only need to keep the adjacency lists of the vertices in Ui, where 1 ≤ i ≤ 3. Thus, an edge between two vertices in the same Ui is counted twice, and the three subgraphs have 14, 15, and 15 edges, respectively.) The first number next to each vertex v is the initial ψ̄(v) = d(v), while x → y indicates that the ψ̄ value is tightened from x to y by Procedure 4.

[Fig. 2. Vertex Set Partition and Their Subgraphs]

Let us consider vertex b in Figure 2(a). We have Z = {a, c} since ψ̄(a) = ψ̄(c) = 1 < ψ̄(b) = 3 at this moment. It is easy to see that min{f(Z') : Z' ⊆ Z, Z' ≠ ∅} = 1 in this case, and thus we obtain ψ̄(b) = 1. Similarly, for vertex d, we compute Z = {b} since now ψ̄(b) = 1 < ψ̄(d) = 3. Thus, we obtain ψ̄(d) = 2. Note that we cannot use g to compute ψ̄(d), since we do not have the degree or ψ̄ information of g at the current stage; it will become available at a later stage of core decomposition.
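A sketch of the refinement pass (Lines 3-7 of Procedure 4) using the prefix scan described above, assuming ψ̄ has been initialized to d(v) with float('inf') for boundary vertices whose degree is unknown; the fixed round budget stands in for the paper's I/O-based time cap, and all names are ours.

```python
def refine_upper_bounds(adj, psi_bar, max_rounds=3):
    """Iterate Lines 3-7 of Procedure 4 up to `max_rounds` times.

    `psi_bar[v]` is the current upper bound on the core number of v
    (initialized to d(v) by Lemma 1; float('inf') for vertices whose
    degree is not yet known). Vertices with smaller psi_bar are
    refined first, which tightens the bounds more quickly.
    """
    for _ in range(max_rounds):       # the paper caps this by disk-I/O cost
        changed = False
        for v in sorted(adj, key=lambda x: psi_bar[x]):
            d = len(adj[v])
            Z = sorted(psi_bar[u] for u in adj[v] if psi_bar[u] < psi_bar[v])
            if not Z or d - len(Z) >= psi_bar[v]:
                continue              # Line 5: no refinement possible
            # Corollary 1 over prefixes Z' = Z[:i]: f(Z') = max(d-i, Z[i-1])
            best = min(max(d - i, Z[i - 1]) for i in range(1, len(Z) + 1))
            if best < psi_bar[v]:
                psi_bar[v] = best
                changed = True
        if not changed:
            break
    return psi_bar
```

On Example 3's vertex b, for instance, d(b) = 3 and the sorted Z is [1, 1]; the prefixes give f = max(2, 1) = 2 and f = max(1, 1) = 1, reproducing ψ̄(b) = 1.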
C. Recursive Top-Down Core Decomposition

We now discuss our algorithm EMcore, as outlined in Algorithm 5. EMcore first invokes PartitionGraph (i.e., Algorithm 2) to partition G. Then, it invokes Procedure 6, EmCoreStep, to compute the k-classes with k in the range [kl, ku], where kl ≤ ku, recursively from higher value ranges to lower ones.

In order to bound the memory usage, we determine each range [kl, ku] as follows. We first define Ψ̄^{ku}_{kl} = {v : v ∈ V, kl ≤ ψ̄(v) ≤ ku}; that is, Ψ̄^{ku}_{kl} is the set of vertices whose ψ̄ values fall within the range [kl, ku]. At each recursive step, we construct a subgraph that is relevant for computing, in main memory, the real core numbers of the vertices in Ψ̄^{ku}_{kl}. Let b = M/B. Then, b is the maximum number of blocks that can be held in main memory. We use b and a given ku to determine kl as follows.

Let K be the set of k values such that the vertices in Ψ̄^{ku}_k are scattered over at most b vertex sets in U. That is,

    K = {k : 1 < k ≤ ku, |{U : U ∈ U, U ∩ Ψ̄^{ku}_k ≠ ∅}| ≤ b}.

We then define

    kl = min{k : k ∈ K}, if K ≠ ∅;   kl = ku, otherwise.   (1)
Algorithm 5 EMcore
Input: G = (V, E), memory size M, disk block size B.
Output: The k-class, Ψk, of G for all k.
1. invoke PartitionGraph;
2. compute kl by setting ku = ∞ and b = M/B;
3. invoke EmCoreStep(kl, ku);

Procedure 6 EmCoreStep(kl, ku)
1.  W ← Ψ̄^{ku}_{kl};
2.  let H be the set of subgraphs of the vertex sets in U that contain any vertex in W;
3.  read H into main memory to construct the induced subgraph GW by W;
4.  invoke ComputeCore(GW, kl, ku);
5.  refine ψ̄(v) for any vertex v currently in main memory;
6.  deposit any vertex connected to a vertex in Ψk (kl ≤ k ≤ ku);
7.  remove all vertices in Ψk (kl ≤ k ≤ ku) and their edges from the subgraphs in H, and merge any of these subgraphs if their combined size is less than B;
8.  if (kl > 2)
9.     compute k'l by setting k'u = (kl − 1);
10.    write the subgraphs in H that contain no vertex in Ψ̄^{k'u}_{k'l} to the disk, unless all the subgraphs can now be kept in main memory;
11.    invoke EmCoreStep(k'l, k'u);
12. else
13.    add all the remaining vertices to Ψ1 and output Ψ1;

Intuitively, kl = min{k : k ∈ K} means that we load as many parts of the graph as possible into main memory (bounded by the memory size M) for core decomposition at each recursive step. However, in the rare case when K = ∅ (i.e., the vertices in Ψ̄^{ku}_{ku} are in more than b vertex sets in U), we simply assign kl = ku and load b vertex sets of {U : U ∈ U, U ∩ Ψ̄^{ku}_{ku} ≠ ∅} at a time into main memory to construct the subgraph for core decomposition.

To obtain kl, we keep a heap in main memory that maintains the highest ψ̄ in each vertex set U ∈ U. The highest ψ̄ among the vertices in each U ∈ U is inserted into the heap before U is written to disk during graph partitioning. Given b and ku, we can easily determine the corresponding value of kl by querying the heap.
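A small sketch of this range selection, assuming (as the top-down order guarantees) that every remaining on-disk vertex set holds only vertices with ψ̄ ≤ ku, so that a set may contain a candidate for Ψ̄^{ku}_k exactly when its maximum ψ̄ is at least k. For clarity it scans a sorted list of per-set maxima rather than maintaining the paper's heap; names are ours.

```python
def choose_kl(set_max_psi_bar, ku, b):
    """Equation (1): the smallest valid kl for the next recursive step.

    `set_max_psi_bar` lists, for each remaining on-disk vertex set U,
    the largest psi_bar among U's vertices (the values the paper keeps
    in a heap). b = M/B is the number of blocks that fit in memory.
    """
    maxima = sorted(set_max_psi_bar, reverse=True)
    if not maxima:
        return ku
    ku = min(ku, maxima[0])        # ku = infinity behaves like the global max
    kl = ku                        # fallback: K empty => kl = ku
    for k in range(ku, 1, -1):     # K only admits k > 1
        # number of sets that may contain a vertex of Psi_bar[k, ku]
        needed = sum(1 for m in maxima if m >= k)
        if needed > b:
            break
        kl = k
    return kl
```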
We now discuss how EmCoreStep computes the k-classes, where kl ≤ k ≤ ku, as outlined in Procedure 6. Let W = Ψ̄^{ku}_{kl}. In other words, W is the set of candidate vertices whose exact core number is to be computed at this recursive step. The first step is to construct the induced subgraph GW of G by W. We construct GW from the set of subgraphs of the vertex sets in U that contain a vertex in W, since all vertices in W and their inter-connecting edges are contained in these subgraphs. During the construction, we can discard those vertices that are not in W, together with their edges.

After GW is constructed, we invoke ComputeCore, as shown in Procedure 7, to compute the k-classes Ψk from GW, where kl ≤ k ≤ ku.

Procedure 7 ComputeCore(GW, kl, ku)
1.  initialize ψ(v) to be −1 for each v ∈ W;
2.  k ← kl;
3.  while (k ≤ ku)
4.     while (∃v ∈ W, (d(v, GW) + deposit(v, ku)) < k)
5.        remove v and all edges incident to v from GW;
6.     update ψ(v) to be k for each remaining vertex v in GW;
7.     k ← k + 1;
8.  for each k, where kl ≤ k ≤ ku, do
9.     Ψk ← {v : v ∈ W, ψ(v) = k};
10.    output Ψk;

An immediate question arises here: since GW is only the induced subgraph, are the core numbers computed from GW correct? We will prove the correctness of our algorithm in Theorem 1, which will also become clearer as we continue to unwrap the details of our algorithm. One key concept ensuring correctness is the deposit assigned to each vertex (Line 6 of Procedure 6), defined as follows.

Definition 2 (DEPOSIT): Given a vertex v ∈ V and an integer ku, where ku > ψ(v), the deposit of v with respect to ku is defined as deposit(v, ku) = |{u : u ∈ nb(v), ψ(u) > ku}|.

The concept of deposit is used to ensure that the number of edge connections to a vertex v in the k-core, where kl ≤ k ≤ ku, can be correctly counted even after the vertices and edges that belong to the k'-cores, for all k' > ku, have been removed. After we compute the k-classes for k ≥ kl, we deposit a "coin" to each neighbor of a vertex in these k-classes if the neighbor's core number has not yet been determined (i.e., the neighbor's ψ̄ is less than kl and its core number will be computed in a later recursive step). This "coin" accounts for an edge connection to this neighbor vertex in the computation of its core number. Using the deposits, we can safely remove all the vertices and their edges once their core numbers are computed in the current recursive step (Line 7 of Procedure 6), without compromising the correctness of core decomposition in later recursive steps. In this way, we save considerable disk I/O costs for repeatedly reading and writing these large-core-number vertices and their edges while computing the k-classes with smaller core numbers.

Procedure 7 computes the correct core numbers of the vertices in the induced subgraph GW as follows. The algorithm iteratively deletes the lowest-degree vertex and its edges, as in the in-memory algorithm. However, in selecting the lowest-degree vertex, the algorithm considers the sum of the local degree in GW and the deposit of the vertex. We remark that the initial deposit of all vertices with respect to ku = ∞ is 0. We prove the correctness of Procedure 7 in Lemma 3.
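A sketch of Procedure 7, assuming GW is given as in-memory adjacency sets over the candidate set W and that deposit holds deposit(v, ku) from earlier recursive steps (for the first step, where the paper sets ku = ∞, a finite stand-in such as the largest ψ̄ in W works); names are ours.

```python
def compute_core(GW, deposit, kl, ku):
    """Procedure 7: exact core numbers for the candidates inside GW.

    `GW` maps each vertex of W to its set of neighbors within GW.
    Returns {k: Psi_k} for kl <= k <= ku. Vertices peeled away before
    the kl-round keep psi = -1: their core number is below kl and is
    settled in a later recursive step (cf. Lemma 3).
    """
    adj = {v: set(nbrs) for v, nbrs in GW.items()}   # mutable working copy
    psi = {v: -1 for v in adj}
    for k in range(kl, ku + 1):
        # peel while some vertex has supported degree (local + deposit) < k
        stack = [v for v in adj if len(adj[v]) + deposit.get(v, 0) < k]
        while stack:
            v = stack.pop()
            if v not in adj:
                continue                             # already removed
            for u in adj[v]:
                adj[u].discard(v)
                if len(adj[u]) + deposit.get(u, 0) < k:
                    stack.append(u)
            del adj[v]
        for v in adj:                                # survivors form the k-core
            psi[v] = k
    return {k: {v for v, c in psi.items() if c == k} for k in range(kl, ku + 1)}
```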
After we compute Ψk, where kl ≤ k ≤ ku, we return to Lines 5-6 of Procedure 6: we need to refine the upper bound on the core number of the remaining vertices as well as update their deposits. We first refine ψ̄(v) for any vertex v that is currently in main memory but does not have its core number computed. The refinement of ψ̄(v) involves two steps. (1) We first set ψ̄(v) to (kl − 1) if currently ψ̄(v) ≥ kl. This is because if ψ̄(v) ≥ kl, then v must have been processed in the current recursive step; however, since v's core number was not determined, v is not in any Ψk for kl ≤ k ≤ ku, which implies that ψ(v) ≤ (kl − 1). (2) We then refine all such v by Procedure 4, since we now have the exact core numbers of more vertices, which can be used to tighten the ψ̄ values of the remaining vertices. After the refinement, we have ψ̄(v) < kl ≤ ψ(u) for every v whose core number has not been determined and every u in Ψk, where k ≥ kl.

Then, Line 6 updates the deposit of a vertex v with respect to k'u = (kl − 1). The deposit deposit(v, kl − 1) can be calculated as (deposit(v, ku) + |{u : u ∈ nb(v), u ∈ ∪_{kl≤k≤ku} Ψk}|), where the first part is the number of edge connections to v that were removed at previous recursive steps, while the second part is the number of edge connections to v that are to be removed at the current recursive step. Therefore, deposit(v, kl − 1) ensures the correct computation of ψ(v) at later recursive steps.

After the update of the deposits, we can safely remove all vertices in Ψk, where kl ≤ k ≤ ku, and their edges from the subgraphs in H (Line 7 of Procedure 6). Then, we merge any of these subgraphs whose combined size is less than B, so as to further reduce the disk I/O cost. Furthermore, we keep in main memory those subgraphs that will be used in the next recursive step and write only the rest to disk, or keep all the subgraphs in main memory if they are now small enough (Line 10 of Procedure 6).

The recursion goes on to process the next range of k, [k'l, kl − 1], until kl = 2 (Lines 8-11 of Procedure 6). Note that the upper bound of the range, k'u, for the next recursive step now changes to (kl − 1). This is because, after the refinement in Line 5 of the current recursive step, we have ψ̄(v) ≤ (kl − 1) for any vertex v whose core number has not been determined. When kl = 2, the remaining vertices must all be in the 1-class, Ψ1 (Lines 12-13).

There is still a small problem: are there any vertices in the 0-class? The case of k = 0, i.e., the 0-class, means that no vertex in this class is connected to any other vertex. Thus, the 0-class contains only the isolated vertices of the graph, which can be obtained trivially by one scan of the graph. To do this, we add a check in Algorithm 2: as we scan the input graph, if a vertex has a degree of 0, we output it as a 0-class vertex.
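The deposit roll-down of Line 6 of Procedure 6 is then a single counting pass. A sketch, assuming solved holds the union ∪_{kl≤k≤ku} Ψk just computed and neighbors(v) enumerates nb(v) from the blocks in memory; names are ours.

```python
def update_deposits(deposit, neighbors, unsolved, solved):
    """Roll deposits down from ku to k'_u = kl - 1 (Line 6, Procedure 6).

    deposit(v, kl-1) = deposit(v, ku) + |{u in nb(v) : u in solved}|:
    edges removed at earlier recursive steps plus edges about to be
    removed now. `unsolved` holds the vertices whose core number is
    still open; `solved` is the union of the classes just output.
    """
    for v in unsolved:
        gained = sum(1 for u in neighbors(v) if u in solved)
        deposit[v] = deposit.get(v, 0) + gained
    return deposit
```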
The following example explains how the top-down core decomposition is processed.

Example 4: Let us consider the three subgraphs given in Figure 2 and suppose b = 2. Then, we obtain kl = 3 and W = Ψ̄^∞_3 = {g, h, i, k, l, n}. We load the subgraphs in Figures 2(b) and 2(c) into main memory to construct the induced subgraph by W, GW, which is shown in Figure 3(a), where the number next to each vertex v is the current ψ̄(v). It is then easy to obtain the 3-class, {h, i, k, l, n}, by Procedure 7.

[Fig. 3. Induced Subgraph GW]

After computing the 3-class, we can refine ψ̄(g) from 3 to 2. We also update the deposit of any vertex that is connected to a vertex in the 3-class for k'u = (kl − 1) = 2, as shown in the following table.

TABLE II
deposit(v, 2) WHERE v ∈ {e, f, g, j, m}

  v               e   f   g   j   m
  deposit(v, 2)   2   1   2   1   2

Now we can remove all the edges connected to any vertex in {h, i, k, l, n}. After removing these edges, we only have one edge, (d, g), and three isolated vertices, e, f and j, in Figure 2(b), and only two isolated vertices, g and m, in Figure 2(c).

Next we compute for k'u = 2, and we load the subgraph in Figure 2(a) (together with that of Figure 2(c), which is still in main memory) to construct the induced subgraph by W = {d, e, f, g, m}, as shown in Figure 3(b). Using the deposits, it is easy to obtain the 2-class, {d, e, f, g, m}, by Procedure 7. After computing the 2-class, all the remaining vertices, {a, b, c, j}, belong to the 1-class.

D. Analysis of Algorithm

In this subsection, we provide the theoretical analysis regarding the correctness as well as the complexity of our algorithm EMcore.

LEMMA 3: Given W = Ψ̄^{ku}_{kl} and GW, for any v ∈ W, if ψ(v) ≥ kl, then ComputeCore (i.e., Procedure 7) correctly computes ψ(v).

Proof: If ψ(v) ≥ kl, then to compute ψ(v) we only need the kl-core of G, Ckl. In GW, the largest core number a vertex can have is ku. Thus, in computing ψ(v) using GW instead of Ckl, we are missing the edge set {(u, v) : v ∈ W, ψ(u) > ku, (u, v) ∈ E}, where kl ≤ ψ(v) ≤ ku. Since our algorithm follows a top-down approach, for every v ∈ W, if there exists u ∈ nb(v) with ψ(u) > ku, then ψ(u) must have already been computed at some previous recursive step and deposited to deposit(v, ku). Thus, we have (d(v, GW) + deposit(v, ku)) = d(v, Ckl) for all v ∈ W. Since ComputeCore removes a vertex v only if (d(v, GW) + deposit(v, ku)) < k, or equivalently, d(v, Ckl) < k, it correctly computes ψ(v) for all v ∈ W with ψ(v) ≥ kl.

Using Lemma 3, we prove the correctness of our main algorithm EMcore.