Efficient Core Decomposition in Massive Networks
James Cheng1, Yiping Ke2, Shumo Chu3, M. Tamer ¨Ozsu4
1,3School of Computer Engineering
Nanyang Technological University
Nanyang Avenue, Singapore
1j.cheng@acm.org
3shumo@pmail.ntu.edu.sg
2Shenzhen Institutes of Advanced Technology
Chinese Academy of Science, China
2yp.ke@siat.ac.cn
4David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada
4tamer.ozsu@uwaterloo.ca
Abstract—The k-core of a graph is the largest subgraph in
which every vertex is connected to at least k other vertices within
the subgraph. Core decomposition finds the k-core of the graph
for every possible k. Past studies have shown important applica-
tions of core decomposition such as in the study of the properties
of large networks (e.g., sustainability, connectivity, centrality,
etc.), for solving NP-hard problems efficiently in real networks
(e.g., maximum clique finding, densest subgraph approximation,
etc.), and for large-scale network fingerprinting and visualization.
The k-core is a well accepted concept partly because there exists
a simple and efficient algorithm for core decomposition, by
recursively removing the lowest degree vertices and their incident
edges. However, this algorithm requires random access to the
graph and hence assumes the entire graph can be kept in main
memory. Nevertheless, real-world networks such as online social
networks have become exceedingly large in recent years and
still keep growing at a steady rate. In this paper, we propose
the first external-memory algorithm for core decomposition in
massive graphs. When the memory is large enough to hold the
graph, our algorithm achieves comparable performance as the
in-memory algorithm. When the graph is too large to be kept
in the memory, our algorithm requires only O(kmax ) scans of
the graph, where kmax is the largest core number of the graph.
We demonstrate the efficiency of our algorithm on real networks
with up to 52.9 million vertices and 1.65 billion edges.
I. INTRODUCTION
Given a graph G, the k-core of G is the largest subgraph
of G in which every vertex has degree of at least k within the
subgraph [1]. The problem of core decomposition in G is to
find the k-core of G for all k.
Core decomposition has been shown to be an important
concept in the study of graph properties and has many signifi-
cant applications in network analysis. It was first introduced to
simplify graph topology to aid in the analysis and visualization
of networks [1], [2]. It was then recognized as an important
tool for visualization of complex networks and interpretation
of cooperative processes in them [3], [4]. It was used to
analyze complex networks [5], in particular their hierarchies,
self-similarity, centrality, and connectivity. It was employed to
find structural fingerprints of large-scale networks and design
effective visualization tools [6]. It was used to describe the
architecture of randomly damaged uncorrelated networks [7].
It was applied to predict protein functions based on the k-
cores of protein-protein interaction networks and amino acid
sequences [8]. It was used to analyze the static structure
of large-scale software systems [9]. In addition, hierarchical
degree core tree [10] was proposed for summarizing the
structure of massive graphs and performing model validation.
In graph theory, the concept of k-core has been extensively
studied in random graphs to understand various graph prop-
erties [11], [12], [13], [14], [15]. The k-cores can be used as
heuristics for maximum clique finding since a clique of size k
is guaranteed to be in a (k−1)-core, which can be significantly
smaller than the original graph. Moreover, core decomposition
can be applied to give a (1/2)-approximation algorithm for
the densest subgraph problem [16] and a (1/3)-approximation
algorithm for the densest at-least-k-subgraph problem [17]
in linear time. It can also be used as an approximation of
betweenness score [10] and to discover dense clusters in noisy
spatial data [18].
Compared with the computation of other similar concepts
of cohesive groups in a network [19], such as cliques, n-
cliques, n-clans, k-plexes, f -groups, n-clubs, lambda sets,
most of which are algorithmically difficult (NP-hard or at least
quadratic), there exists a simple and efficient algorithm for
computing k-cores.
Given a graph G, we can compute the k-core of G by re-
cursively deleting all the vertices (together with their incident
edges) with degree less than k. Batagelj and Zaversnik [20]
propose a linear algorithm for core decomposition, which uses
bin-sort to order the vertices and recursively deletes the vertex
with the lowest degree.
This algorithm, however, requires random accesses to the
graph and thus assumes that the whole graph can be kept in
main memory. Unfortunately, many real-world networks have
grown exceedingly large in recent years and are continuing
to grow at a steady rate. For example, the Web graph has
978-1-4244-8960-2/11/$26.00 © 2011 IEEE
51
ICDE Conference 2011
over 1 trillion webpages (Google), most social networks (e.g.,
Facebook, MSN) have hundreds of millions to billions of
participants, many citation networks (e.g., DBLP, Citeseer)
have millions of publications, other networks such as phone-
call networks, email networks, stock-market networks, etc., are
also massively large.
In this paper, we develop the first external-memory al-
gorithm for core decomposition in a massive network. The
result of this research can be readily applied to design an
efficient solution or approximation of many important but
high-complexity problems in massive real networks which
cannot fit in main memory. Quite a few of these problems
are described above, and all of these problems assume the
existence of a linear in-memory algorithm to find the k-cores.
All existing in-memory algorithms are bottom-up ap-
proaches that compute a k-core before a (k+1)-core. However,
a k-core is a supergraph of a (k + 1)-core and the 0-core is
essentially the entire graph. Therefore, the bottom-up approach
cannot be adopted to design an efficient external-memory
algorithm.
We devise a novel
top-down approach that recursively
computes the k-cores from larger values of k to smaller ones,
and progressively reduces search space and disk I/O cost by
removing the vertices in each computed k-core. Our algo-
rithm, EMcore, consists of three key components: an efficient
strategy for graph partitioning, an effective mechanism for
estimating the upper bound of the core number of the vertices,
and a recursive top-down core decomposition procedure. Our
graph partitioning cuts the original graph into small blocks by
only one scan of the graph. This allows core decomposition
to load only the relevant blocks into main memory at each
recursive step. The estimation of the upper bound on the
core number of vertices ensures the correctness of our top-
down approach as well as reducing the overall search space.
We develop an effective mechanism for estimating the upper
bound and progressively refining it at each recursive step.
Based on the upper bound, our top-down strategy identifies a
range of k-cores to be computed in main memory by loading
only the relevant blocks.
We prove that our algorithm uses only O(kmax ) scans of
the graph in the worst case, where kmax is the largest possible
value of k for any k-core in the graph. In practice, however,
our experimental results show that the actual amount of I/Os
required is significantly less than that required for kmax scans
of the entire graph.
Our experimental results on various massive real networks
verify that our algorithm is both CPU-efficient and I/O-
efficient. When main memory is sufficient to hold the entire
network, our top-down algorithm achieves comparable perfor-
mance to the state-of-the-art in-memory algorithm for core
decomposition [20]. When main memory is not sufficient to
keep the entire network, our algorithm is able to perform core
decomposition efficiently for networks with up to 52.9 million
vertices and 1.65 billion edges, while the in-memory algorithm
fails in these large graphs.
TABLE I
NOTATIONS
Description
Size of main memory (internal memory)
Size of disk (external memory) block
Number of vertices in graph G = (V, E)
Number of edges in graph G = (V, E)
Size of G, defined as |G| = |E| = m
The set of neighbors of a vertex v in G / G
The degree of v in G / G
The k-core of G
The maximum core number of any vertex in G
Core number of v in G, the largest k s.t. v ∈ VCk
The upper bound on the core number of v in G
The k-class of G, Ψk = {v : v ∈ V, ψ(v) = k}
The maximum ψ value among all vertices in X
The number of v’s neighbors with ψ > ku
Symbol
M
B
n
m
|G|
)
nb(v); nb(v, G
)
Ck = (VCk , ECk
d(v); d(v, G
)
kmax
ψ(v)
ψ(v)
Ψk
ψmax (X)
deposit (v, ku)
Organization. Section II formally defines the problem and
gives the basic notations. Section III describes the in-memory
algorithm. Section IV presents the overall framework of our
solution. Section V details the EMcore algorithm. Section VI
reports the experimental results. Section VII discusses the
related work. Section VIII gives the conclusion.
II. NOTATIONS AND PROBLEM DEFINITION
In this paper, we focus on large networks that are modeled
as graphs. For simplicity of presentation, we discuss our
algorithm for undirected graphs only. The algorithm can be
extended to handle directed graphs in a way similar to the
adaption of the in-memory algorithm for directed graphs [20].
Table I shows the notations used throughout the paper.
For external-memory algorithm analysis, we use the standard
I/O model [21] with the following parameters: M is the
main memory size and B is the disk block size, where
1 B ≤ M/2. In most practical cases, we consider the
internal memory as the main memory and the external memory
as the disk, though other memory hierarchies are also possible.
Let G = (V, E) be an undirected and unlabeled graph. We
define n = |V | and m = |E|. We define the size of G, denoted
as |G|, as the number of edges in G, i.e., |G| = m. We define
the set of neighbors of a vertex v in G as nb(v) = {u :
(u, v) ∈ E}, and the degree of v in G as d(v) = |nb(v)|.
= (VG , EG ) of G, we define
Similarly, for a subgraph G
)|.
nb(v, G
The k-core [1] of G is the largest subgraph Ck =
(VCk , ECk ) of G such that ∀v ∈ VCk, d(v, Ck) ≥ k. The
core number of a vertex v ∈ V , denoted as ψ(v), is defined
as the largest k such that v is in Ck, i.e., “ψ(v) = k” means
that v ∈ VCk and v /∈ VCk+1. We further define the notion of
k-class as follows.
) = {u : (u, v) ∈ EG} and d(v, G
) = |nb(v, G
Definition 1 (k-CLASS): Given a graph G = (V, E), the
k-class, Ψk, of G is defined as Ψk = {v : v ∈ V, ψ(v) = k}.
In particular, the 0-class, Ψ0, is the set of vertices in G with
degree of 0.
52
The problem of core decomposition is: given a graph G,
find the k-core of G for k = 0, 1, . . . , kmax , where kmax is the
maximum core number of any vertex in G. Equivalently, the
problem is to find the k-class of G for k = 0, 1, . . . , kmax . In
this paper, we propose an external-memory algorithm that finds
the k-classes of G when main memory is not large enough to
hold the whole graph G. From the k-classes, we can easily
obtain any k-core as the induced subgraph of G by the vertex
set Vk =
k≤i≤kmax Ψi.
The following example illustrates the concept of core de-
composition.
Example 1: Figure 1 shows a graph G that contains 14
vertices, a, . . . , n. The 0-class, Ψ0, of G is an empty set
since there is no isolated vertex in G. The 1-class Ψ1 =
{a, b, c, j}, the 2-class Ψ2 = {d, e, f, g, m}, and the 3-class
Ψ3 = {h, i, k, l, n}. In this example, we have kmax = 3. The
0≤i≤3 Ψi.
0-core is the induced subgraph of G by vertex set
Since Ψ0 = ∅ in this example, the 0-core is the same as the
1-core, which is simply the entire graph. The 2-core is the
induced subgraph by (Ψ2 ∪ Ψ3) and the 3-core is the induced
subgraph by Ψ3. One can easily verify that in the k-core,
every vertex is connected to at least k other vertices, where
0 ≤ k ≤ 3.
2
Fig. 1. A Graph and Its k-classes
III. IN-MEMORY CORE DECOMPOSITION
In this section, we describe the existing in-memory al-
gorithm for core decomposition and explain why it is not
suitable to be extended to an external-memory algorithm. We
also discuss the advantage of the top-down approach over the
bottom-up approach for core decomposition in large graphs.
Algorithm 1 describes the skeleton of the existing in-
memory algorithm for core decomposition [20]. It is a bottom-
up approach as the algorithm starts the computation of the k-
class from smaller values of k and moves to larger values of
k.
The algorithm first sorts the vertices in ascending order of
their degree (among the vertices with the same degree, the
ordering can be arbitrary). Then, it starts to compute the d-
class Ψd, where d is the minimum vertex degree in G. It
removes from G all vertices with degree of d, together with
all the edges incident to them, and puts these vertices into Ψd.
After removing these vertices and their edges, the degree of
some vertices that are previously connected with the removed
vertices decreases. If any vertices remaining in G now have
53
Algorithm 1 BottomUp
Input: G = (V, E).
Output: The k-class, Ψk, of G for all k.
let d be the minimum vertex degree in G;
Ψd ← ∅;
while (there exists a vertex v with degree of at most d)
1. order the vertices in G in ascending order of their degree;
2. while (G is not empty)
3.
4.
5.
6.
7.
8.
Ψd ← Ψd ∪ {v};
remove v and all edges incident to v from G;
re-order the remaining vertices in G
in ascending order of their degree;
9. output all Ψk;
degree of d or less, they cannot be in a k-class where k > d,
and thus must be in Ψd. Therefore, they are added to Ψd
and removed from G. This process continues until all vertices
remaining in G have degree greater than d. Then, the algorithm
moves on to the next iteration to compute the next k-class
where k > d. The algorithm terminates when all vertices, and
their edges, are removed from G.
The following example further explains the bottom-up com-
putation.
Example 2: Suppose that we compute the k-classes of the
graph in Figure 1 by Algorithm 1. Line 1 sorts the vertices
as follows: {a(1), c(1), j(1), f (2), m(2), b(3), d(3), g(3),
e(4), k(4), n(4), h(5), l(5), i(6)}, where the number in the
parentheses indicates the degree of the vertex in the current
G. Then, in the first iteration (Lines 3-8) that computes the 1-
class, the vertices a, c and j are first removed from G together
with their incident edges since their degree is 1. After that, the
vertices are re-ordered as {b(1), f (2), m(2), d(3), g(3), e(4),
k(4), n(4), h(5), l(5), i(5)}. Then vertex b is also removed
because after removing a and c, the degree of b becomes 1. At
this point, the ordered vertex list becomes {f (2), m(2), d(2),
g(3), e(4), k(4), n(4), h(5), l(5), i(5)}. The first iteration
completes and obtains Ψ1 = {a, c, j, b}. In the second iteration
that computes the 2-class, the vertices removed are (in the
order): f , m, d, g, and e. In the third iteration that computes
the 3-class, the vertices removed are (in the order): n, k, h, l,
and i.
2
When main memory is sufficient to hold the entire input
graph, the most costly step of the in-memory algorithm is
sorting the vertices according to their degree at each iteration
(Line 8). Batagelj and Zaversnik [20] use bin-sort to order the
vertices to achieve O(m + n) time complexity.
When the graph is too large to be placed in main memory,
the bottom-up approach fails because it requires memory space
of Ω(m + n). The bottom-up approach is not suitable for
designing an efficient external-memory algorithm since it starts
core decomposition over the entire graph to compute the k-
class with the smallest k. Although sorting can be done in
O( n
n
B ) I/Os (assuming that we store separately the
B log M
B
vertices and their degree on consecutive disk blocks), it is non-
trivial to address the problem of random accesses to vertices
and edges on disk during the computation.
In this paper, we propose a top-down approach, which
first computes the k-class for larger k and then moves down
to smaller values of k. The top-down approach is more
suitable for designing an external-memory algorithm. An-
other advantage of the top-down approach over the bottom-
up approach is that most applications and algorithm designs
using k-cores only need the k-cores with larger values of
k, and sometimes, only the kmax -core (e.g., for defining the
nucleus of a communication network [4], for predicting protein
functions [8], for approximating the densest subgraph [16],
etc.). The top-down approach can obtain the k-cores with
larger k without computing those with smaller k. This can
result in huge savings of computational resources for handling
massive networks, because the smaller the value of k, the
larger is the graph from which the k-core is computed.
IV. OVERALL FRAMEWORK
In this section, we first give the overall framework of our
external-memory algorithm, called EMcore, for core decom-
position, and then discuss the details of each part in the next
section.
1) Graph partitioning:
The first step of EMcore is to partition the vertex set
V of G, so that core decomposition by EMcore in later
steps can be processed by loading only the correspond-
ing subgraphs of the relevant subsets of vertices (instead
of the whole graph) in main memory. We describe an
efficient partitioning strategy that uses only one scan of
G.
2) Estimation of the upper bound on the core number:
We develop heuristics to estimate the upper bound on
the core number of the vertices in each subset in the
partition. The upper bound is refined progressively at
each step of core decomposition.
3) Recursive top-down core decomposition based on the
upper-bound core number:
The main step of EMcore is a recursive procedure
of core decomposition. Based on the upper bound on
the core number of the vertices, EMcore recursively
computes the k-classes on a set of vertices whose core
number falls within a range such that the corresponding
relevant subgraph can be kept in main memory. Our
algorithm is top-down, meaning that we compute the k-
classes with ranges of larger core numbers before those
of smaller core numbers.
At the end of each recursive step, we remove from G
the k-classes already computed, as well as their incident
edges, to reduce the search space and disk I/O cost for
the next recursion of core decomposition.
Algorithm 2 PartitionGraph
Input: G = (V, E), memory size M , disk block size B.
Output: A partition of disjoint vertex sets, U = {U1, . . . , Ul}.
// the part of U that are currently in memory
1. create an empty partition U ;
2. Umem = U ;
for each vertex, v ∈ V , do
3.
4.
5.
6.
7.
else
if(v is not connected to a vertex in any vertex set in Umem)
create a new set U = {v} and add U to Umem ;
find the vertex set U ∈ Umem to which
v has the most connections;
add v to U;
if(
u∈U d(u) = B)
WriteBlock(U);
// memory used for U reaches B
8.
9.
10.
11.
12.
13.
if(
u∈U d(u) = M)
U∈Umem
let Umax be the vertex set in Umem s.t.
d(u) ≥
∀U ∈ Umem ,
WriteBlock(Umax );
u∈Umax
u∈U d(u);
Procedure 3 WriteBlock(U )
1.
let H be the subgraph of G that consists of all edges
connected to the vertices in U;
2. estimate ψ(v) for each vertex v in H;
3. write H to the disk and remove U from Umem ;
V. EXTERNAL-MEMORY CORE DECOMPOSITION
In this section, we discuss in detail the three parts in the
overall framework, followed by a detailed analysis of the
algorithm.
A. Graph Partitioning
The purpose of graph partitioning is to decompose the orig-
inal large graph into a set of smaller subgraphs, so that core
decomposition can be processed by loading only those relevant
subgraphs into main memory. There are many existing graph
partitioning algorithms [22], [23], but they are in-memory
algorithms. For core decomposition in massive networks that
cannot fit into main memory, we need an efficient algorithm
that partitions a large graph with limited memory consumption.
To this end, we devise a simple graph partitioning algorithm
that requires only one scan of the original graph and has linear
CPU time complexity.
The outline of the partitioning algorithm is given in Algo-
rithm 2. The algorithm reads the input graph G from external
memory once and partitions V into a set of disjoint vertex sets,
U = {U1, . . . , Ul}, where
1≤i≤l Ui = V and Ui ∩ Uj = ∅
for all i = j.
For each vertex v, the algorithm processes v as follows.
There are two cases. The first case (Lines 4-5 of Algorithm
2) is that v is not connected to any vertex in any vertex set in
Umem (the part of U that are currently in main memory). In
this case, we simply create a new vertex set U with a single
vertex v and add U to Umem .
The second case (Lines 6-10 of Algorithm 2) is that v is
connected to some vertex(es) in some existing vertex set(s) in
54
Umem. The algorithm finds the vertex set U ∈ Umem with
∈ Umem,
which v has the most connections; that is, ∀U
|nb(v) ∩ U| ≥ |nb(v) ∩ U
|. This vertex set U can be found
in O(d(v)) expected time using a hashtable. Note that we do
not keep all vertices in G in the hashtable, but only those in
a vertex set in Umem. After finding U , we add v to U . If the
memory used for U is as large as the block size B, we write
U to disk by Procedure 3.
After processing each vertex v, we also check (Lines 11-13
of Algorithm 2) if the memory used by the current partition
reaches the available memory space assigned. When this is
the case, we reclaim the memory by writing the vertex set
currently with the largest memory usage to disk.
The writing of a vertex set U to the disk as shown in
Procedure 3 proceeds as follows. When we read the vertices
in U from the disk (during the scanning of G), we read their
edges as well. Let H be the corresponding subgraph; that is,
H = (VH , EH ), where VH = U ∪{v : u ∈ U, (u, v) ∈ E} and
EH = {(u, v) : u ∈ U, (u, v) ∈ E}. We write H to disk and,
at the same time, release the memory occupied by H (also U ).
We delay the discussion of the estimation of the upper-bound
core number, ψ(v), of each v in H to Section V-B.
Before we move on to present our main algorithm for core
decomposition, we ask whether we can devise a divide-and-
conquer solution that computes core decomposition on each
subgraph obtained by the partition, and then merges the results
to find the k-cores in the original graph. We note that the
divide-and-conquer method does not work because at each
conquer phase, this approach still needs to perform search in
the merged problem space to determine the core number of
each vertex, and hence it demands as much memory as does
an in-memory algorithm.
B. Upper Bound on Core Number
Our top-down core decomposition algorithm requires the
selection of a relevant set of vertices based on the upper bound
of their core number. In this subsection, we develop heuristics
to estimate this upper bound.
We use ψ(v) to denote the upper bound on ψ(v) of a vertex
v. The following lemma gives a coarse estimation on ψ(v).
LEMMA 1: ψ(v) = d(v).
Proof: It is trivial to show that ψ(v) = d(v) because
ψ(v) ≤ d(v).
Lemma 1 serves as an initialization of ψ(v) for each v.
After the initialization, we can further refine ψ(v). The basic
idea is to refine ψ(v) based on the ψ values of v’s neighbors.
Intuitively, v’s neighbors that have ψ values lower than ψ(v)
definitely have the true core number lower than ψ(v). These
neighbors cannot contribute to the degree of v in the ψ(v)-
core and thus if v has many of such neighbors, ψ(v) can be
further tightened.
We first define a notation ψmax (X) for a non-empty vertex
set X to denote the maximum ψ value among all the vertices
in X. That is, ψmax (X) = max{ψ(u) : u ∈ X}, where
X = ∅.
The following lemma describes how to refine ψ(v).
LEMMA 2: Given a vertex v ∈ V , where the current
ψ(v) > 0, let Z contain all neighbors of v with ψ values
lower than ψ(v), that is, Z = {u : u ∈ nb(v), ψ(u) < ψ(v)}.
then ψ(v) can be refined as
If (d(v) − |Z|) < ψ(v),
max{d(v) − |Z|, ψmax (Z)}.
Proof: Based on the definition of ψmax (Z), we know
that all vertices in Z can only exist
in the k-cores for
k ≤ ψmax (Z), which means that the edges connecting v and
its neighbors in Z cannot exist in the (ψmax (Z) + 1)-core.
Therefore, if v exists in the (ψmax (Z) + 1)-core, the degree
of v is at most (d(v) − |Z|).
If (d(v) − |Z|) > ψmax (Z), we have ψ(v) ≤ (d(v) − |Z|)
because it is possible that in the (d(v) − |Z|)-core, v has a
degree of (d(v) − |Z|). Otherwise, we have (d(v) − |Z|) ≤
ψmax (Z), which means that v cannot exist in the (ψmax (Z)+
1)-core, and thus we have ψ(v) ≤ ψmax (Z). In both cases,
ψ(v) ≤ max{d(v) − |Z|, ψmax (Z)}.
Finally, for the current value of ψ(v), we have (d(v) −
|Z|) < ψ(v) (given) and ψmax (Z) < ψ(v) (by the definition
of Z). Therefore, max{d(v) − |Z|, ψmax (Z)} is a tighter
value of ψ(v), which completes the proof.
Lemma 2 can be applied iteratively to replace the current
value of ψ(v) whenever a tighter value is possible for ψ(v).
However, simply applying Lemma 2 might not be good enough
since it is likely that ψmax (Z) > (d(v) − |Z|) especially if
Z is not small. As a result, we often choose ψmax (Z) to
refine ψ(v), which weakens the effectiveness of Lemma 2 in
obtaining a good estimation of ψ(v).
To get a better estimation of ψ(v), we apply a finer
refinement based on the following corollary of Lemma 2.
COROLLARY 1: Given a vertex v ∈ V , where the current
ψ(v) > 0, let Z = {u : u ∈ nb(v), ψ(u) < ψ(v)}.
For any Z
⊆ Z, Z
= ∅, if (d(v) − |Z
|) < ψ(v), then
ψ(v) can be refined as max{d(v) − |Z
|, ψmax (Z
)}.
Proof: Similar to proof of Lemma 2.
Based on Corollary 1, we can select the subset Z
of Z that
attains the tightest estimation of ψ(v), rather than restricting
to the whole set Z for the estimation.
Procedure 4 describes our algorithm to estimate ψ(v). Lines
1-2 first initialize ψ(v) to be d(v) by Lemma 1. Then, a
refinement is applied to obtain a tighter bound on the core
number of each vertex in V (Lines 3-7), as explained in
Corollary 1.
Note that Line 5 of Procedure 4 uses the condition “d(v)−
|Z| < ψ(v)”, while Corollary 1 uses “d(v) − |Z
| < ψ(v)”.
An incorrect value of ψ(v) may be assigned if Lines 6-7 of
with (d(v) − |Z
|) ≥ ψ(v).
Procedure 4 chooses some Z
55
for each v ∈ V do
for each v ∈ V do
Procedure 4 Estimate-ψ
1.
2.
3.
4.
5.
6.
7.
8.
repeat Steps 3-7;
initialize ψ(v) as d(v);
let Z = {u : u ∈ nb(v), ψ(u) < ψ(v)};
if(d(v) − |Z| < ψ(v))
let f (X) = max{d(v) − |X|, ψmax (X)};
ψ(v) ← min{f (Z) : Z ⊆ Z, Z = ∅};
However, this Z
cannot be selected to refine ψ(v), since
Line 7 selects the subset of Z with the lowest f value and
Z with (d(v) − |Z|) < ψ(v) is obviously a better refinement.
Therefore, Lines 5-7 of Procedure 4 refines ψ(v) as stated in
Corollary 1 and the correctness of the refinement of ψ(v) is
ensured.
In each refinement process as described in Lines 3-7, we
always refine the vertices with smaller ψ values before those
with higher ones. This strategy allows the refinement to tighten
the bounds more quickly since only the neighbors with low
ψ values are useful in tightening the bounds (by Corollary
1). The smaller ψ values refined are thus particularly effective
in tightening the ψ values for other vertices in the remaining
process.
|) but the larger may be the value of ψmax (Z
To tighten the value of ψ(v), we do not need to compute
) for all subsets of Z in Line 7 of Procedure 4. The
f (Z
) in Line 6 of Procedure 4 actually implies
definition of f (Z
a dilemma: the larger the size of Z
, the smaller is the value
of (d(v)−|Z
);
implies the opposite. Moreover, for subsets
while a smaller Z
of the same size, clearly those vertices in Z that have a smaller
ψ help more in obtaining a smaller ψmax (Z
) than other
vertices. Therefore, we can sort the vertices in Z in ascending
order of their ψ values. Then, we select Z
of size from 1 to
|Z|, where the size-i Z
simply contains the first i vertices in
the ordered Z, until we find that f (Z
) reaches its minimum.
In this way, we process at most |Z| number of subsets of Z
instead of 2
subsets.
|Z|
We apply Procedure 4 to estimate ψ(v) for each v in H in
Line 2 of Procedure 3. Since H is only a small subgraph of G,
the refinement process (Lines 3-7 of Procedure 4) converges
quickly. To handle the worst-case scenario, in our algorithm,
we set a limit on the number of iterations for the refinement
process. We bound the cost of the refinement by the cost of
a disk I/O. That is, we terminate the refinement process once
the accumulated time for refinement reaches the time for a
disk I/O. Therefore, the total cost of our algorithm is always
bounded by the number of disk I/Os.
In Procedure 3, the subgraph H is associated with a vertex
set U in partition U . However, there may be some vertices in
H that are not in U but are connected to some vertices in U .
Since we may not have the degree of these vertices to initialize
their ψ values, we consider their ψ values to be ∞ when we
refine ψ(v) for v ∈ U . The following example illustrates how
we refine ψ by Procedure 4.
Example 3: Assume that the vertex set of the graph in
Figure 1 is partitioned into three sets: U1 = {a, b, c, d, e, f},
U2 = {g, h, i, j}, and U3 = {e, k, m, n}. Figure 2 shows the
three corresponding subgraphs1. The first number next to each
vertex v is the initial ψ(v) = d(v), while x → y indicates that
the ψ value is tightened from x to y by Procedure 4.
1
a
b
23
d
e
24
g
h
13
1
c
U1
2
i
f
a, b, c, d, e, f
3
g
h
5
4
d
e
l
k
f
6
4
i
U2
j
1
g, h, i, j
n
g
h
i
5
3
3
4
l
k
m
2
n
3
4
U3
k, l, m, n
Fig. 2. Vertex Set Partition and Their Subgraphs
) : Z
⊆ Z, Z
Let us consider vertex b in Figure 2(a). We have Z = {a, c}
since ψ(a) = ψ(c) = 1 < ψ(b) = 3 at this moment. It is easy
= ∅} = 1 in this case and
to see min{f (Z
thus we obtain ψ(b) = 1. Similarly for vertex d, we compute
Z = {b} since now ψ(b) = 1 < ψ(d) = 3. Thus, we obtain
ψ(d) = 2. Note that we cannot use g to compute ψ(d) since we
do not have the degree or ψ information of g at the current
stage, which will become available in a later stage of core
decomposition.
2
C. Recursive Top-Down Core Decomposition
We now discuss our algorithm EMcore, as outlined in Algo-
rithm 5. EMcore first invokes PartitionGraph (i.e., Algorithm
2) to partition G. Then, it invokes Procedure 6, EmCoreStep,
to compute the k-classes with k in the range [kl, ku], where
kl ≤ ku, recursively from higher value ranges to lower value
ones.
ku
ku
kl
In order to bound the memory usage, we determine each
kl = {v : v ∈
range [kl, ku] as follows. We first define Ψ
V, kl ≤ ψ(v) ≤ ku}; that is, Ψ
is the set of vertices whose
ψ values fall within the range [kl, ku]. At each recursive step,
we construct a subgraph that is relevant for computing the
in main memory. Let
real core number of the vertices in Ψ
b = M
B . Then, b is the maximum number of blocks that can be
held in main memory. We use b and a given ku to determine
kl as follows.
ku
Let K be a set of k values such that the vertices in Ψ
k
ku
kl
are scattered in at most b vertex sets in U . That is,
{U : U ∈ U , U ∩ Ψ
k = ∅} ≤ b
ku
.
K =
k : 1 < k ≤ ku,
We then define
kl =
min{k : k ∈ K},
ku,
if K = ∅,
otherwise.
(1)
1We use an adjacency list to store nb(v) of a vertex v. In each subgraph, we
only need to keep the adjacency list of those vertices in Ui, where 1 ≤ i ≤ 3.
Thus, an edge between two vertices in the same Ui is counted twice, and the
three subgraphs have 14, 15, and 15 edges, respectively.
56
Algorithm 5 EMcore
Input: G = (V, E), memory size M , disk block size B.
Output: The k-class, Ψk, of G for all k.
invoke PartitionGraph;
1.
2. compute kl by setting ku = ∞ and b = M
B ;
3.
invoke EmCoreStep(kl, ku);
Procedure 6 EmCoreStep(kl, ku)
1. W ← Ψku
kl ;
2.
let H be the set of subgraphs of the vertex sets in U
read H into main memory to construct
the induced subgraph GW by W ;
that contain any vertex in W ;
3.
4.
5.
6. deposit any vertex connected to a vertex in Ψk (kl ≤ k ≤ ku);
7.
invoke ComputeCore(GW , kl, ku);
refine ψ(v) for any vertex v currently in main memory;
remove all vertices in Ψk (kl ≤ k ≤ ku) and their edges
l in Ψk
k
from the subgraphs in H, and merge any of
these subgraphs if their combined size is less than B;
compute k
write the subgraphs in H that consist of no vertex in Ψk
k
to the disk, unless all the subgraphs can now be kept
in main memory;
invoke EmCoreStep(k
u = (kl − 1);
by setting k
l, k
u);
u
l
u
l
if(kl > 2)
8.
9.
10.
11.
12. else
13.
add all the remaining vertices to Ψ1 and output Ψ1;
ku
ku
Intuitively, kl = min{k : k ∈ K} means that we load
as many parts of the graph as possible into main memory
(bounded by the memory size M ) for core decomposition at
each recursive step. However, in the rare case when |K| = ∅
ku
ku are in more than b vertex sets in U ),
(i.e., the vertices in Ψ
we simply assign kl = ku and load b vertex sets in {U : U ∈
U , U ∩ Ψ
= ∅} each time into main memory to construct
the subgraph for core decomposition.
To obtain kl, we keep a heap in main memory to maintain
the highest ψ in each vertex set U ∈ U . The highest ψ among
the vertices in each U ∈ U is inserted into the heap before
U is written to disk during graph partitioning. Given b and
ku, we can easily determine the corresponding value of kl by
querying the heap.
where kl ≤ k ≤ ku, as outlined in Procedure 6.
We now discuss how EmCoreStep computes the k-classes,
ku
kl . In other words, W is the set of candidate
vertices whose exact core number is to be computed at this
recursive step. The first step is to construct
the induced
subgraph GW of G by W . We construct GW from the set
of subgraphs of the vertex sets in U that contain a vertex in
W , since all vertices in W and their inter-connection edges
are contained in these subgraphs. During the construction, we
can discard those vertices that are not in W , together with
their edges.
Let W = Ψ
After GW is constructed, we invoke ComputeCore, as
57
while (∃v ∈ W , (d(v, GW ) + deposit (v, ku)) < k)
remove v and all edges incident to v from GW ;
Procedure 7 ComputeCore(GW , kl, ku)
initialize ψ(v) to be −1 for each v ∈ W ;
1.
2. k ← kl;
3. while (k ≤ ku)
4.
5.
6.
7.
8.
9.
10.
update ψ(v) to be k for each remaining vertex v in GW ;
k ← k + 1;
for each k, where kl ≤ k ≤ ku, do
Ψk ← {v : v ∈ W, ψ(v) = k};
output Ψk;
shown in Procedure 7, to compute the k-classes Ψk from GW ,
where kl ≤ k ≤ ku. However, an immediate question raised
here is: since GW is the induced subgraph, is the core number
computed from GW correct? We will prove the correctness of
our algorithm in Theorem 1, which will also become clearer
as we continue to unwrap the details of our algorithm.
One key concept to ensure the correctness of our algorithm
is the concept of deposit assigned to each vertex (Line 6 of
Procedure 6). We define the deposit of a vertex as follows.
Definition 2 (DEPOSIT): Given a vertex v ∈ V and an inte-
ger ku, where ku > ψ(v), the deposit of v with respect to ku
is defined as deposit (v, ku) = |{u : u ∈ nb(v), ψ(u) > ku}|.
-core are removed, where k
The concept of deposit is used to ensure that the number of
edge connections to a vertex v in the k-core, where kl ≤ k ≤
ku, can be correctly counted even if the vertices and edges that
belong to all k
> ku. After we
compute the k-classes, where k ≥ kl, we can deposit a “coin”
to each neighbor of a vertex in the k-classes if the neighbor’s
core number has not been determined (i.e., the neighbor’s ψ
is less than kl and its core number will be computed in a later
recursive step). This “coin” accounts for an edge connection
to this neighbor vertex in the computation of its core number.
Using the deposit, we can safely remove all the vertices
and their edges after the core number of these vertices in the
current recursive step is computed (Line 7), without worrying
about the correctness of the core decomposition in the later
recursive steps. In this way, we can save considerable disk I/O
costs for reading and writing these large-core-number vertices
and their edges again and again in computing the k-classes
with a smaller core number.
Procedure 7 computes the correct core number of the
vertices in the induced subgraph GW as follows. The algorithm
iteratively deletes the lowest degree vertex and its edges as
in an in-memory algorithm. However, in selecting the lowest
degree vertex, the algorithm considers the sum of the local
degree in GW and the deposit of the vertex. We remark that
the initial value of the deposit of all vertices with respect to
ku = ∞ is set to 0. We prove the correctness of Procedure 7
in Lemma 3.
After we compute Ψk, where kl ≤ k ≤ ku, we return to
Lines 5-6 of Procedure 6. We need to refine the upper bound
on the core number of remaining vertices as well as updating
their deposit. We first refine ψ(v) for any vertex v that is
currently in main memory but does not have its core number
computed. The refinement of ψ(v) involves two steps: (1) We
first set ψ(v) to be (kl − 1) if currently ψ(v) ≥ kl. This
is because if ψ(v) ≥ kl, then v must be processed in the
current recursive step. However, since v’s core number is not
determined, v is not in any Ψk, where kl ≤ k ≤ ku, which
implies that ψ(v) ≤ (kl − 1). (2) We then refine all v by
Procedure 4, since now we have the exact core number of
more vertices computed, which can be used to tighten the ψ
value of the remaining vertices.
After the refinement, we have ψ(v) < kl ≤ ψ(u) for all
v whose core number has not been determined and any u in
Ψk, where k ≥ kl. Then, Line 6 updates the deposit of a
u = (kl − 1). The deposit (v, kl − 1)
vertex v with respect to k
can be calculated as (deposit (v, ku) + |{u : u ∈ nb(v), u ∈
kl≤k≤ku Ψk}|), where the first part is the number of edge
connections to v that were removed at previous recursive steps,
while the second part is the number of edge connections to v
that are to be removed at the current recursive step. Therefore,
deposit (v, kl − 1) ensures the correct computation of ψ(v) at
later recursive steps.
After the update of the deposit, we can safely remove all
vertices in Ψk, where kl ≤ k ≤ ku, and their edges from the
subgraphs in H (Line 7 of Procedure 6). Then, we merge any
of these subgraphs if their combined size is less than B, so
as to further reduce the disk I/O cost. Furthermore, we keep
those subgraphs that will be used in the next recursive step in
main memory and only write the rest to disk, or keep all the
subgraphs in main memory if they are now small enough to
be all kept in main memory (Line 10 of Procedure 6).
The recursive step goes on to process the next range of k,
l, kl− 1], until kl = 2 (Lines 8-11 of Procedure 6). Note that
[k
the upper bound of the range, k
u, for the next recursive step
now changes to (kl − 1). This is because, after the refinement
in Line 5 in the current recursive step, we have ψ(v) ≤ (kl−1)
for any vertex v whose core number has not been determined.
When kl = 2, the remaining vertices must be all in the 1-class,
Ψ1 (Lines 12-13).
There is still a small problem: are there any vertices in the
0-class? The case of k = 0, i.e., the 0-class, means that no
vertex in this class is connected to any other vertices. Thus,
the 0-class contains only isolated vertices in the graph, which
can be obtained trivially by one scan of the graph. To do this,
we add a check in Algorithm 2: as we scan the input graph,
if a vertex has a degree of 0, we output it as a 0-class vertex.
The following example explains how the top-down core
decomposition is processed.
Example 4: Let us consider the three subgraphs given in
Figure 2 and suppose b = 2. Then, we obtain kl = 3 and W =
∞
3 = {g, h, i, k, l, n}. We load the subgraphs in Figures 2(b)
Ψ
and 2(c) into main memory to construct the induced subgraph
by W , GW , which is shown in Figure 3(a), where the number
next to each vertex v is the current ψ(v). It is then easy to
obtain the 3-class, {h, i, k, l, n}, by Procedure 7.
3
3
3
3
4
4
2
2
2
2
2
Fig. 3.
Induced Subgraph GW
After computing the 3-class, we can refine ψ(g) from 3 to
2. We also update the deposit of any vertex that is connected
u = (kl − 1) = 2, as shown in
to a vertex in the 3-class for k
the following table.
TABLE II
deposit (v, 2) WHERE v ∈ {e, f, g, j, m}
j m
v
deposit (v, 2)
1
2
f
1
e
2
g
2
Next we compute for k
Now we can remove all the edges connected to any vertex
in {h, i, k, l, n}. After removing the edges, we only have 1
edge (d, g) and 3 isolated vertices e, f and j in Figure 2(b),
and only 2 isolated vertices g and m in Figure 2(c).
u = 2 and we load the subgraph in
Figure 2(a) (and also Figure 2(c) which is still in main mem-
ory) to construct the induced subgraph by W = {d, e, f, g, m},
as shown in Figure 3(b). By using the deposit, it is easy to
obtain the 2-class, {d, e, f, g, m}, by Procedure 7.
{a, b, c, j}, belong to the 1-class.
the remaining vertices,
2
After computing the 2-class, all
D. Analysis of Algorithm
In this subsection, we provide the theoretical analysis re-
garding the correctness as well as the complexity of our
algorithm EMcore.
ku
LEMMA 3: Given W = Ψ
kl and GW , for any v ∈ W ,
if ψ(v) ≥ kl, then ComputeCore (i.e., Procedure 7) correctly
computes ψ(v).
Proof: If ψ(v) ≥ kl, to compute ψ(v) we only need the
kl-core of G, Ckl. In GW , the largest core number that a vertex
can have is ku. Thus, in computing ψ(v) using GW instead
of Ckl, we are missing the edge set {(u, v) : v ∈ W, ψ(u) >
ku, (u, v) ∈ E}, where kl ≤ ψ(v) ≤ ku.
Since our algorithm follows a top-down approach, ∀v ∈ W ,
if ∃u ∈ nb(v), ψ(u) > ku, then ψ(u) must have already been
computed in some previous recursive step and deposited to
deposit (v, ku). Thus, we have (d(v, GW ) + deposit (v, ku)) =
d(v, Ckl ) for all v ∈ W . Since ComputeCore removes a vertex
v only if (d(v, GW ) + deposit (v, ku)) < k, or equivalently,
d(v, Ckl ) < k, it correctly computes ψ(v) for all v ∈ W ,
ψ(v) ≥ kl.
Using Lemma 3, we prove the correctness of our main
algorithm EMcore.
58