Heterogeneous Network Representation Learning:
Survey, Benchmark, Evaluation, and Beyond
Carl Yang*
University of Illinois, Urbana-Champaign
201 N Goodwin Ave
Urbana, IL, USA
jiyang3@illinois.edu

Yuxin Xiao*
University of Illinois, Urbana-Champaign
201 N Goodwin Ave
Urbana, IL, USA
yuxinx2@illinois.edu

Yu Zhang*
University of Illinois, Urbana-Champaign
201 N Goodwin Ave
Urbana, IL, USA
yuz9@illinois.edu
Yizhou Sun
University of California, Los Angeles
404 Westwood Plaza
Los Angeles, CA, USA
yzsun@cs.ucla.edu

Jiawei Han
University of Illinois, Urbana-Champaign
201 N Goodwin Ave
Urbana, IL, USA
hanj@illinois.edu
ABSTRACT
Since real-world objects and their interactions are often multi-
modal and multi-typed, heterogeneous networks have been
widely used as a more powerful, realistic, and generic super-
class of traditional homogeneous networks (graphs). Mean-
while, representation learning (a.k.a. embedding) has re-
cently been intensively studied and shown effective for var-
ious network mining and analytical tasks. Since there has
already been a broad body of heterogeneous network embed-
ding (HNE) algorithms but no dedicated survey, as the first
contribution of this work, we pioneer in providing a uni-
fied paradigm for the systematic categorization and anal-
ysis over the merits of various existing HNE algorithms.
Moreover, existing HNE algorithms, though mostly claimed to be generic, are often evaluated on different datasets. While understandable given the natural application flavor of HNE, such indirect comparisons largely hinder the proper attribution of improved task performance to effective data preprocessing versus novel technical design, especially considering the various possible ways to construct a heterogeneous network from real-world application data. Therefore, as the second contribution, we create four benchmark datasets with various properties regarding scale, structure, attribute/label availability, etc., from different sources, towards the comprehensive evaluation of HNE algorithms. As the third
contribution, we carefully refactor and amend the implementations of ten popular HNE algorithms, create friendly interfaces for them, and provide all-around comparisons among them over multiple tasks and experimental settings.
∗The first three authors contributed equally.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy
NonCommercial-NoDerivatives 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For
any use beyond those covered by this license, obtain permission by emailing
info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
By putting all existing HNE algorithms under a general
and complete paradigm, we aim to provide a universal refer-
ence and guideline for the understanding and development
of HNE algorithms. Meanwhile, by open-sourcing all data
and code, we envision to serve the community with an easy-
to-use benchmark platform to test and compare the perfor-
mance of existing and future HNE algorithms.1
PVLDB Reference Format:
Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, Jiawei Han. Het-
erogeneous Network Representation Learning: Survey, Bench-
mark, Evaluation, and Beyond. PVLDB, 13(xxx): xxxx-yyyy,
2020.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
Keywords
heterogeneous network embedding, survey, benchmark
1. INTRODUCTION
Networks and graphs constitute a canonical and ubiqui-
tous paradigm for the modeling of interactive objects, which
has drawn significant research attention from various scien-
tific domains [59, 30, 24, 3, 89, 87]. However, real-world ob-
jects and interactions are often multi-modal and multi-typed
(e.g., authors, papers, venues and terms in a publication net-
work [69, 65]; users, places, categories and GPS-coordinates
in a location-based social network [101, 91, 94]; and genes,
proteins, diseases and species in a biomedical network [38,
14]). To capture and exploit such node and link heterogene-
ity, heterogeneous networks have been proposed and widely
used in many real-world network mining scenarios, such as
meta-path based similarity search [70, 64, 92], node classifi-
cation and clustering [18, 20, 11], knowledge base completion
[68, 48, 103], and recommendations [23, 106, 31].
In the meantime, current research on graphs has largely
focused on representation learning (embedding), especially
following the pioneering neural network based algorithms that demonstrate compelling empirical evidence of unprecedentedly effective yet efficient graph mining [25, 4, 13].
They aim to convert graph data (e.g., nodes [49, 72, 26, 77, 37, 28, 9, 75], links [107, 1, 50, 96], and subgraphs [47, 93, 97, 45]) into low dimensional distributed vectors in the embedding space where the graph topological information (e.g., higher-order proximity [5, 76, 105, 34] and structure [55, 102, 42, 17]) is preserved. Such embedding vectors are then directly usable by various downstream machine learning algorithms [58, 39, 10].

Right on the intersection of heterogeneous networks and graph embedding, heterogeneous network embedding (HNE) has recently also received much research attention [8, 85, 108, 16, 66, 67, 27, 22, 90, 35, 104, 57, 52, 99, 7, 98, 32, 83, 95, 82, 41]. Due to the application flavor of HNE, many algorithms have been separately developed in different application domains such as search and recommendations [23, 63, 6, 89]. Moreover, as knowledge bases (KBs) also fall under the general umbrella of heterogeneous networks, many KB embedding algorithms can be compared with the HNE ones [81, 3, 40, 68, 88, 15, 48, 79, 60].

Unfortunately, various HNE algorithms have been developed in quite disparate communities across academia and industry, and they have never been systematically and comprehensively analyzed, either conceptually or experimentally. In fact, due to the lack of benchmark platforms (with ready-to-use datasets and baselines), researchers often tend to construct their own datasets and reimplement a few of the most popular (sometimes outdated) algorithms for comparison, which renders fair performance evaluation and clear improvement attribution extremely hard, if not impossible.

Consider the toy examples of a publication dataset in Figure 1.² Earlier HNE algorithms like metapath2vec [16] were developed on the heterogeneous network with node types of authors, papers and venues as in (a). However, one can enrich papers with a large number of terms and topics as additional nodes as in (b), which makes the random-walk based shallow embedding algorithms rather ineffective, but favors neighborhood aggregation based deep graph neural networks like R-GCN [57]. Moreover, one can further include node attributes like term embeddings and labels like research fields, and make them available only to the semi-supervised inductive learning algorithms, which may introduce even more bias [104, 82, 33, 54]. Eventually, it is often hard to clearly attribute performance gains between technical novelty and data tweaking.

In this work, we first formulate a unified yet flexible mathematical paradigm that generalizes all HNE algorithms, easing the understanding of the critical merits of each model (Section 2). Particularly, based on a uniform taxonomy that clearly categorizes and summarizes the existing models
(and likely future models), we propose a generic objective
function of network smoothness, and reformulate all existing
models into this uniform paradigm while highlighting their
individual novel contributions (Section 3). We envision this
paradigm to be helpful in guiding the development of future
novel HNE algorithms, and in the meantime facilitate their
conceptual comparison against existing ones.
As the second contribution, we deliberately prepare four
benchmark heterogeneous network datasets through exhaus-
tive data collection, cleaning, analysis and curation (Section
4). The datasets we come up with cover a wide spectrum
of application domains (i.e., publication, recommendation,
knowledge base, and biomedicine), which have various prop-
erties regarding scale, structure, attribute/label availability,
etc. This diverse set of data, together with a series of differ-
ent network mining tasks and evaluation metrics, constitute
a systematic and comprehensive benchmark resource for fu-
ture HNE algorithms.
As the third contribution, we address the practical obstacle that many existing HNE algorithms (including some very popular ones) either do not have a flexible implementation (e.g., hard-coded node and edge types, fixed sets of meta-paths, etc.) or do not scale to larger networks (e.g., high memory requirements during training), which adds much burden to novel research (i.e., requiring much engineering effort for correct reimplementation). To this end, we select 10 popular HNE algorithms, carefully refactor and scale up the original authors' implementations, and add interfaces for plug-in input of our prepared datasets (Section 5). Based on these easy-
to-use and efficient implementations, we then conduct all-
around empirical evaluations of the algorithms, and report
their benchmark performances. The empirical results, while
providing much insight into the merits of different models
that are consistent with the conceptual analysis in Section
3, also serve as the example utilization of our benchmark
platform that can be followed by future studies on HNE.
Note that, although there have been several attempts to
survey or benchmark heterogeneous network models [69, 65,
81] and homogeneous graph embedding [25, 4, 13, 86, 19],
we make the first research effort that dedicatedly lands on
the intersection. We advocate that our elaborated study
focusing on HNE is unique, timely and necessary. Firstly,
as we will cover in this work, there has been a significant
amount of research on the particular problem of HNE, especially in recent years, and most of it is scattered across different domains, lacking proper connections and comparisons. Secondly, none of the existing surveys has
proposed a uniform mathematically complete paradigm for
conceptual analysis of all HNE models. Thirdly, existing
surveys mostly do not provide systematic and comprehen-
sive benchmark evaluation results, nor do they come with
benchmark datasets and open-source baselines to facilitate
future algorithm development.
The rest of this paper is organized as follows. Section 2
first introduces our proposed generic HNE paradigm. Sub-
sequently, representative models in our survey are conceptu-
ally categorized and analyzed in Section 3. We then present
in Section 4 our prepared benchmark datasets with insight-
ful analysis. In Section 5, we provide systematic and com-
prehensive empirical study over 10 popular HNE algorithms
to benchmark the current state-of-the-art of HNE. Section
6 concludes the paper with visions towards future usage of
our platform and research on HNE.
Figure 1: Different heterogeneous networks constructed from the same real-world application data. (a) Plain heterogeneous network; (b) labeled, attributed network.

¹https://github.com/yangji9181/HNE
²https://dblp.uni-trier.de/
2. GENERIC PARADIGM
2.1 Problem Definitions
We first give formal definitions of concepts closely related
to HNE, starting from network embedding.
Definition 2.1. Network embedding. For a given network $G = \{V, E\}$, where $V$ is the set of nodes (vertices) and $E$ is the set of links (edges), a network embedding is a mapping function $\Phi: V \rightarrow \mathbb{R}^{|V| \times d}$, where $d \ll |V|$. This mapping $\Phi$ defines the latent representation (a.k.a. embedding) of each node $v \in V$, which captures network topological information in $E$.
In most cases, network proximity is the major topolog-
ical information to be captured. For example, DeepWalk
[49] captures the random-walk based node proximity and
illustrates the 2-dim node representations learned on the fa-
mous Zachary’s Karate network of small groups, where a
clear correspondence between the node position in the in-
put graph and learned embedding space can be observed.
Various follow-up works have improved or extended Deep-
Walk, while a complete coverage of them is beyond the scope
of this work. In this work, we focus on the embedding of
heterogeneous networks.
Definition 2.2. Heterogeneous network. A heterogeneous
network H = {V, E, φ, ψ} is a network with multiple types
of nodes and links. Particularly, within H, each node vi ∈ V
is associated with a node type φ(vi), and each link eij ∈ E
is associated with a link type ψ(eij). It is worth noting that
the type of a link eij automatically defines the types of nodes
vi and vj on its two ends.
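To ground Def 2.2, below is a minimal, hypothetical Python sketch (ours, not taken from any surveyed system) of a heterogeneous network container that stores a node type $\phi(v)$ for each node and a link type $\psi(e)$ for each link, populated with a toy publication network in the spirit of Figure 1(a):

```python
from collections import defaultdict

# Hypothetical container for a heterogeneous network H = {V, E, phi, psi}.
class HeterogeneousNetwork:
    def __init__(self):
        self.node_type = {}                 # phi: node -> node type
        self.neighbors = defaultdict(list)  # node -> list of (neighbor, link type)

    def add_node(self, v, v_type):
        self.node_type[v] = v_type

    def add_edge(self, u, v, link_type):
        # psi(e_uv) = link_type; the link type fixes the types of both endpoints
        self.neighbors[u].append((v, link_type))
        self.neighbors[v].append((u, link_type))

# Toy DBLP-style example: authors write papers, papers appear in venues.
H = HeterogeneousNetwork()
for node, t in [("a1", "author"), ("a2", "author"),
                ("p1", "paper"), ("p2", "paper"), ("v1", "venue")]:
    H.add_node(node, t)
for u, v, l in [("a1", "p1", "writes"), ("a2", "p1", "writes"),
                ("a2", "p2", "writes"), ("p1", "v1", "published_in"),
                ("p2", "v1", "published_in")]:
    H.add_edge(u, v, l)
```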
Heterogeneous networks have been intensively studied due
to their power of accommodating multi-modal, multi-typed in-
terconnected data. Besides the classic example of DBLP
data used in most existing works as well as Figure 1, con-
sider a different yet illustrative example from NYTimes in
Figure 2.3 Nodes in this heterogeneous network include news
articles, categories, phrases, locations, and datetimes. To il-
lustrate the power of heterogeneous networks, we introduce
the concept of meta-path, which has been leveraged by most
existing works on heterogeneous network modeling [69, 65].
Definition 2.3. Meta-path. A meta-path is a path defined on the network schema, denoted in the form of $o_1 \xrightarrow{l_1} o_2 \xrightarrow{l_2} \cdots \xrightarrow{l_m} o_{m+1}$, where $o_i$ and $l_i$ are node types and link types, respectively.
Each meta-path captures the proximity among the nodes on its two ends from a particular semantic perspective. Continuing with our example of the heterogeneous network from news data in Figure 2, the meta-path article $\xrightarrow{\text{belongs to}}$ category $\xrightarrow{\text{includes}}$ article carries different semantics from article $\xrightarrow{\text{mentions}}$ location $\xrightarrow{\text{mentioned by}}$ article. Thus, leveraging different meta-paths allows heterogeneous network models to compute multi-modal, multi-typed node proximities and relations, which has been shown beneficial in many real-world network mining applications [64, 36, 31].
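To illustrate how such typed semantics can be operationalized, here is a hedged sketch (reusing the hypothetical HeterogeneousNetwork container from above) that enumerates the instances of a given meta-path by filtering neighbors on both the link type and the target node type at every step:

```python
# Hypothetical helper: enumerate all instances of a meta-path starting from
# `start`; the meta-path is a list of (link type, next node type) pairs.
def metapath_instances(H, start, metapath):
    paths = [[start]]
    for link_type, node_type in metapath:
        paths = [path + [v]
                 for path in paths
                 for (v, l) in H.neighbors[path[-1]]
                 if l == link_type and H.node_type[v] == node_type]
    return paths

# e.g., all author -writes-> paper -published_in-> venue instances from a2
print(metapath_instances(H, "a2", [("writes", "paper"), ("published_in", "venue")]))
```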
Now we define the main problem of focus in this work,
heterogeneous network embedding (HNE), which lies in the
intersection between Def 2.1 and Def 2.2.
3https://www.nytimes.com/
Figure 2: Toy example of a heterogeneous network
constructed from the news data.
Definition 2.4. Heterogeneous network embedding. For a given heterogeneous network $H$, a heterogeneous network embedding is a set of mapping functions $\{\Phi_k: V_k \rightarrow \mathbb{R}^{|V_k| \times d}\}_{k=1}^{K}$, where $K$ is the number of node types, $\forall v_i \in V_k, \phi(v_i) = k$, and $d \ll |V|$. Each mapping $\Phi_k$ defines the latent representation (a.k.a. embedding) of all nodes of type $k$, which captures the network topological information regarding the heterogeneous links in $E$.
Compared with homogeneous networks, the definition of
topological information in heterogeneous networks is even
more diverse. As we will show in Section 3, the major
distinctions among different HNE algorithms mostly lie in
their different ways of capturing such topological informa-
tion. Particularly, the leverage of meta-paths as in Def 2.3
often plays an essential role, since many popular HNE al-
gorithms exactly aim to model the different proximity indi-
cated by meta-paths [16, 22, 83, 7, 99, 82, 104, 63].
2.2 Proposed Paradigm
In this work, we stress that one of the most important
principles underlying HNE (as well as most other scenarios
of network modeling and mining) is homophily [43]. Partic-
ularly, in the network embedding setting, homophily can be
translated as ‘nodes close on a network should have similar
embeddings’, which matches the requirement of Def 2.1. In
fact, we further find intrinsic connections between the well-
perceived homophily principle and widely-used smoothness
enforcement technique on networks, which leads to a generic
mathematical paradigm covering most existing and likely
many future HNE algorithms.
Based on earlier well-established concepts underlying network modeling and embedding learning [73, 56, 2, 110, 111], we introduce the following key objective function of network smoothness enforcement:
$$J = \sum_{u,v \in V} w_{uv}\, d(\mathbf{e}_u, \mathbf{e}_v) + \mathcal{J}_R, \quad (1)$$
where $\mathbf{e}_u = \Phi(u)$ and $\mathbf{e}_v = \Phi(v)$ are the node embedding vectors to be learned, $w_{uv}$ is the proximity weight, $d(\cdot,\cdot)$ is the embedding distance function, and $\mathcal{J}_R$ denotes possible additional objectives such as regularizers; all three of these can be defined and implemented differently by particular HNE algorithms.
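As a concrete illustration, the following numpy sketch instantiates Eq. (1) with the squared Euclidean distance as $d(\cdot,\cdot)$ and a simple L2 regularizer as $\mathcal{J}_R$; both are illustrative choices of ours, since each HNE algorithm defines its own:

```python
import numpy as np

# Illustrative instantiation of Eq. (1): J = sum_{u,v} w_uv * d(e_u, e_v) + J_R.
def smoothness_objective(E, W, reg=0.01):
    """E: (n, d) embedding matrix; W: (n, n) nonnegative proximity weights."""
    sq_norms = (E ** 2).sum(axis=1)
    # D[u, v] = ||e_u||^2 + ||e_v||^2 - 2 e_u^T e_v = ||e_u - e_v||^2
    D = sq_norms[:, None] + sq_norms[None, :] - 2 * E @ E.T
    J_R = reg * (E ** 2).sum()  # one possible additional objective (regularizer)
    return (W * D).sum() + J_R

rng = np.random.default_rng(0)
print(smoothness_objective(rng.random((5, 2)), rng.random((5, 5))))
```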
3. ALGORITHM TAXONOMY
In this section, we find a universal taxonomy for existing
HNE algorithms with three categories based on their com-
mon objectives, and elaborate in detail how they all fit into
our paradigm of Eq. (1). The main challenge of instantiat-
ing Eq. (1) on heterogeneous networks is the consideration of
complex interactions regarding multi-typed links and higher-
order meta-paths. In fact, our Eq. (1) also readily general-
izes to homogeneous networks, though that is beyond the
scope of this work.
3.1 Proximity-Preserving Methods
As mentioned above, one basic goal of network embed-
ding is to capture network topological information. This
can be achieved by preserving different types of proximity
among nodes. There are two major categories of proximity-
preserving methods in HNE: random walk-based approaches
(inspired by DeepWalk [49]) and first/second-order proximity-
based ones (inspired by LINE [72]).
3.1.1 Random Walk-Based
metapath2vec [16]. Following homogeneous network embedding [49, 26], metapath2vec utilizes the node paths traversed by meta-path guided random walks to model the context of a node regarding heterogeneous semantics. Formally, given a meta-path $\mathcal{M} = o_1 \xrightarrow{l_1} o_2 \xrightarrow{l_2} \cdots \xrightarrow{l_{m-1}} o_m$, the transition probability at step $i$ is defined as
$$p(v_{i+1} \mid v_i, \mathcal{M}) = \begin{cases} \frac{1}{|N_{l_i}(v_i)|} & \phi(v_{i+1}) = o_{i+1},\ \psi(v_i, v_{i+1}) = l_i \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
where $N_l(v) = \{u \mid \psi(u, v) = l\}$ denotes the neighbors of $v$ associated with edge type $l$. Assume $\mathcal{P} = \{\mathcal{P}_1, \ldots, \mathcal{P}_M\}$ is the set of generated random walk sequences. The objective of metapath2vec is
$$J = \sum_{v \in V} \sum_{u \in C(v)} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}, \quad (3)$$
where $C(v)$ is the set of contexts (i.e., skip-grams) of $v$ in $\mathcal{P}$. For example, if $\mathcal{P}_1 = v_1 v_2 v_3 v_4 v_5 \ldots$ and the context window size is 2, then $\{v_1, v_2, v_4, v_5\} \subseteq C(v_3)$. Let $w_{uv}$ be the number of times that $u \in C(v)$ in $\mathcal{P}$. Eq. (3) can be rewritten as
$$J = \sum_{u,v \in V} w_{uv} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}.$$
The denominator in this objective requires summing over all nodes and is computationally expensive. In actual computation, it is approximated using negative sampling [44]. In this paper, we still analyze the original objective.
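For intuition, the meta-path guided walk of Eq. (2) can be sketched as follows, reusing the hypothetical network container from Section 2.1; this is our illustration, not the released metapath2vec code:

```python
import random

# At step i the walker moves uniformly over neighbors reached via link type
# l_i whose node type is o_{i+1}; it stops early if no such neighbor exists.
def metapath_walk(H, start, metapath, walk_length):
    walk, v = [start], start
    for i in range(walk_length):
        l, o = metapath[i % len(metapath)]  # cycle through the meta-path
        candidates = [u for (u, lt) in H.neighbors[v]
                      if lt == l and H.node_type[u] == o]
        if not candidates:
            break                           # transition probability 0 otherwise
        v = random.choice(candidates)       # each candidate has prob. 1/|N_{l_i}(v_i)|
        walk.append(v)
    return walk
```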
HIN2Vec [22]. HIN2Vec considers the probability that there is a meta-path $\mathcal{M}$ between nodes $u$ and $v$. Specifically,
$$p(\mathcal{M} \mid u, v) = \sigma\left(\mathbf{1}^T\left(W_X^T \vec{u} \odot W_Y^T \vec{v} \odot f_{01}(W_R^T \vec{m})\right)\right),$$
where $\mathbf{1}$ is an all-ones vector, $\odot$ is the Hadamard product, and $f_{01}$ is a normalization function. Here $\mathbf{e}_u = W_X^T \vec{u}$, $\mathbf{e}_v = W_Y^T \vec{v}$ and $\mathbf{e}_{\mathcal{M}} = f_{01}(W_R^T \vec{m})$ can be viewed as the embeddings of $u$, $v$ and $\mathcal{M}$, respectively. Let $A_{\mathcal{M}} = \text{diag}(e_{\mathcal{M}1}, \ldots, e_{\mathcal{M}d})$. We have
$$p(\mathcal{M} \mid u, v) = \sigma\left(\mathbf{1}^T\left(\mathbf{e}_u \odot \mathbf{e}_v \odot \mathbf{e}_{\mathcal{M}}\right)\right) = \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v).$$
$\sigma$ is the sigmoid function, so we have
$$1 - p(\mathcal{M} \mid u, v) = 1 - \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v) = \sigma(-\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v).$$
HIN2Vec generates positive tuples $(u, v, \mathcal{M})$ (i.e., $u$ connects with $v$ via meta-path $\mathcal{M}$) using homogeneous random walks [49] regardless of node/link types. For each positive tuple $(u, v, \mathcal{M})$, it generates several negative tuples by replacing $u$ with a random node $u'$. Its objective is
$$J_0 = \sum_{(u,v,\mathcal{M})} \log p(\mathcal{M} \mid u, v) + \sum_{(u',v,\mathcal{M})} \log(1 - p(\mathcal{M} \mid u', v)) = \sum_{(u,v,\mathcal{M})} \log \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v) + \sum_{(u',v,\mathcal{M})} \log \sigma(-\mathbf{e}_{u'}^T A_{\mathcal{M}} \mathbf{e}_v).$$
This is actually the negative sampling approximation of the following objective:
$$J = \sum_{\mathcal{M}} \sum_{u,v \in V} w_{uv}^{\mathcal{M}} \log \frac{\exp(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T A_{\mathcal{M}} \mathbf{e}_v)},$$
where $w_{uv}^{\mathcal{M}}$ is the number of path instances between $u$ and $v$ following meta-path $\mathcal{M}$.
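The identity $\sigma(\mathbf{1}^T(\mathbf{e}_u \odot \mathbf{e}_v \odot \mathbf{e}_{\mathcal{M}})) = \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v)$ is easy to verify numerically; the snippet below (with illustrative names of ours, not from the HIN2Vec release) does exactly that:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
e_u, e_v, e_M = rng.random(8), rng.random(8), rng.random(8)
lhs = sigmoid((e_u * e_v * e_M).sum())   # Hadamard-product form
rhs = sigmoid(e_u @ np.diag(e_M) @ e_v)  # bilinear form with A_M = diag(e_M)
assert np.isclose(lhs, rhs)
```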
SHNE [99]. SHNE improves metapath2vec by incorporat-
ing additional node information. It assumes that some types
of nodes may have additional text information (denoted as
xv). It proposes a deep encoder fENC based on GRU [12].
Namely, if v has text information, ev = fENC(xv); other-
wise, ev represents typical learnable node embeddings. The
objective of SHNE is the same as that of metapath2vec, and
it is straightforward to extend it to incorporate other types
of node information like categorical attributes, images, etc.
by leveraging domain-specific deep encoders, which is also
considered in HNE [8].
Other random walk-based approaches are summarized in
Table 1. To be specific, MRWNN [85] incorporates con-
tent priors into DeepWalk embedding; HHNE [83] extends
metapath2vec to the hyperbolic space; GHE [11] proposes a
semi-supervised meta-path weighting technique; MNE [100]
conducts random walks separately for each view in a multi-
view network; JUST [35] proposes Jump/Stay random walks
that do not rely on pre-defined meta-paths.
3.1.2 First/Second-Order Proximity-Based
PTE [71]. PTE proposes to decompose a heterogeneous network into multiple bipartite networks, each of which describes one edge type. Its objective is the sum of log-likelihoods over all bipartite networks:
$$J = \sum_{l \in T_E} \sum_{u,v \in V} w^l_{uv} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)} = \sum_{u,v \in V} w_{uv} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}.$$
Here $T_E$ is the set of edge types; $w^l_{uv}$ is the type-$l$ edge weight of $(u, v)$ (if there is no edge between $u$ and $v$ with type $l$, then $w^l_{uv} = 0$); $w_{uv} = \sum_l w^l_{uv}$ is the total edge weight between $u$ and $v$.
AspEm [66]. AspEm assumes that each heterogeneous network has multiple aspects, and each aspect is defined as a subgraph of the network schema [70]. An incompatibility measure is proposed to select appropriate aspects for embedding learning. Given an aspect $a$, its objective is
$$J = \sum_{l \in T_E^a} \frac{1}{Z_l} \sum_{u,v \in V} w^l_{uv} \log \frac{\exp(\mathbf{e}_{u,a}^T \mathbf{e}_{v,a})}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u',a}^T \mathbf{e}_{v,a})},$$
where $T_E^a$ is the set of edge types in aspect $a$; $Z_l = \sum_{u,v} w^l_{uv}$ is a normalization factor; $\mathbf{e}_{u,a}$ is the aspect-specific embedding of $u$.
HEER [67]. HEER extends PTE by considering typed closeness. Specifically, each edge type $l$ has an embedding $\boldsymbol{\mu}_l$, and its objective is
$$J = \sum_{l \in T_E} \sum_{u,v \in V} w^l_{uv} \log \frac{\exp(\boldsymbol{\mu}_l^T \mathbf{g}_{uv})}{\sum_{(u',v') \in P_l(u,v)} \exp(\boldsymbol{\mu}_l^T \mathbf{g}_{u'v'})},$$
where $\mathbf{g}_{uv}$ is the edge embedding of $(u, v)$; $P_l(u, v) = \{(u', v) \mid \psi(u', v) = l\} \cup \{(u, v') \mid \psi(u, v') = l\}$. In [67], $\mathbf{g}_{uv}$ has different definitions for directed and undirected edges based on the Hadamard product. To simplify our discussion, we assume $\mathbf{g}_{uv} = \mathbf{e}_u \odot \mathbf{e}_v$. Let $A_l = \text{diag}(\mu_{l1}, \ldots, \mu_{ld})$. We have $\boldsymbol{\mu}_l^T \mathbf{g}_{uv} = \mathbf{e}_u^T A_l \mathbf{e}_v$, and
$$J = \sum_{l \in T_E} \sum_{u,v \in V} w^l_{uv} \log \frac{\exp(\mathbf{e}_u^T A_l \mathbf{e}_v)}{\sum_{(u',v') \in P_l(u,v)} \exp(\mathbf{e}_{u'}^T A_l \mathbf{e}_{v'})}.$$
Other first/second-order proximity-based approaches are summarized in Table 1. To be specific, Chang et al. [8] introduce a node type-aware content encoder; CMF [108] performs joint matrix factorization over the decomposed bipartite networks; HEBE [27] preserves proximity regarding each meta-graph; Phine [94] combines additional regularization towards semi-supervised training; MVE [53] proposes an attention-based framework to consider multiple views of the same nodes.
There are also studies adopting other forms of objectives
to characterize first/second-order proximities. For exam-
ple, SHINE [78] uses the reconstruction loss of autoencoders inspired by SDNE [76]; DRL [52] proposes to learn one type
of edges at each step and uses deep reinforcement learning
approaches to select the edge type for the next step; HINSE
[90] adopts spectral embedding based on adjacency matrices
with different meta-graphs; HeGAN [32] proposes an adver-
sarial framework with a relation type-aware discriminator.
Based on the above discussions, the objective of most proximity-preserving methods can be unified as
$$\max J = \sum_{u,v} w_{uv} \log \frac{\exp(s(u,v))}{\sum_{u'} \exp(s(u',v))} + \mathcal{J}_{R_0} = \sum_{u,v} w_{uv}\, s(u,v) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0}. \quad (4)$$
Here, $w_{uv}$ can be specific to a meta-path $\mathcal{M}$ or an edge type $l$ (denoted as $w^{\mathcal{M}}_{uv}$ or $w^l_{uv}$ accordingly); $\mathcal{J}_{R_0}$ is an algorithm-specific regularizer (summarized in Table 1); $s(u,v)$ is the proximity function between $u$ and $v$. Note that in most cases, we can write $s(u,v)$ as $f(\mathbf{e}_u)^T f(\mathbf{e}_v)$. For example, $f(\mathbf{e}_u) = \mathbf{e}_u$ in metapath2vec, PTE, etc.; $f(\mathbf{e}_u) = \sqrt{A_{\mathcal{M}}}\,\mathbf{e}_u$ in HIN2Vec; $f(\mathbf{e}_u) = \sqrt{A_l}\,\mathbf{e}_u$ in HEER. In these cases,
$$J = \sum_{u,v} w_{uv} f(\mathbf{e}_u)^T f(\mathbf{e}_v) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0} = \sum_{u,v} \frac{w_{uv}}{2}\left(\|f(\mathbf{e}_u)\|^2 + \|f(\mathbf{e}_v)\|^2 - \|f(\mathbf{e}_u) - f(\mathbf{e}_v)\|^2\right) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0}.$$
The second step holds because $\|\mathbf{x}-\mathbf{y}\|^2 = (\mathbf{x}-\mathbf{y})^T(\mathbf{x}-\mathbf{y}) = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\mathbf{x}^T\mathbf{y}$. Therefore, our goal is equivalent to
$$\min -J = \underbrace{\sum_{u,v} \frac{w_{uv}}{2} \|f(\mathbf{e}_u) - f(\mathbf{e}_v)\|^2}_{d(\mathbf{e}_u, \mathbf{e}_v)} \underbrace{- \sum_{u,v} \frac{w_{uv}}{2}\left(\|f(\mathbf{e}_u)\|^2 + \|f(\mathbf{e}_v)\|^2\right)}_{\mathcal{J}_{R_1}} + \underbrace{\sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v))}_{\mathcal{J}_{R_2}} - \mathcal{J}_{R_0}. \quad (5)$$
Here $\mathcal{J}_{R_1}$ and $\mathcal{J}_{R_2}$ are two regularizers. Without $\mathcal{J}_{R_1}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ can be minimized by letting $\|f(\mathbf{e}_v)\| \rightarrow 0$; without $\mathcal{J}_{R_2}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ can be minimized by letting $\mathbf{e}_v \equiv \mathbf{c}$ ($\forall v \in V$). $\mathcal{J}_R$ in Eq. (1) is then the summation of the algorithm-specific regularizer $-\mathcal{J}_{R_0}$ and $\mathcal{J}_{R_1}, \mathcal{J}_{R_2}$, which are commonly shared across most HNE algorithms in the proximity-preserving group. We summarize the choices of $w_{uv}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ and $\mathcal{J}_{R_0}$ in Table 1.
3.2 Message-Passing Methods
Each node in a network can have attribute information
represented as a feature vector xu. Message-passing meth-
ods aim to learn node embeddings eu based on xu by aggre-
gating the information from u’s neighbors. In recent studies,
Graph Neural Networks (GNNs) [37] are widely adopted to
facilitate this aggregation/message-passing process.
R-GCN [57]. R-GCN has $K$ convolutional layers. The initial node representation $\mathbf{h}^{(0)}_u$ is simply the node feature $\mathbf{x}_u$. In the $k$-th convolutional layer, each representation vector is updated by accumulating the vectors of neighboring nodes through a normalized sum:
$$\mathbf{h}^{(k+1)}_u = \sigma\left(\sum_{l \in T_E} \sum_{v \in N_l(u)} \frac{1}{|N_l(u)|} W^{(k)}_l \mathbf{h}^{(k)}_v + W^{(k)}_0 \mathbf{h}^{(k)}_u\right).$$
Different from regular GCNs [37], R-GCN considers edge heterogeneity by learning multiple convolution matrices $W$, each corresponding to one edge type. During message passing, neighbors under the same edge type are aggregated and normalized first. The node embedding is the output of the $K$-th layer (i.e., $\mathbf{e}_v = \mathbf{h}^{(K)}_v$).
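For illustration, the layer update above can be sketched in numpy as follows; this is our simplified rendition (with ReLU as $\sigma$ and explicit per-relation weight matrices), not the authors' released implementation:

```python
import numpy as np

# One R-GCN layer: per-relation mean aggregation of neighbor representations
# plus a self-loop transform, followed by a nonlinearity.
def rgcn_layer(h, neighbors_by_type, W_rel, W_self):
    """h: (n, d_in); neighbors_by_type: {l: {u: [v, ...]}};
    W_rel: {l: (d_in, d_out) matrix}; W_self: (d_in, d_out) matrix."""
    out = h @ W_self                           # W_0^{(k)} h_u^{(k)} term
    for l, adj in neighbors_by_type.items():
        for u, nbrs in adj.items():
            if nbrs:                           # (1/|N_l(u)|) sum_v W_l^{(k)} h_v^{(k)}
                out[u] += h[nbrs].mean(axis=0) @ W_rel[l]
    return np.maximum(out, 0)                  # sigma chosen as ReLU here
```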
In unsupervised settings, message-passing approaches use link prediction as their downstream task to train GNNs. Specifically, the likelihood of observing the edges in the heterogeneous network is maximized. R-GCN uses negative sampling and a cross-entropy loss, which is the approximation of the following objective:
$$J = \sum_{l \in T_E} \sum_{u,v \in V} w^l_{uv} \log \frac{\exp(\mathbf{e}_u^T A_l \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T A_l \mathbf{e}_v)},$$
where $w^l_{uv} = \mathbb{1}_{(u,v) \in E_l}$.
| Algorithm | w_uv / w^l_uv / w^M_uv | d(e_u, e_v) | J_R0 | Applications |
|---|---|---|---|---|
| MRWNN [85] | number of times that u ∈ C(v) in homogeneous random walks | ‖e_u − e_v‖² | −∑_v ‖e_v − f_ENC(X_v)‖² | image retrieval |
| metapath2vec [16] | number of times that u ∈ C(v) in heterogeneous random walks following meta-path M (Eq. (2)) | ‖e_u − e_v‖² | N/A | classification, clustering, link prediction and recommendation |
| SHNE [99] | same as metapath2vec | ‖e_u − e_v‖², e_u = f_ENC(x_u) | N/A | same as metapath2vec |
| HHNE [83] | same as metapath2vec | d_D(e_u, e_v) | N/A | same as metapath2vec |
| GHE [11] | number of meta-path instances between u and v following M | ‖e_u − e_v‖² | N/A | author identification |
| HIN2Vec [22] | same as GHE | ‖√A_M (e_u − e_v)‖² | N/A | classification, clustering, link prediction and recommendation |
| MNE [100] | number of times that u ∈ C(v) in homogeneous random walks in (V, E_l) | ‖e_u − e_{v,l}‖² | N/A | same as HIN2Vec |
| JUST [35] | number of times that u ∈ C(v) in Jump/Stay random walks | ‖e_u − e_v‖² | N/A | same as HIN2Vec |
| HNE [8] | 1_{(u,v)∈E} | ‖e_u − e_v‖², e_u = A_{φ(u)} x_u | −∑_{o∈T_V} ‖A_o‖²_F | text classification and image retrieval |
| PTE [71] | edge weight of (u, v) | ‖e_u − e_v‖² | N/A | text classification |
| CMF [108] | edge weight of (u, v) | ‖e_u − e_v‖² | −∑_v ‖e_v‖² | relatedness measurement of Wikipedia entities |
| HEBE [27] | edge weight of hyperedge (u, C) [27] | ‖e_u − e_C‖², e_C = ∑_{v∈C} e_v / \|C\| | N/A | event embedding |
| Phine [94] | number of times that u and v co-occur in a semantic pattern | ‖e_u − e_v‖² | supervised loss −\|ŷ − y\|² | user rating prediction |
| MVE [53] | edge weight of (u, v) with type l | ‖e_{u,l} − e_v‖² | −∑_v ∑_l λ_{v,l} ‖e_{v,l} − e_v‖² | classification and link prediction |
| AspEm [66] | edge weight of (u, v) with type l | ‖e_{u,a} − e_{v,a}‖², a: aspect | N/A | aspect mining |
| HEER [67] | edge weight of (u, v) with type l | ‖√A_l (e_u − e_v)‖² | N/A | relation prediction in knowledge graphs |

Table 1: A summary of proximity-preserving based HNE algorithms. (Additional notations: f_ENC: a function to encode text/image/attribute information; d_D: distance between two points in the Poincaré ball.)
HAN [82]. Instead of one-hop neighbors, HAN utilizes meta-paths to model higher-order proximity. Given a meta-path $\mathcal{M}$, the representation of node $u$ is aggregated from its meta-path based neighbors $N_{\mathcal{M}}(u) = \{u\} \cup \{v \mid v \text{ connects with } u \text{ via meta-path } \mathcal{M}\}$. HAN proposes an attention mechanism to learn the weights of different neighbors:
$$\alpha^{\mathcal{M}}_{uv} = \frac{\exp\left(\sigma(\mathbf{a}_{\mathcal{M}}^T [\mathbf{x}'_u \| \mathbf{x}'_v])\right)}{\sum_{v' \in N_{\mathcal{M}}(u)} \exp\left(\sigma(\mathbf{a}_{\mathcal{M}}^T [\mathbf{x}'_u \| \mathbf{x}'_{v'}])\right)}, \quad \mathbf{h}^{\mathcal{M}}_u = \sigma\left(\sum_{v \in N_{\mathcal{M}}(u)} \alpha^{\mathcal{M}}_{uv} \mathbf{x}'_v\right),$$
where $\mathbf{a}_{\mathcal{M}}$ is the node-level attention vector of $\mathcal{M}$; $\mathbf{x}'_u = M_{\phi(u)} \mathbf{x}_u$ is the projected feature vector of node $u$; $\|$ is the concatenation operator. Given the meta-path specific embedding $\mathbf{h}^{\mathcal{M}}_u$, HAN uses semantic-level attention to weigh different meta-paths:
$$\beta_{\mathcal{M}} = \frac{\exp\left(\frac{1}{|V|}\sum_{v \in V} \mathbf{q}^T \tanh(W \mathbf{h}^{\mathcal{M}}_v + \mathbf{b})\right)}{\sum_{\mathcal{M}'} \exp\left(\frac{1}{|V|}\sum_{v \in V} \mathbf{q}^T \tanh(W \mathbf{h}^{\mathcal{M}'}_v + \mathbf{b})\right)}, \quad \mathbf{e}_u = \sum_{\mathcal{M}} \beta_{\mathcal{M}} \mathbf{h}^{\mathcal{M}}_u,$$
where $\mathbf{q}$ is the semantic-level attention vector.
Very recently, [54] proposed HDGI to improve unsuper-
vised training based on HAN, and [33] proposed HGT to
enable temporal encoding for dynamic networks and mini-
batch sampling for scalable training. Since HAN is mainly
designed for semi-supervised node classification, when we
learn node embeddings in unsupervised settings (like many
proximity-preserving methods), we adopt the widely used
link prediction loss in R-GCN and GraphSAGE [28].
HetGNN [98]. HetGNN assumes that each node is associated with different features (e.g., categorical, textual and visual). It first encodes the content of each node into a vector $\mathbf{h}^s_u$, and then adopts a node type-aware aggregation function to collect information from neighbors:
$$\mathbf{h}^s_u = f_{\text{ENC}}(\mathbf{x}_u), \quad \mathbf{h}^o_u = f_{\text{AGG}}(\{\mathbf{h}^s_v \mid v \in N_{\text{RWR}}(u), \phi(v) = o\}),$$
where $N_{\text{RWR}}(v)$ is the neighborhood of $v$ defined through random walk with restart [74]. Then HetGNN uses attention over neighborhood node types to get the final embedding, which shares the same spirit as HAN:
$$\alpha^o_u = \frac{\exp\left(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{h}^o_u \| \mathbf{h}^s_u])\right)}{\sum_{o' \in T_V \cup \{s\}} \exp\left(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{h}^{o'}_u \| \mathbf{h}^s_u])\right)}, \quad \mathbf{e}_u = \sum_{o \in T_V \cup \{s\}} \alpha^o_u \mathbf{h}^o_u.$$
The objective of HetGNN is the same as that of PTE.
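For intuition, the type-level attention combination above can be sketched as follows; the variable names and the LeakyReLU slope are our illustrative choices, not the released HetGNN code:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# h_s is the content embedding h_u^s; h_by_type maps each neighbor node type o
# to the aggregated embedding h_u^o; a is the attention vector.
def hetgnn_combine(h_s, h_by_type, a):
    keys = list(h_by_type) + ["self"]
    vecs = dict(h_by_type, self=h_s)          # the "self" slot plays the role of s
    scores = np.array([leaky_relu(a @ np.concatenate([vecs[k], h_s]))
                       for k in keys])
    alpha = np.exp(scores - scores.max())     # softmax over T_V U {s}
    alpha /= alpha.sum()
    return sum(w * vecs[k] for w, k in zip(alpha, keys))
```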
There are some other message-passing approaches. For ex-
ample, HEP [109] aggregates u’s representation from No(u)
(i.e., the node type-aware neighborhood) with an application to e-commerce user alignment; GATNE [7] aggregates u’s
representation from Nl(u) (i.e., the edge type-aware neigh-
borhood) and is applicable for both transductive and induc-
tive network embedding settings.
The objective of message-passing approaches mentioned
above can also be written as Eq. (4), where s(u, v) is a func-
tion of eu and ev. The only difference is that eu here is
aggregated from xv using GNNs. Following the derivation
of proximity-preserving approaches, if we still write JR in
Eq. (1) as the summation of $-\mathcal{J}_{R_0}$, $\mathcal{J}_{R_1}$ and $\mathcal{J}_{R_2}$, we can get exactly the same $\mathcal{J}_{R_1}$ and $\mathcal{J}_{R_2}$ as in Eq. (5).
| Algorithm | w_uv / w^l_uv | d(e_u, e_v) | Aggregation Function | Applications |
|---|---|---|---|---|
| R-GCN [57] | 1_{(u,v)∈E_l} | ‖√A_l (e_u − e_v)‖² | h^{(k+1)}_u = σ(∑_{l∈T_E} ∑_{v∈N_l(u)} (1/\|N_l(u)\|) W^{(k)}_l h^{(k)}_v + W^{(k)}_0 h^{(k)}_u), h^{(0)}_u = x_u, e_u = h^{(K)}_u | entity classification and KB completion |
| HEP [109] | 1_{(u,v)∈E} | ‖√A (e_u − e_v)‖² | h_u = σ(W_{φ(u)} ∑_{o∈T_V} ∑_{v∈N_o(u)} α_{uv} e_v + b_{φ(u)}) | user alignment |
| HAN [82] | edge weight of (u, v) | ‖e_u − e_v‖² | e_u = ∑_M β_M σ(∑_{v∈N_M(u)} α^M_{uv} M_{φ(v)} x_v) | node classification, node clustering, link prediction and recommendation |
| HetGNN [98] | number of times that u ∈ C(v) in random walks | ‖e_u − e_v‖² | e_u = ∑_{o∈T_V∪{s}} α^o_u f_AGG({f_ENC(x_v) \| v ∈ N_RWR(u), φ(v) = o}) | same as HAN |
| GATNE [7] | number of times that u ∈ C(v) in random walks in (V, E_l) | ‖e_{u,l} − e_v‖² | e_{u,l} = α_l W^T_l h^{(K)}_{u,l} a_{u,l} + b_u, h^{(k+1)}_{u,l} = f_AGG({h^{(k)}_{v,l} \| v ∈ N_l(u)}), h^{(0)}_{u,l} = x_u | same as HAN |

Table 2: A summary of message-passing based HNE algorithms. (Additional notations: N_l(u): neighbors of u with edge type l; N_o(u): neighbors of u with node type o; f_AGG: a function to aggregate information from neighbors. For HAN, we assume it adopts the unsupervised loss in GraphSAGE [28]. For GATNE, we show the transductive version.)

Within this group of algorithms, only HEP has an additional reconstruction loss $\mathcal{J}_{R_0} = \sum_v \|\mathbf{e}_v - \mathbf{h}_v\|^2$, while all other algorithms have $\mathcal{J}_{R_0} = 0$. We summarize the choices of $w_{uv}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ and the aggregation function in Table 2.
3.3 Relation-Learning Methods
Each edge in a heterogeneous network can be viewed as
a triplet (u, l, v) composed of two nodes u, v ∈ V and an
edge type l ∈ TE (i.e., entities and relations, in the termi-
nology of KG). The goal of relation-learning methods is to
learn a scoring function sl(u, v) which evaluates an arbitrary
triplet and outputs a scalar to measure the acceptability of
this triplet. This idea is widely adopted in KB embedding.
Since there are surveys of KB embedding algorithms already
[81], we only cover the most popular approaches here and
highlight their connections to HNE.
TransE [3]. TransE assumes that the relation induced by $l$-labeled edges corresponds to a translation of the embedding (i.e., $\mathbf{e}_u + \mathbf{e}_l \approx \mathbf{e}_v$) when $(u, l, v)$ holds. Therefore, the scoring function of TransE is defined as $s_l(u,v) = -\|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|_p$, where $p = 1$ or $2$. The objective is to minimize a margin-based ranking loss
$$J = \sum_{(u,l,v) \in T} \sum_{(u',l,v') \in T'_{(u,l,v)}} \max(0, \gamma - s_l(u,v) + s_l(u',v')), \quad (6)$$
where $T$ is the set of positive triplets (i.e., edges); $T'_{(u,l,v)}$ is the set of corrupted triplets, which are constructed by replacing either $u$ or $v$ with an arbitrary node. Formally,
$$T'_{(u,l,v)} = \{(u',l,v) \mid u' \in V\} \cup \{(u,l,v') \mid v' \in V\}.$$
TransE is the most representative model using a translational distance to define the scoring function. Its extensions include TransH [84], TransR [40], RHINE [41], etc.
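For concreteness, below is a minimal sketch (under the simplifying assumption that each positive triplet is paired with a single corrupted triplet) of the TransE scoring function and the margin-based ranking loss of Eq. (6):

```python
import numpy as np

def transe_score(e_u, e_l, e_v, p=2):
    # s_l(u, v) = -||e_u + e_l - e_v||_p
    return -np.linalg.norm(e_u + e_l - e_v, ord=p)

def margin_ranking_loss(positives, corrupted, gamma=1.0):
    """positives, corrupted: aligned lists of (e_u, e_l, e_v) triplets."""
    return sum(max(0.0, gamma - transe_score(*pos) + transe_score(*neg))
               for pos, neg in zip(positives, corrupted))
```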
DistMult [88]. In contrast to translational distance models [3, 84, 40], DistMult exploits a similarity-based scoring function. Each relation is represented by a diagonal matrix $A_l = \text{diag}(e_{l1}, \ldots, e_{ld})$, and $s_l(u,v)$ is defined using a bilinear function:
$$s_l(u,v) = \mathbf{e}_u^T A_l \mathbf{e}_v.$$
Note that $s_l(u,v) = s_l(v,u)$ for any $u$ and $v$. Therefore, DistMult is mainly designed for symmetric relations. Besides DistMult, RESCAL [46] also uses a bilinear scoring function, but $A_l$ is no longer restricted to be diagonal.
ConvE [15]. ConvE goes beyond simple distance or similarity functions and proposes deep neural models to score a triplet. The score is defined by a convolution over 2D-shaped embeddings. Formally,
$$s_l(u,v) = f(\text{vec}(f([E_u; E_r] * \omega))W)\,\mathbf{e}_v,$$
where $E_u$ and $E_r$ denote the 2D reshaping matrices of the node embedding and relation embedding, respectively; $*$ is the convolution operator.
Other models based on deep neural scoring functions include NTN [68], CACL [48], SACN [60], NKGE [79], etc.
All relation-learning based embedding algorithms mentioned above adopt a margin-based ranking loss with some regularization terms that generalizes Eq. (6):
$$J = \sum_{(u,l,v)} \sum_{(u',l,v')} \max(0, \gamma - s_l(u,v) + s_l(u',v')) + \mathcal{J}_{R_0}. \quad (7)$$
In [51], Qiu et al. point out that the margin-based loss shares a very similar form with the following negative sampling loss:
$$-\sum_{(u,l,v)} w^l_{uv}\left[\log \sigma(s_l(u,v)) - b\,\mathbb{E}_{(u',l,v')}\log \sigma(s_l(u',v'))\right].$$
Following [51], if we use the negative sampling loss to rewrite Eq. (7), we are approximately maximizing
$$J = \sum_{(u,l,v)} w^l_{uv} \log \frac{\exp(s_l(u,v))}{\sum_{(u',l,v')} \exp(s_l(u',v'))} + \mathcal{J}_{R_0}.$$
For translational distance models [3, 84, 40, 41] where $s_l(u,v)$ is described by a distance, this is equivalent to
$$\min -J = \underbrace{\sum_{(u,l,v)} w^l_{uv} \|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|_{1\text{ or }2}}_{d(\mathbf{e}_u, \mathbf{e}_v)} + \underbrace{\sum_{(u,l,v)} w^l_{uv} \log \sum_{(u',l,v')} \exp(s_l(u',v'))}_{\mathcal{J}_{R_1}} - \mathcal{J}_{R_0}. \quad (8)$$
In this case, we can write $\mathcal{J}_R$ as $-\mathcal{J}_{R_0} + \mathcal{J}_{R_1}$.
For neural triplet scorers [15, 68, 48, 60], the forms of $s_l(u,v)$ are more complicated than inner products or distances. In these cases, since distance and proximity can be viewed as reverse metrics, we define $d(\mathbf{e}_u, \mathbf{e}_v)$ as $C - s_l(u,v)$, where $C$ is a constant upper bound of $s_l(\cdot,\cdot)$. The derivation of the loss function is then similar to that of Eq. (8), and we will have $\mathcal{J}_R = -\mathcal{J}_{R_0} + \mathcal{J}_{R_1}$. We summarize the choices of $w^l_{uv}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ and $\mathcal{J}_{R_0}$ in Table 3. Note that for relation-learning methods, $d(\cdot,\cdot)$ may not be a distance metric. For example, $d(\mathbf{e}_u, \mathbf{e}_v) \neq d(\mathbf{e}_v, \mathbf{e}_u)$ in most translational distance models. This is intuitive because $(u, l, v)$ and $(v, l, u)$ often express different meanings.
| Algorithm | w^l_uv | d(e_u, e_v) | J_R0 | Applications |
|---|---|---|---|---|
| TransE [3] | 1_{(u,v)∈E_l} | ‖e_u + e_l − e_v‖ | −∑_l (‖e_l‖ − 1) − ∑_v (‖e_v‖ − 1) | KB completion and relation extraction from text |
| TransH [84] | 1_{(u,v)∈E_l} | ‖e_{u,l} + e_l − e_{v,l}‖², e_{v,l} = e_v − w^T_l e_v w_l | −∑_v [‖e_v‖ − 1]_+ − ∑_l [(w^T_l e_l)² / ‖e_l‖² − ε²]_+ | same as TransE |
| TransR [40] | 1_{(u,v)∈E_l} | ‖e_{u,l} + e_l − e_{v,l}‖², e_{v,l} = A_l e_v | −∑_v [‖e_v‖ − 1]_+ − ∑_v [‖A_l e_v‖ − 1]_+ − ∑_l [‖e_l‖ − 1]_+ | same as TransE |
| RHINE [41] | edge weight of (u, v) with type l | ‖e_u − e_v‖² if l models affiliation; ‖e_u + e_l − e_v‖ if l models interaction | N/A | link prediction and node classification |
| DistMult [88] | 1_{(u,v)∈E_l} | ‖√A_l (e_u − e_v)‖² | −‖Θ‖²₂ | KB completion and rule extraction |
| NTN [68] | 1_{(u,v)∈E_l} | C − e^T_r tanh(e^T_u M e_v + M_{l,1} e_u + M_{l,2} e_v + b_l) | N/A | triplet classification |
| ConvE [15] | 1_{(u,v)∈E_l} | C − f(vec(f([E_u; E_r] ∗ ω)) W) e_v | −‖Θ‖²₂ | KB completion and triplet classification |
| CACL [48] | 1_{(u,v)∈E_l} | C − e^T_v σ(w_u ⊙ e_u + w_r ⊙ e_r + b) + b_0 | N/A | KB completion and triplet classification |
| SACN [60] | 1_{(u,v)∈E_l} | C − f(vec(M(e_u, e_l)) W) e_v | N/A | KB completion |
| NKGE [79] | 1_{(u,v)∈E_l} | same as TransE or ConvE, with e_u = σ(g_u) ⊙ e^s_u + (1 − σ(g_u)) ⊙ e^n_u | −‖Θ‖²₂ when using the TransE objective | KB completion |

Table 3: A summary of relation-learning based HNE algorithms. (Additional notations: f: a non-linear activation function; [x]_+: max{x, 0}; Θ: the set of learned parameters; E_u, E_r: 2D reshaping matrices of e_u and e_r [15]; M(e_u, e_l): a matrix aligning the output vectors from the convolution with all kernels [60].)
4. BENCHMARK
4.1 Dataset Preparation
Towards off-the-shelf evaluation of HNE algorithms in a systematic and comprehensive manner, in this work, we
spare no effort to collect, process, analyze, and publish four
real-world heterogeneous network datasets from different do-
mains, which we aim to set up as a benchmark for existing
and future HNE algorithms.
DBLP. We construct a network of authors, papers, venues,
and phrases from DBLP. Phrases are extracted by the popu-
lar AutoPhrase [61] algorithm from paper texts and further
filtered by human experts. We compute word2vec [44] on all
paper texts and aggregate the word embeddings to get 300-
dim paper and phrase features. Author and venue features
are the aggregations of their corresponding paper features.
We further manually label a relatively small portion of au-
thors into 12 research groups from four research areas by
crawling the web. Each labeled author has only one label.
Yelp. We construct a network of businesses, users, loca-
tions, and reviews from Yelp.4 Nodes are not associated
with any feature, but a large portion of businesses are la-
beled into sixteen categories. Each labeled business has one
or multiple labels.
Freebase. We construct a network of books, films, mu-
sic, sports, people, locations, organizations, and businesses
from Freebase.5 Nodes are not associated with any features,
but a large portion of books are labeled into eight genres of
literature. Each labeled book has only one label.
PubMed. We construct a network of genes, diseases, chem-
icals, and species from PubMed.6 All nodes are extracted
by AutoPhrase [61], typed by AutoNER [62], and further
filtered by human experts. We compute word2vec [44] on
all PubMed papers and aggregate the word embeddings to
get 200-dim features for all types of nodes. We further label
a relatively small portion of diseases into eight categories.
Each labeled disease has only one label.
4https://www.yelp.com/dataset/challenge
5http://www.freebase.com/
6https://www.ncbi.nlm.nih.gov/pubmed/
4.2 Structure Analysis

A summary of the statistics on the four datasets is pro-
vided in Table 4 and Figure 3. As can be observed, the
datasets have different sizes (numbers of nodes and links)
and heterogeneity (numbers and distributions of node/link
types). Moreover, due to the nature of the data sources,
DBLP and PubMed networks are attributed, whereas Yelp
and Freebase networks are abundantly labeled. A combina-
tion of these four datasets thus allows researchers to flexibly
start by testing an HNE algorithm in the most appropriate
settings, and eventually complete an all-around evaluation over all settings.
We also provide comprehensive analysis regarding several of the most widely studied properties of heterogeneous networks, i.e., the degree distribution (Figure 4), clustering coefficient (Figure 5), and number of frequent meta-paths (Figure 6).
In particular, degree distribution is known to signifi-
cantly influence the performance of HNE algorithms due to
the widely used node sampling process, whereas clustering
coefficient impacts HNE algorithms that utilize latent com-
munity structures. Moreover, since many HNE algorithms rely on meta-paths, a more skewed distribution of meta-paths can bias evaluation towards algorithms using fewer meta-paths.
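To illustrate how such per-type statistics can be computed, the following sketch uses networkx on a tiny made-up graph (not one of our benchmark datasets) to group node degrees and clustering coefficients by node type:

```python
import networkx as nx
from collections import Counter

# Tiny toy graph; each node carries a "type" attribute.
G = nx.Graph()
G.add_nodes_from([(0, {"type": "author"}), (1, {"type": "paper"}),
                  (2, {"type": "paper"}), (3, {"type": "venue"})])
G.add_edges_from([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])

degrees, clustering = dict(G.degree()), nx.clustering(G)
for node_type in Counter(t for _, t in G.nodes(data="type")):
    nodes = [n for n, t in G.nodes(data="type") if t == node_type]
    avg_cc = sum(clustering[n] for n in nodes) / len(nodes)
    print(node_type, [degrees[n] for n in nodes], avg_cc)
```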
As we can see, the properties of concern are rather different across the four datasets we prepare. For example, there
are tighter links and more labels in Yelp, while there are
more types of nodes and links in Freebase; compared with
nodes in Freebase and PubMed which clearly follow the long-
tail degree distribution, certain types of nodes in DBLP and
Yelp are always well connected (e.g., phrases in DBLP and
businesses in Yelp), forming more star-shaped subgraphs;
the type-wise clustering coefficients and meta-path distri-
butions are the most skewed in DBLP and most balanced
in PubMed. The set of four datasets together provide a
comprehensive benchmark towards the robustness and gen-
eralizability of various HNE algorithms (as we will also see
in Section 5.2).
4.3 Settings, Tasks, and Metrics
We mainly compare all ten algorithms under the setting
of unsupervised unattributed HNE over all datasets, where
the essential goal is to preserve different types of edges in
the heterogeneous networks. Moreover, for message-passing
algorithms that are particularly designed for attributed and