1 Introduction
2 Generic Paradigm
2.1 Problem Definitions
2.2 Proposed Paradigm
3 Algorithm Taxonomy
3.1 Proximity-Preserving Methods
3.1.1 Random Walk-Based
3.1.2 First/Second-Order Proximity-Based
3.2 Message-Passing Methods
3.3 Relation-Learning Methods
4 Benchmark
4.1 Dataset Preparation
4.2 Structure Analysis
4.3 Settings, Tasks, and Metrics
5 Experimental Evaluations
5.1 Algorithms and Modifications
5.2 Performance Benchmarks
6 Future
7 References
Heterogeneous Network Representation Learning: Survey, Benchmark, Evaluation, and Beyond

Carl Yang* (University of Illinois, Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL, USA; jiyang3@illinois.edu)
Yuxin Xiao* (University of Illinois, Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL, USA; yuxinx2@illinois.edu)
Yu Zhang* (University of Illinois, Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL, USA; yuz9@illinois.edu)
Yizhou Sun (University of California, Los Angeles, 404 Westwood Plaza, Los Angeles, CA, USA; yzsun@cs.ucla.edu)
Jiawei Han (University of Illinois, Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL, USA; hanj@illinois.edu)

arXiv:2004.00216v1 [cs.SI] 1 Apr 2020

ABSTRACT

Since real-world objects and their interactions are often multi-modal and multi-typed, heterogeneous networks have been widely used as a more powerful, realistic, and generic superclass of traditional homogeneous networks (graphs). Meanwhile, representation learning (a.k.a. embedding) has recently been intensively studied and shown effective for various network mining and analytical tasks. Since there is already a broad body of heterogeneous network embedding (HNE) algorithms but no dedicated survey, as the first contribution of this work, we pioneer in providing a unified paradigm for the systematic categorization and analysis of the merits of various existing HNE algorithms. Moreover, existing HNE algorithms, though mostly claimed generic, are often evaluated on different datasets. Though understandable given HNE's strong ties to specific applications, such indirect comparisons largely hinder the proper attribution of improved task performance to effective data preprocessing versus novel technical design, especially considering the many possible ways to construct a heterogeneous network from real-world application data. Therefore, as the second contribution, we create four benchmark datasets with various properties regarding scale, structure, and attribute/label availability, drawn from different sources, towards the comprehensive evaluation of HNE algorithms. As the third contribution, we carefully refactor and amend the implementations of ten popular HNE algorithms, create friendly interfaces for them, and provide all-around comparisons among them over multiple tasks and experimental settings. By putting all existing HNE algorithms under a general and complete paradigm, we aim to provide a universal reference and guideline for the understanding and development of HNE algorithms. Meanwhile, by open-sourcing all data and code, we envision serving the community with an easy-to-use benchmark platform to test and compare the performance of existing and future HNE algorithms.1

* The three authors contributed equally.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. xxx. ISSN 2150-8097.

PVLDB Reference Format:
Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun, Jiawei Han. Heterogeneous Network Representation Learning: Survey, Benchmark, Evaluation, and Beyond. PVLDB, 13(xxx): xxxx-yyyy, 2020.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

Keywords: heterogeneous network embedding, survey, benchmark

1. INTRODUCTION

Networks and graphs constitute a canonical and ubiquitous paradigm for modeling interactive objects, and have drawn significant research attention from various scientific domains [59, 30, 24, 3, 89, 87]. However, real-world objects and interactions are often multi-modal and multi-typed (e.g., authors, papers, venues, and terms in a publication network [69, 65]; users, places, categories, and GPS-coordinates in a location-based social network [101, 91, 94]; and genes, proteins, diseases, and species in a biomedical network [38, 14]). To capture and exploit such node and link heterogeneity, heterogeneous networks have been proposed and widely used in many real-world network mining scenarios, such as meta-path based similarity search [70, 64, 92], node classification and clustering [18, 20, 11], knowledge base completion [68, 48, 103], and recommendation [23, 106, 31].

In the meantime, current research on graphs has largely focused on representation learning (embedding), especially following the pioneering neural network based algorithms that demonstrate compelling empirical evidence of unprecedentedly effective yet efficient graph mining [25, 4, 13]. They aim to convert graph data (e.g., nodes [49, 72, 26, 77, 37, 28, 9, 75], links [107, 1, 50, 96], and subgraphs [47, 93, 97, 45]) into low-dimensional distributed vectors in an embedding space where the graph topological information (e.g., higher-order proximity [5, 76, 105, 34] and structure [55, 102, 42, 17]) is preserved. Such embedding vectors are then directly usable by various downstream machine learning algorithms [58, 39, 10].

1 https://github.com/yangji9181/HNE
[Figure 1: Different heterogeneous networks constructed from the same real-world application data. (a) Plain heterogeneous network. (b) Labeled attributed network.]

Right on the intersection of heterogeneous networks and graph embedding, heterogeneous network embedding (HNE) has recently also received much research attention [8, 85, 108, 16, 66, 67, 27, 22, 90, 35, 104, 57, 52, 99, 7, 98, 32, 83, 95, 82, 41]. Because HNE is often application-driven, many algorithms have been separately developed in different application domains such as search and recommendation [23, 63, 6, 89]. Moreover, as knowledge bases (KBs) also fall under the general umbrella of heterogeneous networks, many KB embedding algorithms can be compared with HNE ones [81, 3, 40, 68, 88, 15, 48, 79, 60].

Unfortunately, various HNE algorithms have been developed in quite disparate communities across academia and industry, and they have never been systematically and comprehensively analyzed, either conceptually or experimentally. In fact, due to the lack of benchmark platforms (with ready-to-use datasets and baselines), researchers often tend to construct their own datasets and reimplement a few of the most popular (sometimes outdated) algorithms for comparison, which renders fair performance evaluation and clear improvement attribution extremely hard, if not impossible.

Simply consider the toy examples of a publication dataset2 in Figure 1. Earlier HNE algorithms like metapath2vec [16] were developed on a heterogeneous network with node types of authors, papers, and venues as in (a). However, one can enrich papers with a large number of terms and topics as additional nodes as in (b), which makes the random walk-based shallow embedding algorithms rather ineffective, but favors neighborhood aggregation-based deep graph neural networks like R-GCN [57]. Moreover, one can further include node attributes like term embeddings and labels like research fields, and make them available only to the semi-supervised inductive learning algorithms, which may introduce even more bias [104, 82, 33, 54]. Eventually, it is often hard to clearly attribute performance gains between technical novelty and data tweaking.

In this work, we first formulate a unified yet flexible mathematical paradigm that generalizes all HNE algorithms, easing the understanding of the critical merits of each model (Section 2). Particularly, based on a uniform taxonomy that clearly categorizes and summarizes the existing models (and likely future models), we propose a generic objective function of network smoothness, and reformulate all existing models into this uniform paradigm while highlighting their individual novel contributions (Section 3). We envision this paradigm to be helpful in guiding the development of future novel HNE algorithms, and in the meantime to facilitate their conceptual contrast with existing ones.

As the second contribution, we deliberately prepare four benchmark heterogeneous network datasets through exhaustive data collection, cleaning, analysis, and curation (Section 4). The datasets cover a wide spectrum of application domains (i.e., publication, recommendation, knowledge base, and biomedicine) and have various properties regarding scale, structure, and attribute/label availability. This diverse set of data, together with a series of different network mining tasks and evaluation metrics, constitutes a systematic and comprehensive benchmark resource for future HNE algorithms.

As the third contribution, we note that many existing HNE algorithms (including some very popular ones) either do not have a flexible implementation (e.g., hard-coded node and edge types, fixed sets of meta-paths), or do not scale to larger networks (e.g., high memory requirements during training), which adds much burden to novel research (i.e., requiring much engineering effort for correct reimplementation). To this end, we select 10 popular HNE algorithms, carefully refactor and scale up the original authors' implementations, and provide additional interfaces for plug-in input of our prepared datasets (Section 5). Based on these easy-to-use and efficient implementations, we then conduct all-around empirical evaluations of the algorithms and report their benchmark performances. The empirical results, while providing much insight into the merits of different models that is consistent with the conceptual analysis in Section 3, also serve as an example utilization of our benchmark platform that can be followed by future studies on HNE.

Note that, although there have been several attempts to survey or benchmark heterogeneous network models [69, 65, 81] and homogeneous graph embedding [25, 4, 13, 86, 19], we make the first research effort dedicated to their intersection. We advocate that our study focusing on HNE is unique, timely, and necessary. Firstly, as we will cover in this work, there has been a significant amount of research on the particular problem of HNE, especially in recent years, and most of it is scattered across different domains, lacking proper connections and comparisons. Secondly, none of the existing surveys has proposed a uniform, mathematically complete paradigm for the conceptual analysis of all HNE models. Thirdly, existing surveys mostly do not provide systematic and comprehensive benchmark evaluation results, nor do they come with benchmark datasets and open-source baselines to facilitate future algorithm development.

The rest of this paper is organized as follows. Section 2 introduces our proposed generic HNE paradigm. Representative models in our survey are then conceptually categorized and analyzed in Section 3. We present in Section 4 our prepared benchmark datasets with insightful analysis. In Section 5, we provide a systematic and comprehensive empirical study over 10 popular HNE algorithms to benchmark the current state-of-the-art of HNE. Section 6 concludes the paper with visions towards future usage of our platform and research on HNE.

2 https://dblp.uni-trier.de/
2. GENERIC PARADIGM

2.1 Problem Definitions

We first give formal definitions of concepts closely related to HNE, starting from network embedding.

Definition 2.1. Network embedding. For a given network $G = \{V, E\}$, where $V$ is the set of nodes (vertices) and $E$ is the set of links (edges), a network embedding is a mapping function $\Phi: V \to \mathbb{R}^{|V| \times d}$, where $d \ll |V|$. This mapping $\Phi$ defines the latent representation (a.k.a. embedding) of each node $v \in V$, which captures the network topological information in $E$.

In most cases, network proximity is the major topological information to be captured. For example, DeepWalk [49] captures random walk-based node proximity; its 2-dim node representations learned on the famous Zachary's Karate network of small groups show a clear correspondence between node positions in the input graph and in the learned embedding space. Various follow-up works have improved or extended DeepWalk, a complete coverage of which is beyond the scope of this work. In this work, we focus on the embedding of heterogeneous networks.

Definition 2.2. Heterogeneous network. A heterogeneous network $H = \{V, E, \phi, \psi\}$ is a network with multiple types of nodes and links. Particularly, within $H$, each node $v_i \in V$ is associated with a node type $\phi(v_i)$, and each link $e_{ij} \in E$ is associated with a link type $\psi(e_{ij})$. It is worth noting that the type of a link $e_{ij}$ automatically defines the types of the nodes $v_i$ and $v_j$ on its two ends.

Heterogeneous networks have been intensively studied due to their power of accommodating multi-modal multi-typed interconnected data. Besides the classic example of DBLP data used in most existing works as well as Figure 1, consider a different yet illustrative example from NYTimes3 in Figure 2. Nodes in this heterogeneous network include news articles, categories, phrases, locations, and datetimes. To illustrate the power of heterogeneous networks, we introduce the concept of meta-path, which has been leveraged by most existing works on heterogeneous network modeling [69, 65].

Definition 2.3. Meta-path. A meta-path is a path defined on the network schema, denoted in the form of $o_1 \xrightarrow{l_1} o_2 \xrightarrow{l_2} \cdots \xrightarrow{l_m} o_{m+1}$, where the $o$'s and $l$'s are node types and link types, respectively.

Each meta-path captures the proximity among the nodes on its two ends from a particular semantic perspective. Continuing with our example of the heterogeneous network from news data in Figure 2, the meta-path article $\xrightarrow{\text{belongs to}}$ category $\xrightarrow{\text{includes}}$ article carries different semantics from article $\xrightarrow{\text{mentions}}$ location $\xrightarrow{\text{mentioned by}}$ article. Thus, the leverage of different meta-paths allows heterogeneous network models to compute multi-modal multi-typed node proximities and relations, which has been shown beneficial to many real-world network mining applications [64, 36, 31].

[Figure 2: Toy example of a heterogeneous network constructed from news data.]

Now we define the main problem of focus in this work, heterogeneous network embedding (HNE), which lies in the intersection of Def 2.1 and Def 2.2.

Definition 2.4. Heterogeneous network embedding. For a given heterogeneous network $H$, a heterogeneous network embedding is a set of mapping functions $\{\Phi_k: V_k \to \mathbb{R}^{|V_k| \times d}\}_{k=1}^K$, where $K$ is the number of node types, $\forall v_i \in V_k, \phi(v_i) = k$, and $d \ll |V|$. Each mapping $\Phi_k$ defines the latent representation (a.k.a. embedding) of all nodes of type $k$, which captures the network topological information regarding the heterogeneous links in $E$.

3 https://www.nytimes.com/
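To make Definitions 2.2 and 2.3 concrete, below is a minimal sketch (ours, not from the paper) of a heterogeneous network with typed nodes and links, and of retrieving neighbors along one step of a meta-path. The node and link types mirror the DBLP example; all names are illustrative.

```python
from collections import defaultdict

class HetNet:
    """A heterogeneous network H = {V, E, phi, psi} stored as typed adjacency lists."""
    def __init__(self):
        self.node_type = {}            # phi: node -> node type
        self.adj = defaultdict(list)   # (node, link type) -> list of neighbors

    def add_node(self, v, ntype):
        self.node_type[v] = ntype

    def add_edge(self, u, v, ltype):
        # psi(e_uv) = ltype; the link type determines the end-point node types
        self.adj[(u, ltype)].append(v)
        self.adj[(v, ltype)].append(u)

    def neighbors(self, v, ltype):
        """N_l(v) = {u : psi(u, v) = l}."""
        return self.adj[(v, ltype)]

# A tiny DBLP-style network: authors write papers, papers are published in venues.
H = HetNet()
for a in ["a1", "a2"]: H.add_node(a, "author")
for p in ["p1", "p2"]: H.add_node(p, "paper")
H.add_node("v1", "venue")
H.add_edge("a1", "p1", "writes"); H.add_edge("a2", "p1", "writes")
H.add_edge("a2", "p2", "writes"); H.add_edge("p1", "v1", "published_in")

# One step of the meta-path author -writes-> paper: the papers of a2.
print(H.neighbors("a2", "writes"))   # ['p1', 'p2']
```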
Compared with homogeneous networks, the definition of topological information in heterogeneous networks is even more diverse. As we will show in Section 3, the major distinctions among different HNE algorithms mostly lie in their different ways of capturing such topological information. Particularly, the leverage of meta-paths as in Def 2.3 often plays an essential role, since many popular HNE algorithms exactly aim to model the different proximities indicated by meta-paths [16, 22, 83, 7, 99, 82, 104, 63].

2.2 Proposed Paradigm

In this work, we stress that one of the most important principles underlying HNE (as well as most other scenarios of network modeling and mining) is homophily [43]. Particularly, in the network embedding setting, homophily can be translated as "nodes close on a network should have similar embeddings", which matches the requirement of Def 2.1. In fact, we further find intrinsic connections between the well-perceived homophily principle and the widely used smoothness enforcement technique on networks, which leads to a generic mathematical paradigm covering most existing, and likely many future, HNE algorithms.

Based on earlier well-established concepts underlying network modeling and embedding learning [73, 56, 2, 110, 111], we introduce the following key objective function of network smoothness enforcement:

$$\mathcal{J} = \sum_{u,v \in V} w_{uv}\, d(\mathbf{e}_u, \mathbf{e}_v) + \mathcal{J}_R, \quad (1)$$

where $\mathbf{e}_u = \Phi(u)$ and $\mathbf{e}_v = \Phi(v)$ are the node embedding vectors to be learned, $w_{uv}$ is the proximity weight, $d(\cdot,\cdot)$ is the embedding distance function, and $\mathcal{J}_R$ denotes possible additional objectives such as regularizers. All three of these components can be defined and implemented differently by particular HNE algorithms.
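As a minimal numerical illustration of Eq. (1) (ours, not from the paper), the following sketch evaluates the smoothness objective for a toy proximity matrix, with squared Euclidean distance as $d(\cdot,\cdot)$ and the regularizer $\mathcal{J}_R$ omitted; any concrete HNE algorithm instantiates $w_{uv}$, $d$, and $\mathcal{J}_R$ differently.

```python
import numpy as np

def smoothness_objective(W, E):
    """J = sum_{u,v} w_uv * d(e_u, e_v), with d chosen as squared Euclidean distance.
    W: (n, n) proximity weights; E: (n, d) node embeddings. J_R is set to 0 here."""
    n = W.shape[0]
    J = 0.0
    for u in range(n):
        for v in range(n):
            J += W[u, v] * np.sum((E[u] - E[v]) ** 2)
    return J

rng = np.random.default_rng(0)
W = rng.random((4, 4))           # toy proximity weights w_uv
E = rng.normal(size=(4, 2))      # toy 2-dim embeddings
print(smoothness_objective(W, E))
```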
3. ALGORITHM TAXONOMY

In this section, we find a universal taxonomy for existing HNE algorithms with three categories based on their common objectives, and elaborate in detail how they all fit into our paradigm of Eq. (1). The main challenge of instantiating Eq. (1) on heterogeneous networks is the consideration of complex interactions regarding multi-typed links and higher-order meta-paths. In fact, our Eq. (1) also readily generalizes to homogeneous networks, though that is beyond the scope of this work.

3.1 Proximity-Preserving Methods

As mentioned above, one basic goal of network embedding is to capture network topological information. This can be achieved by preserving different types of proximity among nodes. There are two major categories of proximity-preserving methods in HNE: random walk-based approaches (inspired by DeepWalk [49]) and first/second-order proximity-based ones (inspired by LINE [72]).

3.1.1 Random Walk-Based

metapath2vec [16]. Following homogeneous network embedding [49, 26], metapath2vec utilizes the node paths traversed by meta-path guided random walks to model the context of a node regarding heterogeneous semantics. Formally, given a meta-path $\mathcal{M} = o_1 \xrightarrow{l_1} o_2 \xrightarrow{l_2} \cdots \xrightarrow{l_{m-1}} o_m$, the transition probability at step $i$ is defined as

$$p(v_{i+1} \mid v_i, \mathcal{M}) = \begin{cases} \frac{1}{|N_{l_i}(v_i)|} & \phi(v_{i+1}) = o_{i+1},\ \psi(v_i, v_{i+1}) = l_i \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where $N_l(v) = \{u \mid \psi(u,v) = l\}$ denotes the neighbors of $v$ associated with edge type $l$. Assume $\mathcal{P} = \{\mathcal{P}_1, \ldots, \mathcal{P}_M\}$ is the set of generated random walk sequences. The objective of metapath2vec is

$$\mathcal{J} = \sum_{v \in V} \sum_{u \in C(v)} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}, \quad (3)$$

where $C(v)$ is the set of contexts (i.e., skip-grams) of $v$ in $\mathcal{P}$. For example, if $\mathcal{P}_1 = v_1 v_2 v_3 v_4 v_5 \ldots$ and the context window size is 2, then $\{v_1, v_2, v_4, v_5\} \subseteq C(v_3)$. Let $w_{uv}$ be the number of times that $u \in C(v)$ in $\mathcal{P}$. Eq. (3) can be rewritten as

$$\mathcal{J} = \sum_{u,v \in V} w_{uv} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}.$$

The denominator of this objective requires summing over all nodes and is computationally expensive, so in actual computation it is approximated using negative sampling [44]. In this paper, we still analyze the original objective.
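The meta-path guided transition of Eq. (2) can be sketched as follows (an illustrative re-implementation, not the authors' code), reusing the HetNet structure from the earlier sketch: at step $i$, the walker samples uniformly from the current node's neighbors that are reachable via link type $l_i$ and have node type $o_{i+1}$.

```python
import random

def metapath_walk(H, start, metapath, walk_length):
    """Meta-path guided random walk following Eq. (2).
    metapath: list of (link_type, next_node_type) pairs, cycled until walk_length."""
    walk, v = [start], start
    for i in range(walk_length - 1):
        ltype, next_otype = metapath[i % len(metapath)]
        # candidates = {u in N_l(v) : phi(u) = o_{i+1}}; each has probability 1/|candidates|
        cands = [u for u in H.neighbors(v, ltype) if H.node_type[u] == next_otype]
        if not cands:            # dead end: stop the walk early
            break
        v = random.choice(cands)
        walk.append(v)
    return walk

# Walk guided by the meta-path author -writes-> paper -writes-> author (APA).
apa = [("writes", "paper"), ("writes", "author")]
print(metapath_walk(H, "a1", apa, walk_length=5))
```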
HIN2Vec [22]. HIN2Vec considers the probability that there is a meta-path $\mathcal{M}$ between nodes $u$ and $v$. Specifically,

$$p(\mathcal{M} \mid u, v) = \sigma\left( \mathbf{1}^T \left( W_X^T \mathbf{u} \odot W_Y^T \mathbf{v} \odot f_{01}(W_R^T \mathbf{m}) \right) \right),$$

where $\mathbf{1}$ is an all-ones vector, $\odot$ is the Hadamard product, and $f_{01}$ is a normalization function. Here $\mathbf{e}_u = W_X^T \mathbf{u}$, $\mathbf{e}_v = W_Y^T \mathbf{v}$, and $\mathbf{e}_{\mathcal{M}} = f_{01}(W_R^T \mathbf{m})$ can be viewed as the embeddings of $u$, $v$, and $\mathcal{M}$, respectively. Let $A_{\mathcal{M}} = \mathrm{diag}(e_{\mathcal{M}1}, \ldots, e_{\mathcal{M}d})$. We have

$$p(\mathcal{M} \mid u, v) = \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v).$$

Since $\sigma$ is the sigmoid function, we have $1 - p(\mathcal{M} \mid u, v) = 1 - \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v) = \sigma(-\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v)$. HIN2Vec generates positive tuples $(u, v, \mathcal{M})$ (i.e., $u$ connects with $v$ via meta-path $\mathcal{M}$) using homogeneous random walks [49] regardless of node/link types. For each positive tuple $(u, v, \mathcal{M})$, it generates several negative tuples by replacing $u$ with a random node $u'$. Its objective is

$$\mathcal{J}_0 = \sum_{(u,v,\mathcal{M})} \log p(\mathcal{M} \mid u, v) + \sum_{(u',v,\mathcal{M})} \log\left(1 - p(\mathcal{M} \mid u', v)\right) = \sum_{(u,v,\mathcal{M})} \log \sigma(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v) + \sum_{(u',v,\mathcal{M})} \log \sigma(-\mathbf{e}_{u'}^T A_{\mathcal{M}} \mathbf{e}_v).$$

This is actually the negative sampling approximation of the following objective:

$$\mathcal{J} = \sum_{\mathcal{M}} \sum_{u,v \in V} w_{uv}^{\mathcal{M}} \log \frac{\exp(\mathbf{e}_u^T A_{\mathcal{M}} \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T A_{\mathcal{M}} \mathbf{e}_v)},$$

where $w_{uv}^{\mathcal{M}}$ is the number of path instances between $u$ and $v$ following meta-path $\mathcal{M}$.

SHNE [99]. SHNE improves metapath2vec by incorporating additional node information. It assumes that some types of nodes may have additional text information (denoted as $x_v$), and proposes a deep encoder $f_{ENC}$ based on GRU [12]. Namely, if $v$ has text information, $\mathbf{e}_v = f_{ENC}(x_v)$; otherwise, $\mathbf{e}_v$ is a typical learnable node embedding. The objective of SHNE is the same as that of metapath2vec, and it is straightforward to extend it to incorporate other types of node information, such as categorical attributes and images, by leveraging domain-specific deep encoders, which is also considered in HNE [8].

Other random walk-based approaches are summarized in Table 1. To be specific, MRWNN [85] incorporates content priors into DeepWalk embedding; HHNE [83] extends metapath2vec to the hyperbolic space; GHE [11] proposes a semi-supervised meta-path weighting technique; MNE [100] conducts random walks separately for each view in a multi-view network; JUST [35] proposes Jump/Stay random walks that do not rely on pre-defined meta-paths.

3.1.2 First/Second-Order Proximity-Based

PTE [71]. PTE proposes to decompose a heterogeneous network into multiple bipartite networks, each of which describes one edge type. Its objective is the sum of log-likelihoods over all bipartite networks:

$$\mathcal{J} = \sum_{l \in T_E} \sum_{u,v \in V} w_{uv}^l \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)} = \sum_{u,v \in V} w_{uv} \log \frac{\exp(\mathbf{e}_u^T \mathbf{e}_v)}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u'}^T \mathbf{e}_v)}.$$

Here $T_E$ is the set of edge types; $w_{uv}^l$ is the type-$l$ edge weight of $(u,v)$ (if there is no edge between $u$ and $v$ with type $l$, then $w_{uv}^l = 0$); $w_{uv} = \sum_l w_{uv}^l$ is the total edge weight between $u$ and $v$.

AspEm [66]. AspEm assumes that each heterogeneous network has multiple aspects, and each aspect is defined as a subgraph of the network schema [70]. An incompatibility measure is proposed to select appropriate aspects for embedding learning. Given an aspect $a$, its objective is

$$\mathcal{J} = \sum_{l \in T_E^a} \frac{1}{Z_l} \sum_{u,v \in V} w_{uv}^l \log \frac{\exp(\mathbf{e}_{u,a}^T \mathbf{e}_{v,a})}{\sum_{u' \in V_{\phi(u)}} \exp(\mathbf{e}_{u',a}^T \mathbf{e}_{v,a})},$$

where $T_E^a$ is the set of edge types in aspect $a$; $Z_l = \sum_{u,v} w_{uv}^l$ is a normalization factor; $\mathbf{e}_{u,a}$ is the aspect-specific embedding of $u$.

HEER [67]. HEER extends PTE by considering typed closeness. Specifically, each edge type $l$ has an embedding $\boldsymbol{\mu}_l$, and its objective is

$$\mathcal{J} = \sum_{l \in T_E} \sum_{u,v \in V} w_{uv}^l \log \frac{\exp(\boldsymbol{\mu}_l^T \mathbf{g}_{uv})}{\sum_{(u',v') \in P_l(u,v)} \exp(\boldsymbol{\mu}_l^T \mathbf{g}_{u'v'})},$$

where $\mathbf{g}_{uv}$ is the edge embedding of $(u,v)$; $P_l(u,v) = \{(u,v') \mid \psi(u,v') = l\} \cup \{(u',v) \mid \psi(u',v) = l\}$. In [67], $\mathbf{g}_{uv}$ has different definitions for directed and undirected edges based on the Hadamard product. To simplify our discussion, we assume $\mathbf{g}_{uv} = \mathbf{e}_u \odot \mathbf{e}_v$. Let $A_l = \mathrm{diag}(\mu_{l1}, \ldots, \mu_{ld})$. We have $\boldsymbol{\mu}_l^T \mathbf{g}_{uv} = \mathbf{e}_u^T A_l \mathbf{e}_v$, and

$$\mathcal{J} = \sum_{l \in T_E} \sum_{u,v \in V} w_{uv}^l \log \frac{\exp(\mathbf{e}_u^T A_l \mathbf{e}_v)}{\sum_{(u',v') \in P_l(u,v)} \exp(\mathbf{e}_{u'}^T A_l \mathbf{e}_{v'})}.$$

Other first/second-order proximity-based approaches are summarized in Table 1. To be specific, HNE [8] introduces a node type-aware content encoder; CMF [108] performs joint matrix factorization over the decomposed bipartite networks; HEBE [27] preserves proximity regarding each meta-graph; Phine [94] combines additional regularization towards semi-supervised training; MVE [53] proposes an attention-based framework to consider multiple views of the same nodes.

There are also studies adopting other forms of objectives to characterize first/second-order proximities. For example, SHINE [78] uses the reconstruction loss of autoencoders inspired by SDNE [76]; DRL [52] proposes to learn one type of edges at each step and uses deep reinforcement learning to select the edge type for the next step; HINSE [90] adopts spectral embedding based on adjacency matrices with different meta-graphs; HeGAN [32] proposes an adversarial framework with a relation type-aware discriminator.

Based on the above discussions, the objective of most proximity-preserving methods can be unified as

$$\max \mathcal{J} = \sum_{u,v} w_{uv} \log \frac{\exp(s(u,v))}{\sum_{u'} \exp(s(u',v))} + \mathcal{J}_{R_0} = \sum_{u,v} w_{uv}\, s(u,v) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0}. \quad (4)$$

Here, $w_{uv}$ can be specific to a meta-path $\mathcal{M}$ or an edge type $l$ (denoted as $w_{uv}^{\mathcal{M}}$ or $w_{uv}^l$ accordingly); $\mathcal{J}_{R_0}$ is an algorithm-specific regularizer (summarized in Table 1); $s(u,v)$ is the proximity function between $u$ and $v$. Note that in most cases we can write $s(u,v)$ as $f(\mathbf{e}_u)^T f(\mathbf{e}_v)$. For example, $f(\mathbf{e}_u) = \mathbf{e}_u$ in metapath2vec, PTE, etc.; $f(\mathbf{e}_u) = \sqrt{A_{\mathcal{M}}}\, \mathbf{e}_u$ in HIN2Vec; $f(\mathbf{e}_u) = \sqrt{A_l}\, \mathbf{e}_u$ in HEER. In these cases,

$$\mathcal{J} = \sum_{u,v} w_{uv} f(\mathbf{e}_u)^T f(\mathbf{e}_v) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0} = -\sum_{u,v} \frac{w_{uv}}{2} \|f(\mathbf{e}_u) - f(\mathbf{e}_v)\|^2 + \sum_{u,v} \frac{w_{uv}}{2} \left( \|f(\mathbf{e}_u)\|^2 + \|f(\mathbf{e}_v)\|^2 \right) - \sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v)) + \mathcal{J}_{R_0}.$$

The second step holds because $\|x - y\|^2 = (x - y)^T (x - y) = \|x\|^2 + \|y\|^2 - 2 x^T y$. Therefore, our goal is equivalent to

$$\min -\mathcal{J} = \underbrace{\sum_{u,v} \frac{w_{uv}}{2} \|f(\mathbf{e}_u) - f(\mathbf{e}_v)\|^2}_{\sum_{u,v} w_{uv}\, d(\mathbf{e}_u, \mathbf{e}_v)} - \underbrace{\sum_{u,v} \frac{w_{uv}}{2} \left( \|f(\mathbf{e}_u)\|^2 + \|f(\mathbf{e}_v)\|^2 \right)}_{\mathcal{J}_{R_1}} + \underbrace{\sum_{u,v} w_{uv} \log \sum_{u'} \exp(s(u',v))}_{\mathcal{J}_{R_2}} - \mathcal{J}_{R_0}. \quad (5)$$

Here $\mathcal{J}_{R_1}$ and $\mathcal{J}_{R_2}$ are two regularizers: without $\mathcal{J}_{R_1}$, $d(\mathbf{e}_u, \mathbf{e}_v)$ can be trivially minimized by letting $\|f(\mathbf{e}_v)\| \to 0$; without $\mathcal{J}_{R_2}$, it can be trivially minimized by letting $\mathbf{e}_v \equiv \mathbf{c}$ ($\forall v \in V$). $\mathcal{J}_R$ in Eq. (1) is then the summation of the algorithm-specific regularizer $-\mathcal{J}_{R_0}$ and the terms $-\mathcal{J}_{R_1}$, $\mathcal{J}_{R_2}$, which are commonly shared across most HNE algorithms in the proximity-preserving group. We summarize the choices of $w_{uv}$, $d(\mathbf{e}_u, \mathbf{e}_v)$, and $\mathcal{J}_{R_0}$ in Table 1.

Algorithm | $w_{uv}$ / $w_{uv}^l$ / $w_{uv}^{\mathcal{M}}$ | $d(\mathbf{e}_u, \mathbf{e}_v)$ | $\mathcal{J}_{R_0}$ | Applications
MRWNN [85] | number of times that $u \in C(v)$ in homogeneous random walks | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | $-\sum_v \|\mathbf{e}_v - f_{ENC}(X_v)\|^2$ | image retrieval
metapath2vec [16] | number of times that $u \in C(v)$ in heterogeneous random walks following $\mathcal{M}$ (Eq. (2)) | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | N/A | classification, clustering, link prediction and recommendation
SHNE [99] | same as metapath2vec | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$, $\mathbf{e}_u = f_{ENC}(x_u)$ | N/A | classification, clustering, link prediction and recommendation
HHNE [83] | same as metapath2vec | $d_D(\mathbf{e}_u, \mathbf{e}_v)$ | N/A | classification, clustering, link prediction and recommendation
GHE [11] | number of meta-path instances between $u$ and $v$ following $\mathcal{M}$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | N/A | author identification
HIN2Vec [22] | number of meta-path instances between $u$ and $v$ following $\mathcal{M}$ | $\|\sqrt{A_{\mathcal{M}}} (\mathbf{e}_u - \mathbf{e}_v)\|^2$ | N/A | classification, clustering, link prediction and recommendation
MNE [100] | number of times that $u \in C(v)$ in homogeneous random walks in $(V, E_l)$ | $\|\mathbf{e}_u - \mathbf{e}_{v,l}\|^2$ | N/A | classification, clustering, link prediction and recommendation
JUST [35] | number of times that $u \in C(v)$ in Jump/Stay random walks | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | N/A | classification, clustering, link prediction and recommendation
HNE [8] | $1_{(u,v) \in E}$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$, $\mathbf{e}_u = A_{\phi(u)} x_u$ | $-\sum_{o \in T_V} \|A_o\|_F^2$ | text classification and image retrieval
PTE [71] | edge weight of $(u,v)$ with type $l$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | N/A | text classification
CMF [108] | edge weight of $(u,v)$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | $-\sum_v \|\mathbf{e}_v\|^2$ | relatedness measurement of Wikipedia entities
HEBE [27] | edge weight of hyperedge $(u, C)$ [27] | $\|\mathbf{e}_u - \mathbf{e}_C\|^2$, $\mathbf{e}_C = \sum_{v \in C} \mathbf{e}_v / |C|$ | N/A | event embedding
Phine [94] | number of times that $u$ and $v$ co-occur in a semantic pattern | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | supervised loss $-|\hat{y} - y|^2$ | user rating prediction
MVE [53] | edge weight of $(u,v)$ with type $l$ | $\|\mathbf{e}_{u,l} - \mathbf{e}_v\|^2$ | $-\sum_v \sum_l \lambda_{v,l} \|\mathbf{e}_{v,l} - \mathbf{e}_v\|^2$ | classification and link prediction
AspEm [66] | edge weight of $(u,v)$ with type $l$ | $\|\mathbf{e}_{u,a} - \mathbf{e}_{v,a}\|^2$, $a$: aspect | N/A | aspect mining
HEER [67] | edge weight of $(u,v)$ with type $l$ | $\|\sqrt{A_l} (\mathbf{e}_u - \mathbf{e}_v)\|^2$ | N/A | relation prediction in knowledge graphs

Table 1: A summary of proximity-preserving based HNE algorithms. (Additional notations: $f_{ENC}$: a function to encode text/image/attribute information; $d_D$: distance between two points in the Poincaré ball.)
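As a minimal sketch of how the unified objective in Eq. (4) is typically optimized in practice, the following approximates the partition function with negative sampling [44], using $s(u,v) = \mathbf{e}_u^T \mathbf{e}_v$ as in metapath2vec and PTE. This is our illustration, not any specific author implementation.

```python
import numpy as np

def neg_sampling_loss(E, pos_pairs, num_neg=5, rng=np.random.default_rng(0)):
    """Approximate -J of Eq. (4): -log sigma(s(u,v)) - sum_k log sigma(-s(u_k', v)),
    with s(u, v) = e_u . e_v and negatives u_k' drawn uniformly at random."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    n = E.shape[0]
    loss = 0.0
    for u, v in pos_pairs:
        loss -= np.log(sigmoid(E[u] @ E[v]))
        for uk in rng.integers(0, n, size=num_neg):   # uniform negative sampling
            loss -= np.log(sigmoid(-(E[uk] @ E[v])))
    return loss / len(pos_pairs)

E = np.random.default_rng(1).normal(scale=0.1, size=(10, 8))   # toy embeddings
print(neg_sampling_loss(E, pos_pairs=[(0, 1), (2, 3)]))
```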
3.2 Message-Passing Methods

Each node in a network can have attribute information represented as a feature vector $\mathbf{x}_u$. Message-passing methods aim to learn node embeddings $\mathbf{e}_u$ based on $\mathbf{x}_u$ by aggregating information from $u$'s neighbors. In recent studies, Graph Neural Networks (GNNs) [37] are widely adopted to facilitate this aggregation/message-passing process.

R-GCN [57]. R-GCN has $K$ convolutional layers. The initial node representation $\mathbf{h}_u^{(0)}$ is simply the node feature $\mathbf{x}_u$. In the $k$-th convolutional layer, each representation vector is updated by accumulating the vectors of neighboring nodes through a normalized sum:

$$\mathbf{h}_u^{(k+1)} = \sigma\left( \sum_{l \in T_E} \sum_{v \in N_l(u)} \frac{1}{|N_l(u)|} W_l^{(k)} \mathbf{h}_v^{(k)} + W_0^{(k)} \mathbf{h}_u^{(k)} \right).$$

Different from regular GCNs [37], R-GCN considers edge heterogeneity by learning multiple convolution matrices $W$, each corresponding to one edge type. During message passing, neighbors under the same edge type are aggregated and normalized first. The node embedding is the output of the $K$-th layer (i.e., $\mathbf{e}_v = \mathbf{h}_v^{(K)}$).

In unsupervised settings, message-passing approaches use link prediction as their downstream task to train GNNs. To be specific, the likelihood of observing edges in the heterogeneous network is maximized. R-GCN uses negative sampling and cross-entropy loss, which is the approximation of the following objective:

$$\mathcal{J} = \sum_{l \in T_E} \sum_{u,v \in V} w_{uv}^l \log \frac{\exp(\mathbf{e}_u^T A_l \mathbf{e}_v)}{\sum_{u' \in V} \exp(\mathbf{e}_{u'}^T A_l \mathbf{e}_v)},$$

where $w_{uv}^l = 1_{(u,v) \in E_l}$.
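A minimal PyTorch sketch of one relational convolution layer following the update rule above (ours and simplified; e.g., it omits the basis decomposition used in the R-GCN paper). Edges are grouped per type, and each type gets its own weight matrix.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """h_u^(k+1) = sigma( sum_l sum_{v in N_l(u)} 1/|N_l(u)| W_l h_v + W_0 h_u )."""
    def __init__(self, in_dim, out_dim, num_edge_types):
        super().__init__()
        self.W = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False)
                               for _ in range(num_edge_types))
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)   # self-loop weight

    def forward(self, h, edges_by_type):
        # edges_by_type[l] is a pair (src, dst) of index tensors for edge type l
        out = self.W0(h)
        for l, (src, dst) in enumerate(edges_by_type):
            msg = self.W[l](h[src])                        # transform neighbor states
            agg = torch.zeros_like(out).index_add_(0, dst, msg)
            deg = torch.zeros(h.size(0)).index_add_(
                0, dst, torch.ones_like(dst, dtype=torch.float))
            out = out + agg / deg.clamp(min=1).unsqueeze(1)  # 1/|N_l(u)| normalization
        return torch.relu(out)                             # sigma chosen as ReLU here

h = torch.randn(5, 16)                                     # h^(0) = node features x_u
edges = [(torch.tensor([0, 1]), torch.tensor([2, 2])),     # edge type 0
         (torch.tensor([3]), torch.tensor([4]))]           # edge type 1
print(RGCNLayer(16, 8, num_edge_types=2)(h, edges).shape)  # torch.Size([5, 8])
```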
HAN [82]. Instead of one-hop neighbors, HAN utilizes meta-paths to model higher-order proximity. Given a meta-path $\mathcal{M}$, the representation of node $u$ is aggregated from its meta-path based neighbors $N_{\mathcal{M}}(u) = \{u\} \cup \{v \mid v \text{ connects with } u \text{ via meta-path } \mathcal{M}\}$. HAN proposes an attention mechanism to learn the weights of different neighbors:

$$\alpha_{uv}^{\mathcal{M}} = \frac{\exp\left(\sigma(\mathbf{a}_{\mathcal{M}}^T [\mathbf{x}'_u \| \mathbf{x}'_v])\right)}{\sum_{v' \in N_{\mathcal{M}}(u)} \exp\left(\sigma(\mathbf{a}_{\mathcal{M}}^T [\mathbf{x}'_u \| \mathbf{x}'_{v'}])\right)}, \qquad \mathbf{h}_u^{\mathcal{M}} = \sigma\left( \sum_{v \in N_{\mathcal{M}}(u)} \alpha_{uv}^{\mathcal{M}} \mathbf{x}'_v \right),$$

where $\mathbf{a}_{\mathcal{M}}$ is the node-level attention vector of $\mathcal{M}$; $\mathbf{x}'_u = M_{\phi(u)} \mathbf{x}_u$ is the projected feature vector of node $u$; and $\|$ is the concatenation operator. Given the meta-path specific embeddings $\mathbf{h}_u^{\mathcal{M}}$, HAN uses semantic-level attention to weigh different meta-paths:

$$\beta_{\mathcal{M}} = \frac{\exp\left( \frac{1}{|V|} \sum_{v \in V} \mathbf{q}^T \tanh(W \mathbf{h}_v^{\mathcal{M}} + \mathbf{b}) \right)}{\sum_{\mathcal{M}'} \exp\left( \frac{1}{|V|} \sum_{v \in V} \mathbf{q}^T \tanh(W \mathbf{h}_v^{\mathcal{M}'} + \mathbf{b}) \right)}, \qquad \mathbf{e}_u = \sum_{\mathcal{M}} \beta_{\mathcal{M}} \mathbf{h}_u^{\mathcal{M}},$$

where $\mathbf{q}$ is the semantic-level attention vector.

Very recently, [54] proposed HDGI to improve unsupervised training based on HAN, and [33] proposed HGT to enable temporal encoding for dynamic networks and mini-batch sampling for scalable training. Since HAN is mainly designed for semi-supervised node classification, when we learn node embeddings in unsupervised settings (like many proximity-preserving methods), we adopt the widely used link prediction loss of R-GCN and GraphSAGE [28].

HetGNN [98]. HetGNN assumes that each node is associated with different features (e.g., categorical, textual, and visual). It first encodes the content of each node into a vector $\mathbf{h}_u^s = f_{ENC}(\mathbf{x}_u)$, and then adopts a node type-aware aggregation function to collect information from neighbors:

$$\mathbf{h}_u^o = f_{AGG}\left(\{\mathbf{h}_v^s : v \in N_{RWR}(u),\ \phi(v) = o\}\right),$$

where $N_{RWR}(u)$ is the neighborhood of $u$ defined through random walk with restart [74].
Then HetGNN uses attention over the neighborhood node types to get the final embedding, which shares the same spirit as HAN:

$$\alpha_u^o = \frac{\exp\left(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{h}_u^o \| \mathbf{h}_u^s])\right)}{\sum_{o' \in T_V \cup \{s\}} \exp\left(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{h}_u^{o'} \| \mathbf{h}_u^s])\right)}, \qquad \mathbf{e}_u = \sum_{o \in T_V \cup \{s\}} \alpha_u^o \mathbf{h}_u^o.$$

The objective of HetGNN is the same as that of PTE.

There are some other message-passing approaches. For example, HEP [109] aggregates $u$'s representation from $N_o(u)$ (i.e., the node type-aware neighborhood), with an application to e-commerce user alignment; GATNE [7] aggregates $u$'s representation from $N_l(u)$ (i.e., the edge type-aware neighborhood) and is applicable to both transductive and inductive network embedding settings.

The objectives of the message-passing approaches mentioned above can also be written as Eq. (4), where $s(u,v)$ is a function of $\mathbf{e}_u$ and $\mathbf{e}_v$. The only difference is that $\mathbf{e}_u$ here is aggregated from $\mathbf{x}_v$ using GNNs. Following the derivation for proximity-preserving approaches, if we still write $\mathcal{J}_R$ in Eq. (1) as the summation of $-\mathcal{J}_{R_0}$, $-\mathcal{J}_{R_1}$, and $\mathcal{J}_{R_2}$, we obtain exactly the same $\mathcal{J}_{R_1}$ and $\mathcal{J}_{R_2}$ as in Eq. (5). Within this group of algorithms, only HEP has an additional reconstruction loss $\mathcal{J}_{R_0} = -\sum_v \|\mathbf{e}_v - \mathbf{h}_v\|^2$, while all the other algorithms have $\mathcal{J}_{R_0} = 0$. We summarize the choices of $w_{uv}$, $d(\mathbf{e}_u, \mathbf{e}_v)$, and the aggregation functions in Table 2.
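The type-level attention used by HetGNN (and, in spirit, HAN's semantic-level attention) can be sketched as follows. This is our simplified illustration of the $\alpha_u^o$ softmax above, with a single attention vector and pre-computed per-type aggregated embeddings.

```python
import torch
import torch.nn.functional as F

def type_attention(h_types, h_self, a):
    """alpha_u^o = softmax_o( LeakyReLU(a^T [h_u^o || h_u^s]) );  e_u = sum_o alpha_u^o h_u^o.
    h_types: (T, d) per-node-type aggregated embeddings h_u^o;
    h_self: (d,) content embedding h_u^s; a: (2d,) attention vector."""
    T = h_types.size(0)
    concat = torch.cat([h_types, h_self.expand(T, -1)], dim=1)   # [h_u^o || h_u^s]
    scores = F.leaky_relu(concat @ a)                            # one score per type
    alpha = torch.softmax(scores, dim=0)
    return (alpha.unsqueeze(1) * h_types).sum(dim=0)             # e_u

d = 8
h_types = torch.randn(3, d)      # e.g., aggregated author/paper/venue neighborhoods
h_self = torch.randn(d)          # content encoding f_ENC(x_u)
a = torch.randn(2 * d)
print(type_attention(h_types, h_self, a).shape)   # torch.Size([8])
```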
Algorithm | $w_{uv}$ / $w_{uv}^l$ | $d(\mathbf{e}_u, \mathbf{e}_v)$ | Aggregation Function | Applications
R-GCN [57] | $1_{(u,v) \in E_l}$ | $\|\sqrt{A_l} (\mathbf{e}_u - \mathbf{e}_v)\|^2$ | $\mathbf{h}_u^{(k+1)} = \sigma(\sum_{l \in T_E} \sum_{v \in N_l(u)} \frac{1}{|N_l(u)|} W_l^{(k)} \mathbf{h}_v^{(k)} + W_0^{(k)} \mathbf{h}_u^{(k)})$, $\mathbf{h}_u^{(0)} = \mathbf{x}_u$, $\mathbf{e}_u = \mathbf{h}_u^{(K)}$ | entity classification and KB completion
HEP [109] | $1_{(u,v) \in E}$ | $\|\sqrt{A} (\mathbf{e}_u - \mathbf{e}_v)\|^2$ | $\mathbf{h}_u = \sigma(W_{\phi(u)} \sum_{o \in T_V} \sum_{v \in N_o(u)} \alpha_{uv} \mathbf{e}_v + \mathbf{b}_{\phi(u)})$ | user alignment
HAN [82] | $1_{(u,v) \in E}$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | $\mathbf{e}_u = \sum_{\mathcal{M}} \beta_{\mathcal{M}} \sigma(\sum_{v \in N_{\mathcal{M}}(u)} \alpha_{uv}^{\mathcal{M}} M_{\phi(u)} \mathbf{x}_v)$ | node classification, node clustering, link prediction and recommendation
HetGNN [98] | edge weight of $(u,v)$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ | $\mathbf{e}_u = \sum_{o \in T_V \cup \{s\}} \alpha_u^o f_{AGG}(\{f_{ENC}(\mathbf{x}_v) : v \in N_{RWR}(u), \phi(v) = o\})$ | node classification, node clustering, link prediction and recommendation
GATNE [7] | number of times that $u \in C(v)$ in random walks in $(V, E_l)$ | $\|\mathbf{e}_{u,l} - \mathbf{e}_v\|^2$ | $\mathbf{h}_{u,l}^{(0)} = \mathbf{x}_u$, $\mathbf{h}_{u,l}^{(k+1)} = f_{AGG}(\{\mathbf{h}_{v,l}^{(k)} : v \in N_l(u)\})$, $\mathbf{e}_{u,l} = \alpha_l W_l^T \mathbf{h}_{u,l}^{(K)} \mathbf{a}_{u,l} + \mathbf{b}_u$ | node classification, node clustering, link prediction and recommendation

Table 2: A summary of message-passing based HNE algorithms. (Additional notations: $N_l(u)$: neighbors of $u$ with edge type $l$; $N_o(u)$: neighbors of $u$ with node type $o$; $f_{AGG}$: a function to aggregate information from neighbors. For HAN, we assume it adopts the unsupervised loss of GraphSAGE [28]. For GATNE, we show the transductive version.)

3.3 Relation-Learning Methods

Each edge in a heterogeneous network can be viewed as a triplet $(u, l, v)$ composed of two nodes $u, v \in V$ and an edge type $l \in T_E$ (i.e., entities and relations, in the terminology of KGs). The goal of relation-learning methods is to learn a scoring function $s_l(u,v)$ which evaluates an arbitrary triplet and outputs a scalar measuring the acceptability of this triplet. This idea is widely adopted in KB embedding. Since there already exist surveys of KB embedding algorithms [81], we only cover the most popular approaches here and highlight their connections to HNE.

TransE [3]. TransE assumes that the relation induced by $l$-labeled edges corresponds to a translation of the embedding (i.e., $\mathbf{e}_u + \mathbf{e}_l \approx \mathbf{e}_v$) when $(u, l, v)$ holds. Therefore, the scoring function of TransE is defined as

$$s_l(u,v) = -\|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|_p,$$

where $p = 1$ or $2$. The objective is to minimize a margin-based ranking loss

$$\mathcal{J} = \sum_{(u,l,v) \in T} \sum_{(u',l,v') \in T'_{(u,l,v)}} \max\left(0,\ \gamma - s_l(u,v) + s_l(u',v')\right), \quad (6)$$

where $T$ is the set of positive triplets (i.e., edges), and $T'_{(u,l,v)}$ is the set of corrupted triplets, constructed by replacing either $u$ or $v$ with an arbitrary node. Formally,

$$T'_{(u,l,v)} = \{(u', l, v) \mid u' \in V\} \cup \{(u, l, v') \mid v' \in V\}.$$

TransE is the most representative model using a "translational distance" to define the scoring function. Its extensions include TransH [84], TransR [40], RHINE [41], etc.

DistMult [88]. In contrast to translational distance models [3, 84, 40], DistMult exploits a similarity-based scoring function. Each relation is represented by a diagonal matrix $A_l = \mathrm{diag}(e_{l1}, \ldots, e_{ld})$, and $s_l(u,v)$ is defined using a bilinear function:

$$s_l(u,v) = \mathbf{e}_u^T A_l \mathbf{e}_v.$$

Note that $s_l(u,v) = s_l(v,u)$ for any $u$ and $v$, so DistMult is mainly designed for symmetric relations. Besides DistMult, RESCAL [46] also uses a bilinear scoring function, but $A_l$ is no longer restricted to be diagonal.
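A minimal sketch of TransE's translational score and the margin-based ranking loss of Eq. (6). This is our illustration; real implementations also enforce the embedding-norm constraints listed as $\mathcal{J}_{R_0}$ in Table 3.

```python
import torch

def transe_score(e_u, e_l, e_v, p=2):
    """s_l(u, v) = -||e_u + e_l - e_v||_p  (higher score means a more plausible triplet)."""
    return -torch.norm(e_u + e_l - e_v, p=p, dim=-1)

def margin_ranking_loss(pos, neg, ent, rel, gamma=1.0):
    """Eq. (6): sum of max(0, gamma - s_l(u,v) + s_l(u',v')) over corrupted triplets.
    pos, neg: (m, 3) LongTensors of (head, relation, tail) index triplets."""
    s_pos = transe_score(ent[pos[:, 0]], rel[pos[:, 1]], ent[pos[:, 2]])
    s_neg = transe_score(ent[neg[:, 0]], rel[neg[:, 1]], ent[neg[:, 2]])
    return torch.clamp(gamma - s_pos + s_neg, min=0).sum()

ent = torch.randn(6, 8, requires_grad=True)   # entity embeddings
rel = torch.randn(2, 8, requires_grad=True)   # relation (edge type) embeddings
pos = torch.tensor([[0, 0, 1]])               # observed triplet (u, l, v)
neg = torch.tensor([[0, 0, 3]])               # corrupted tail v'
print(margin_ranking_loss(pos, neg, ent, rel).item())
```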
ConvE [15]. ConvE goes beyond simple distance or similarity functions and proposes deep neural models to score a triplet. The score is defined by a convolution over 2D-shaped embeddings. Formally,

$$s_l(u,v) = f\left(\mathrm{vec}\left(f([\mathbf{E}_u; \mathbf{E}_r] * \omega)\right) W\right) \mathbf{e}_v,$$

where $\mathbf{E}_u$ and $\mathbf{E}_r$ denote the 2D reshaping matrices of the node embedding and relation embedding, respectively, and "$*$" is the convolution operator.

Other models based on deep neural scoring functions include NTN [68], CACL [48], SACN [60], NKGE [79], etc.

All relation-learning based embedding algorithms mentioned above adopt a margin-based ranking loss with some regularization terms, which generalizes Eq. (6):

$$\mathcal{J} = \sum_{(u,l,v) \in T} \sum_{(u',l,v') \in T'_{(u,l,v)}} \max\left(0,\ \gamma - s_l(u,v) + s_l(u',v')\right) + \mathcal{J}_{R_0}. \quad (7)$$

In [51], Qiu et al. point out that the margin-based loss shares a very similar form with the following negative sampling loss:

$$-\sum_{(u,l,v) \in T} w_{uv}^l \left[ \log \sigma(s_l(u,v)) - b\, \mathbb{E}_{(u',l,v') \in T'_{(u,l,v)}} \log \sigma(s_l(u',v')) \right].$$

Following [51], if we use the negative sampling loss to rewrite Eq. (7), we are approximately maximizing

$$\mathcal{J} = \sum_{(u,l,v)} w_{uv}^l \log \frac{\exp(s_l(u,v))}{\sum_{(u',l,v')} \exp(s_l(u',v'))} + \mathcal{J}_{R_0}.$$

For translational distance models [3, 84, 40, 41] where $s_l(u,v)$ is described by a distance, this is equivalent to

$$\min -\mathcal{J} = \underbrace{\sum_{(u,l,v)} w_{uv}^l \|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|_{1 \text{ or } 2}}_{\sum w_{uv}^l\, d(\mathbf{e}_u, \mathbf{e}_v)} + \underbrace{\sum_{(u,l,v)} w_{uv}^l \log \sum_{(u',l,v')} \exp(s_l(u',v'))}_{\mathcal{J}_{R_1}} - \mathcal{J}_{R_0}. \quad (8)$$

In this case, we can write $\mathcal{J}_R$ as $-\mathcal{J}_{R_0} + \mathcal{J}_{R_1}$.

For neural triplet scorers [15, 68, 48, 60], the forms of $s_l(u,v)$ are more complicated than inner products or distances. In these cases, since distance and proximity can be viewed as reverse metrics, we define $d(\mathbf{e}_u, \mathbf{e}_v)$ as $C - s_l(u,v)$, where $C$ is a constant upper bound of $s_l(\cdot,\cdot)$. The derivation of the loss function is then similar to that of Eq. (8), and we again have $\mathcal{J}_R = -\mathcal{J}_{R_0} + \mathcal{J}_{R_1}$.

We summarize the choices of $w_{uv}^l$, $d(\mathbf{e}_u, \mathbf{e}_v)$, and $\mathcal{J}_{R_0}$ in Table 3. Note that for relation-learning methods, $d(\cdot,\cdot)$ may not be a distance metric. For example, $d(\mathbf{e}_u, \mathbf{e}_v) \neq d(\mathbf{e}_v, \mathbf{e}_u)$ in most translational distance models. This is intuitive because $(u,l,v)$ and $(v,l,u)$ often express different meanings.
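For completeness, here is a sketch of the negative sampling counterpart used in the rewriting of Eq. (7) above, paired with a DistMult-style bilinear score $s_l(u,v) = \mathbf{e}_u^T A_l \mathbf{e}_v$. This is our illustration following the general form in [51], not a reference implementation.

```python
import torch
import torch.nn.functional as F

def distmult_score(e_u, a_l, e_v):
    """s_l(u, v) = e_u^T diag(a_l) e_v  (bilinear; symmetric in u and v)."""
    return (e_u * a_l * e_v).sum(dim=-1)

def neg_sampling_triplet_loss(s_pos, s_neg):
    """-log sigma(s_l(u,v)) - mean_k log sigma(-s_l(u_k', v_k')) per positive triplet."""
    return -(F.logsigmoid(s_pos) + F.logsigmoid(-s_neg).mean(dim=-1)).mean()

ent = torch.randn(6, 8)                       # entity embeddings
rel = torch.randn(2, 8)                       # a_l = diagonal of A_l per relation
s_pos = distmult_score(ent[0], rel[0], ent[1])        # observed (u, l, v)
s_neg = distmult_score(ent[0], rel[0], ent[2:5])      # three corrupted tails
print(neg_sampling_triplet_loss(s_pos.unsqueeze(0), s_neg.unsqueeze(0)).item())
```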
Algorithm | $w_{uv}^l$ | $d(\mathbf{e}_u, \mathbf{e}_v)$ | $\mathcal{J}_{R_0}$ | Applications
TransE [3] | $1_{(u,v) \in E_l}$ | $\|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|_{1 \text{ or } 2}$ | $-\sum_l (\|\mathbf{e}_l\| - 1) - \sum_v (\|\mathbf{e}_v\| - 1)$ | KB completion and relation extraction from text
TransH [84] | $1_{(u,v) \in E_l}$ | $\|\mathbf{e}_{u,l} + \mathbf{e}_l - \mathbf{e}_{v,l}\|^2$, $\mathbf{e}_{v,l} = \mathbf{e}_v - \mathbf{w}_l^T \mathbf{e}_v \mathbf{w}_l$ | $-\sum_v [\|\mathbf{e}_v\|^2 - 1]_+ - \sum_l [(\mathbf{w}_l^T \mathbf{e}_l)^2 / \|\mathbf{e}_l\|^2 - \epsilon^2]_+$ | KB completion and relation extraction from text
TransR [40] | $1_{(u,v) \in E_l}$ | $\|\mathbf{e}_{u,l} + \mathbf{e}_l - \mathbf{e}_{v,l}\|^2$, $\mathbf{e}_{v,l} = A_l \mathbf{e}_v$ | $-\sum_v [\|\mathbf{e}_v\| - 1]_+ - \sum_l [\|\mathbf{e}_l\| - 1]_+ - \sum_v [\|A_l \mathbf{e}_v\| - 1]_+$ | KB completion and relation extraction from text
RHINE [41] | edge weight of $(u,v)$ with type $l$ | $\|\mathbf{e}_u - \mathbf{e}_v\|^2$ if $l$ models affiliation; $\|\mathbf{e}_u + \mathbf{e}_l - \mathbf{e}_v\|$ if $l$ models interaction | N/A | link prediction and node classification
DistMult [88] | $1_{(u,v) \in E_l}$ | $\|\sqrt{A_l} (\mathbf{e}_u - \mathbf{e}_v)\|^2$ | N/A | KB completion and rule extraction
NTN [68] | $1_{(u,v) \in E_l}$ | $C - \mathbf{e}_r^T \tanh(\mathbf{e}_u^T M_l \mathbf{e}_v + M_{l,1} \mathbf{e}_u + M_{l,2} \mathbf{e}_v + \mathbf{b}_l)$ | $-\|\Theta\|_2^2$ | triplet classification
ConvE [15] | $1_{(u,v) \in E_l}$ | $C - f(\mathrm{vec}(f([\mathbf{E}_u; \mathbf{E}_r] * \omega)) W) \mathbf{e}_v$ | N/A | KB completion
CACL [48] | $1_{(u,v) \in E_l}$ | $C - (\mathbf{e}_v^T \sigma(\mathbf{w}_u \odot \mathbf{e}_u + \mathbf{w}_r \odot \mathbf{e}_r + \mathbf{b}) + b_0)$ | $-\|\Theta\|_2^2$ | KB completion and triplet classification
SACN [60] | $1_{(u,v) \in E_l}$ | $C - f(\mathrm{vec}(M(\mathbf{e}_u, \mathbf{e}_l)) W) \mathbf{e}_v$ | N/A | KB completion
NKGE [79] | $1_{(u,v) \in E_l}$ | same as TransE or ConvE, where $\mathbf{e}_u = \sigma(\mathbf{g}_u) \odot \mathbf{e}_u^s + (1 - \sigma(\mathbf{g}_u)) \odot \mathbf{e}_u^n$ | $-\|\Theta\|_2^2$ when using the TransE objective | KB completion

Table 3: A summary of relation-learning based HNE algorithms. (Additional notations: $f$: a non-linear activation function; $[x]_+ = \max\{x, 0\}$; $\Theta$: the set of learned parameters; $\mathbf{E}_u, \mathbf{E}_r$: 2D reshaping matrices of $\mathbf{e}_u$ and $\mathbf{e}_r$ [15]; $M(\mathbf{e}_u, \mathbf{e}_l)$: a matrix aligning the output vectors from the convolution with all kernels [60].)

4. BENCHMARK

4.1 Dataset Preparation

Towards the off-the-shelf evaluation of HNE algorithms in a systematic and comprehensive manner, in this work we spare no effort to collect, process, analyze, and publish four real-world heterogeneous network datasets from different domains, which we aim to set up as a benchmark for existing and future HNE algorithms.

DBLP. We construct a network of authors, papers, venues, and phrases from DBLP. Phrases are extracted by the popular AutoPhrase [61] algorithm from paper texts and further filtered by human experts. We compute word2vec [44] on all paper texts and aggregate the word embeddings to get 300-dim paper and phrase features. Author and venue features are the aggregations of their corresponding paper features. We further manually label a relatively small portion of authors into 12 research groups from four research areas by crawling the web. Each labeled author has only one label.

Yelp. We construct a network of businesses, users, locations, and reviews from Yelp.4 Nodes are not associated with any features, but a large portion of businesses are labeled into sixteen categories. Each labeled business has one or multiple labels.

Freebase. We construct a network of books, films, music, sports, people, locations, organizations, and businesses from Freebase.5 Nodes are not associated with any features, but a large portion of books are labeled into eight genres of literature. Each labeled book has only one label.

PubMed. We construct a network of genes, diseases, chemicals, and species from PubMed.6 All nodes are extracted by AutoPhrase [61], typed by AutoNER [62], and further filtered by human experts. We compute word2vec [44] on all PubMed papers and aggregate the word embeddings to get 200-dim features for all types of nodes. We further label a relatively small portion of diseases into eight categories. Each labeled disease has only one label.

4 https://www.yelp.com/dataset/challenge
5 http://www.freebase.com/
6 https://www.ncbi.nlm.nih.gov/pubmed/
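The feature construction described above (averaging word2vec vectors of a node's associated text, then aggregating paper features for authors and venues) can be sketched as follows. This is illustrative only, with a hypothetical pre-trained `w2v` lookup standing in for the word2vec model actually trained on the corpus.

```python
import numpy as np

def text_feature(tokens, w2v, dim=300):
    """Aggregate word embeddings of a node's text into one feature vector by mean pooling."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def aggregate_features(member_ids, feat):
    """Author/venue features as the mean of their corresponding papers' features."""
    return np.mean([feat[i] for i in member_ids], axis=0)

# hypothetical 300-dim word vectors; in the benchmark these come from word2vec [44]
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=300) for w in ["graph", "embedding", "network"]}
paper_feat = {0: text_feature(["graph", "embedding"], w2v),
              1: text_feature(["network", "embedding"], w2v)}
author_feat = aggregate_features([0, 1], paper_feat)   # author who wrote papers 0 and 1
print(author_feat.shape)   # (300,)
```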
4.2 Structure Analysis

A summary of the statistics of the four datasets is provided in Table 4 and Figure 3. As can be observed, the datasets have different sizes (numbers of nodes and links) and heterogeneity (numbers and distributions of node/link types). Moreover, due to the nature of the data sources, the DBLP and PubMed networks are attributed, whereas the Yelp and Freebase networks are abundantly labeled. The combination of these four datasets thus allows researchers to flexibly start by testing an HNE algorithm in the most appropriate setting, and eventually complete an all-around evaluation over all settings.

We also provide a comprehensive analysis of several of the most widely studied properties of heterogeneous networks: degree distribution (Figure 4), clustering coefficient (Figure 5), and number of frequent meta-paths (Figure 6). In particular, the degree distribution is known to significantly influence the performance of HNE algorithms due to the widely used node sampling process, whereas the clustering coefficient impacts HNE algorithms that utilize latent community structures. Moreover, since many HNE algorithms rely on meta-paths, a skewed distribution of meta-paths can bias evaluation towards algorithms using fewer meta-paths.

As we can see, the properties we consider are rather different across the four datasets. For example, there are tighter links and more labels in Yelp, while there are more types of nodes and links in Freebase. Compared with nodes in Freebase and PubMed, which clearly follow long-tail degree distributions, certain types of nodes in DBLP and Yelp are always well connected (e.g., phrases in DBLP and businesses in Yelp), forming more star-shaped subgraphs. The type-wise clustering coefficients and meta-path distributions are the most skewed in DBLP and the most balanced in PubMed. Together, the four datasets provide a comprehensive benchmark for the robustness and generalizability of various HNE algorithms (as we will also see in Section 5.2).

4.3 Settings, Tasks, and Metrics

We mainly compare all ten algorithms under the setting of unsupervised unattributed HNE over all datasets, where the essential goal is to preserve different types of edges in the heterogeneous networks. Moreover, for message-passing algorithms that are particularly designed for attributed and