Introducing PyTorch BigGraph
Graphs are one of the fundamental data structures in machine
learning applications. Specifically, graph-embedding methods are a
form of unsupervised learning, in that they learn representations of
nodes using the native graph structure. Training data in mainstream
scenarios such as social media predictions, Internet of Things (IoT)
pattern detection or drug-sequence modeling is naturally
represented using graph structures. Any one of those scenarios can
easily produce graphs with billions of interconnected nodes. While
the richness and intrinsic navigation capabilities of graph structures
are a great playground for machine learning models, their complexity
poses massive scalability challenges. Not surprisingly, the support
for large-scale graph data structures in modern deep learning
frameworks is still quite limited. Recently, Facebook unveiled
PyTorch BigGraph, a new framework that makes it much faster and
easier to produce graph embeddings for extremely large graphs in
PyTorch models.
Facebook’s New Framework for Processing Large Graphs
Jesus Rodriguez
To some extent, graph structures can be seen as an alternative to
labeled training datasets, as the connections between the nodes can be
used to infer specific relationships. This is the approach followed by
unsupervised graph embedding methods which learn a vector
representation of each node in a graph by optimizing the objective
that the embeddings for pairs of nodes with edges between them are
closer together than pairs of nodes without a shared edge. This is
similar to how word embeddings like word2vec are trained on text.
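To make that objective concrete, here is a minimal sketch in plain PyTorch (not PBG’s actual API) of the idea: node embeddings are trained so that pairs connected by an edge score higher than randomly paired nodes. All sizes and tensors below are hypothetical.

```python
import torch
import torch.nn as nn

# Minimal sketch: one embedding per node, trained so that connected pairs
# score higher than random, unconnected pairs (hypothetical sizes).
num_nodes, dim = 1000, 100
emb = nn.Embedding(num_nodes, dim)

def score(src, dst):
    # Dot-product similarity between source and destination embeddings.
    return (emb(src) * emb(dst)).sum(dim=-1)

# pos: node pairs joined by an edge; neg: random pairs without a shared edge.
pos_src, pos_dst = torch.randint(0, num_nodes, (2, 128))
neg_src, neg_dst = torch.randint(0, num_nodes, (2, 128))

# A margin ranking loss pushes edge pairs closer together than non-edge pairs.
loss = torch.relu(1.0 - score(pos_src, pos_dst) + score(neg_src, neg_dst)).mean()
loss.backward()
```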
Most graph embedding methods are quite constrained when
applied to large graph structures. To give an example, a model with
two billion nodes and 100 embedding parameters per node
(expressed as floats) would require 800GB of memory just to store its
parameters; thus, many standard methods exceed the memory
capacity of typical commodity servers. This represents a major
challenge for deep learning models and is the genesis of Facebook’s
BigGraph framework.
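The 800GB figure follows directly from the arithmetic; a quick back-of-the-envelope check:

```python
# Two billion nodes, 100 float32 parameters each.
nodes = 2_000_000_000
dim = 100
bytes_per_float = 4

total_gb = nodes * dim * bytes_per_float / 1e9
print(total_gb)  # 800.0 GB just for the embedding table
```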
PyTorch BigGraph
The goal of PyTorch BigGraph (PBG) is to enable graph embedding
models to scale to graphs with billions of nodes and trillions of edges.
PBG achieves that by enabling four fundamental building blocks:
• graph partitioning, so that the model does not have to be fully loaded into memory
• multithreaded computation on each machine
• distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
• batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge
PBG addresses some of the shortcomings of traditional graph
embedding methods by randomly dividing the nodes of the graph into
P partitions, sized so that two partitions can fit in memory at the same
time. The graph edges are then divided into P² buckets based on their
source and destination nodes: for example, an edge with its source in
partition p1 and its destination in partition p2 is placed into bucket
(p1, p2). Once the nodes and edges are partitioned, training can be
performed on one bucket at a time. Training bucket (p1, p2) only
requires the embeddings for partitions p1 and p2 to be stored in
memory, and the order in which PBG traverses the buckets guarantees
that each bucket has at least one previously trained embedding
partition.
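The bucketing logic itself is simple to picture. Below is an illustrative sketch of the scheme described above; the partition assignment and helper names are hypothetical, not PBG’s actual interfaces.

```python
import random

P = 4  # number of node partitions (hypothetical)

def node_partition(node_id: int) -> int:
    # PBG assigns nodes to partitions randomly; a hash is enough to illustrate.
    return hash(node_id) % P

def edge_bucket(src: int, dst: int) -> tuple:
    # An edge lands in bucket (p1, p2) based on its endpoints' partitions.
    return (node_partition(src), node_partition(dst))

# Group a toy edge list into the P * P buckets; training then visits one
# bucket at a time, holding only two partitions' embeddings in memory.
edges = [(random.randrange(10_000), random.randrange(10_000)) for _ in range(100)]
buckets = {}
for src, dst in edges:
    buckets.setdefault(edge_bucket(src, dst), []).append((src, dst))
```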
Another area in which PBG really innovates is the parallelization and
distribution of the training mechanics. PBG uses PyTorch
parallelization primitives to implement a distributed training model
that leverages the block partition structure illustrated previously. In
this model, individual machines coordinate to train on disjoint
buckets using a lock server which parcels out buckets to the workers
in order to minimize communication between the different machines.
Each machine can train the model in parallel using different buckets.
For example, the trainer module on machine 2 requests a
bucket from the lock server on machine 1, which locks that bucket’s
partitions. The trainer then saves any partitions that it is no longer
using to the sharded partition servers and loads the new partitions it
needs from them, at which point it can release its old partitions on the
lock server. Edges are then loaded from a shared filesystem, and
training occurs on multiple threads without inter-thread
synchronization. In a separate thread, a small number of shared
parameters are continuously synchronized with a sharded parameter
server. Model checkpoints are occasionally written to the shared
filesystem from the trainers. This model allows a set of P buckets to
be parallelized using up to P/2 machines.
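The worker-side flow can be sketched roughly as follows. This is only an illustration of the steps described above; lock_server, partition_server and train_bucket are hypothetical placeholders, not PBG’s real APIs.

```python
def worker_loop(lock_server, partition_server, train_bucket, held_partitions):
    """Hypothetical sketch of one trainer's loop; all collaborators are stubs."""
    while True:
        # Ask the lock server for a bucket whose partitions are not in use.
        bucket = lock_server.acquire_bucket()
        if bucket is None:
            break  # no buckets left in this epoch
        p1, p2 = bucket

        # Save partitions we no longer need and fetch the ones we do.
        for p in list(held_partitions):
            if p not in (p1, p2):
                partition_server.save(p, held_partitions.pop(p))
        for p in (p1, p2):
            if p not in held_partitions:
                held_partitions[p] = partition_server.load(p)

        # Train on this bucket's edges (multi-threaded in the real system),
        # then tell the lock server its partitions are free again.
        train_bucket(bucket, held_partitions)
        lock_server.release_bucket(bucket)
```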
One of the indirect innovations of PBG is the use of batched negative
sampling. Traditional graph embedding models construct random
“false” edges as negative training examples along with the true
positive edges. This significantly speeds up training because only a
small percentage of weights must be updated with each new sample.
However, the negative samples, which are produced by “corrupting”
true edges with random source or destination nodes, end up
introducing a performance overhead in the processing of the graph.
PBG introduces a method that reuses a single batch of N random
nodes to produce corrupted negative samples for N training edges. In
comparison to other embedding methods, this technique allows
training on many negative examples per true edge at little
computational cost.
To increase memory efficiency and make better use of computational
resources on large graphs, PBG leverages a single batch of Bn sampled
source or destination nodes to construct multiple negative examples.
In a typical setup, PBG takes a batch of B = 1000 positive edges from
the training set and breaks it into chunks of 50 edges. The destination
(equivalently, source) embeddings from each chunk are concatenated
with 50 embeddings sampled uniformly from the tail entity type. The
outer product of the 50 positives with the 200 sampled nodes equates
to 9,900 negative examples.
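As a rough illustration of why batching helps, a single matrix multiplication can score a whole chunk of positive edges against one reused set of sampled nodes. The sketch below assumes dot-product scoring and corrupts only the destination side; the tensors are random stand-ins for real embeddings.

```python
import torch

dim = 100
chunk = 50                              # positive edges per chunk
src = torch.randn(chunk, dim)           # source embeddings for the chunk
dst = torch.randn(chunk, dim)           # true destination embeddings
sampled = torch.randn(chunk, dim)       # 50 sampled nodes, reused for the whole chunk

pos_scores = (src * dst).sum(dim=-1)    # one score per true edge
# One matrix product scores every source against every sampled node, giving
# 50 x 50 corrupted pairs without sampling fresh negatives per individual edge.
neg_scores = src @ sampled.t()
```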
The batched negative sampling approach has a direct impact on
training speed. Without batching, the speed of training is inversely
proportional to the number of negative samples; batched training
changes that equation, achieving a roughly constant training speed
regardless of the number of negatives.
Facebook evaluated PBG using different graph datasets such as
LiveJournal, Twitter data and YouTube user interaction data.
Additionally, PBG was benchmarked using the Freebase knowledge
graph, which contains more than 120 million nodes and 2.7 billion
edges as well as a smaller subset of the Freebase graph, known as
FB15k, which contains 15,000 nodes and 600,000 edges and is
commonly used as a benchmark for multi-relation embedding
methods. The FB15k experiments showed PBG performing comparably
to state-of-the-art graph embedding models. However, when
evaluated against the full Freebase dataset, PBG showed memory
consumption improvements of over 88%.
PBG is one of the first methods that can scale the training and
processing of graph data to structures with billions of nodes and
trillions of edges. The first implementation of PBG has been open
sourced on GitHub, and we should expect interesting contributions in
the near future.