Facebook AI Research is open-sourcing PyTorch-BigGraph, a distributed system that can learn embeddings for graphs with billions of nodes.
A graph is a data structure that represents relationships (or edges) between entities (or nodes). For example, Facebook's social graph represents the friendships between real people. One challenge in using a graph as input to machine learning is scale: a dense representation such as an adjacency matrix grows with the square of the number of entities, so feeding it directly into a model can make that model extremely large. Similar problems occur in natural language processing (NLP), and one common solution is to use an embedding as the first stage of the model.
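As a back-of-the-envelope illustration (the numbers below are hypothetical, not taken from the paper), compare the storage needed for a dense adjacency matrix with that of an embedding table for a graph of 50 million nodes:

```python
# Hypothetical sizing, for intuition only: 50 million nodes, float32 values.
num_nodes = 50_000_000
embedding_dim = 200  # the dimension used by the released Wikidata model

adjacency_floats = num_nodes ** 2             # one entry per pair of nodes
embedding_floats = num_nodes * embedding_dim  # one short vector per node

print(f"dense adjacency matrix: {adjacency_floats * 4 / 1e12:,.0f} TB")
print(f"embedding table:        {embedding_floats * 4 / 1e9:,.0f} GB")
```

The adjacency matrix would need on the order of 10,000 TB, while the embedding table fits in roughly 40 GB.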
An embedding is a mapping from a high-dimensional space to a lower-dimensional one that preserves some structure of the original space. NLP tasks often use a word embedding such as word2vec, in which words with similar meanings are mapped to points closer to each other than to words with different meanings. Similarly, in a graph embedding, nodes that share an edge have coordinates closer to each other than to nodes with no shared edge. Once an embedding is created, machine learning tasks can use it to transform input data into a more compact form, making the subsequent model much simpler. Graph embeddings often map from an original space with millions of dimensions into one with fewer than one thousand.
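A toy sketch makes the "nearby means related" property concrete. The vectors below are chosen by hand purely for illustration; a real embedding would be learned from data:

```python
import numpy as np

# Hand-picked toy vectors: "alice" and "bob" share an edge, "eve" does not.
embeddings = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob":   np.array([0.8, 0.2, 0.1]),
    "eve":   np.array([-0.7, 0.6, 0.4]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, -1.0 opposite."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["alice"], embeddings["bob"]))  # high: linked nodes
print(cosine(embeddings["alice"], embeddings["eve"]))  # low: unlinked nodes
```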
However, embeddings themselves are produced by a deep neural network (DNN) trained with unsupervised learning techniques, and the sheer size of the input data makes that task difficult: training takes a long time, and the DNN has so many parameters that they may not fit in the memory of the machine performing the training.
To overcome the latter problem, PyTorch-BigGraph (PBG) divides the nodes of the graph into multiple partitions, sized so that any two partitions can fit in a machine's memory together. The edges are then grouped into "buckets," with each bucket containing the edges that connect nodes in one partition to nodes in another. Training is performed on one bucket at a time, and can run on a single machine or across multiple machines using distributed training to reduce training time.
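The following Python sketch shows the partitioning idea in miniature. It is an illustration of the concept, not PBG's actual code: the hash-based partition function and the sample edges are made up.

```python
from collections import defaultdict

P = 4  # number of node partitions (configurable in PBG)

def partition(node):
    """Assign a node to a partition; a simple hash stands in here for
    whatever assignment scheme the real system uses."""
    return hash(node) % P

# Each edge goes into the bucket indexed by its endpoints' partitions.
buckets = defaultdict(list)
edges = [("a", "b"), ("a", "c"), ("d", "e")]  # made-up sample edges
for src, dst in edges:
    buckets[(partition(src), partition(dst))].append((src, dst))

# Train one bucket at a time: only partitions i and j must be in memory.
for (i, j), bucket_edges in sorted(buckets.items()):
    print(f"bucket ({i}, {j}): {bucket_edges}")
```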
In a paper presented at the recent SysML Conference, experimental results on publicly available social graph data sets show that "PBG outperforms competing methods," such as DeepWalk and MILE.
In addition to releasing the source code, Facebook has published a pre-trained model containing embeddings of the full Wikidata graph. This model maps 78 million entities into a 200-dimensional vector space. Facebook hopes that their work "encourages practitioners to release and experiment with even larger data sets."
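For readers who want to experiment with the released embeddings, a minimal loader might look like the sketch below. It assumes the embeddings are distributed as a tab-separated text file with an entity identifier followed by its 200 vector components on each line; the file name is a placeholder, so check Facebook's release notes for the actual format.

```python
import numpy as np

def load_embeddings(path, limit=None):
    """Read entity vectors from a tab-separated file (one entity per line).
    limit caps the number of rows read, since the full file is very large."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            fields = line.rstrip("\n").split("\t")
            vectors[fields[0]] = np.asarray(fields[1:], dtype=np.float32)
    return vectors

# Placeholder file name; the real download is several gigabytes.
# embeddings = load_embeddings("wikidata_embeddings.tsv", limit=100_000)
```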