Researchers at Google have developed a new deep-learning model called BigBird that allows Transformer neural networks to process sequences up to 8x longer than previously possible. Networks based on this model achieved new state-of-the-art performance levels on natural-language processing (NLP) and genomics tasks.
The team described the model and a set of experiments in a paper published on arXiv. BigBird is a new self-attention scheme that reduces the computational and memory complexity of Transformers, allowing for training and inference on longer input sequences. By increasing sequence length up to 8x, the team was able to achieve new state-of-the-art performance on several NLP tasks, including question-answering and document summarization. The team also used BigBird to develop a new application for Transformer models in genomic sequence representations, improving accuracy over previous models by 5 percentage points.
The Transformer has become the neural-network architecture of choice for sequence learning, especially in the NLP domain. It has several advantages over recurrent neural-network (RNN) architectures; in particular, the self-attention mechanism that allows the network to "remember" previous items in the sequence can be executed in parallel on the entire sequence, which speeds up training and inference. However, since self-attention can link (or "attend") each item in the sequence to every other item, the computational and memory complexity of self-attention is O(n^2), where n is the sequence length. This puts a practical limit of around 512 items on the sequence length that can be handled by current hardware.
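To see where the quadratic cost comes from, the sketch below computes dense self-attention with NumPy. The n-by-n score matrix is what makes both compute and memory grow as O(n^2). This is a generic illustration of full attention, not code from the paper, and the shapes used are arbitrary.

```python
import numpy as np

def full_attention(Q, K, V):
    """Dense (full) self-attention: every token attends to every other token.

    Q, K, V: arrays of shape (n, d). The score matrix below has shape (n, n),
    so both compute and memory grow quadratically with sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n, n) -- the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # (n, d)

n, d = 512, 64                                # 512 is the typical BERT-era limit
Q = K = V = np.random.randn(n, d)
out = full_attention(Q, K, V)
print(out.shape)                              # (512, 64); the score matrix alone is 512 x 512
```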
BigBird is a new self-attention scheme that has complexity of O(n), which allows for sequence lengths of up to 4,096 items. Instead of each item attending to every other item, BigBird combines three smaller attention mechanisms. First is random attention, which links each item with a small constant number of other items, chosen randomly. Next, window attention links each item with a constant number of items that precede and succeed it in the sequence. Finally, global attention links items at certain sequence locations with every other item.
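The sketch below builds an illustrative token-level attention mask that combines the three mechanisms. It is not the paper's implementation (which operates on blocks of tokens and uses different parameter settings); the parameter values here are placeholders. It does show why each token attends to only a constant number of positions, so the total number of attended pairs grows roughly linearly in n rather than quadratically.

```python
import numpy as np

def bigbird_style_mask(n, num_random=3, window=3, num_global=2, seed=0):
    """Illustrative sparse-attention mask in the spirit of BigBird.

    mask[i, j] == True means token i may attend to token j. This is a
    simplified token-level sketch; the parameter values are arbitrary,
    not the settings used in the paper.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # Window attention: each token attends to `window` neighbors on each side.
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # Random attention: each token attends to a few randomly chosen tokens.
    for i in range(n):
        mask[i, rng.choice(n, size=num_random, replace=False)] = True

    # Global attention: a few designated tokens (e.g. a [CLS]-like position)
    # attend to everything, and every token attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

m = bigbird_style_mask(n=64)
print(m.sum(), "attended pairs out of", m.size)   # far fewer than n*n for large n
```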
For their NLP experiments, the team used a BERT-based model architecture, with the attention mechanism replaced with BigBird, and compared their model's performance with RoBERTa and with Longformer, another recent attention model that also has O(n) complexity. The BigBird model outperformed both other models on four question-answering datasets: Natural Questions, HotpotQA-distractor, TriviaQA-wiki, and WikiHop. BigBird was also compared to RoBERTa on several document classification datasets; BigBird not only outperformed RoBERTa, but also set a new state-of-the-art score on the Arxiv dataset, with an F1 score of 92.31% compared to the previous record of 87.96%. Besides NLP tasks, the team also showed that BigBird's longer sequence capabilities could be used to build models for genomics applications. BigBird outperformed several baseline models on two genomics classification tasks: promoter region prediction and chromatin-profile prediction. BigBird achieved a 99.9% accuracy on the former task, an improvement of 5 percentage points over the previous best model.
One of BigBird's co-creators, Philip Pham, joined a Hacker News discussion about the paper. He noted that although the experiments in the paper used a sequence length of 4,096, the model could handle much longer sequences of up to 16k tokens. When asked to compare BigBird to GPT-3, Pham replied:
We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context.
Google has not released the source code for the models used in the paper. The original BERT code is available on GitHub, as is the code for RoBERTa and Longformer.