Facebook AI Research (FAIR) open-sourced Expire-Span, a deep-learning technique that learns which items in an input sequence should be remembered, reducing the memory and computation requirements for AI. FAIR showed that Transformer models that incorporate Expire-Span can scale to sequences of tens of thousands of items with improved performance compared to previous models.
The research team described the technique and several experiments in a paper to be presented at the upcoming International Conference on Machine Learning (ICML). Expire-Span allows sequential AI models to "forget" events that are no longer relevant. When incorporated into self-attention models, such as the Transformer, Expire-Span reduces the amount of memory needed, allowing the model to handle longer sequences, which is key to improved performance on many tasks in domains such as natural language processing (NLP). Using Expire-Span, the team trained models that could handle sequences of up to 128k items, an order of magnitude longer than previous models could process, with improved accuracy and efficiency compared to baselines. Research scientists and paper co-authors Angela Fan and Sainbayar Sukhbaatar wrote on FAIR's blog:
As a next step in our research toward more humanlike AI systems, we’re studying how to incorporate different types of memories into neural networks. So, in the long term, we can bring AI even closer to humanlike memory with capabilities of learning much faster than current systems. We believe Expire-Span is an important, exciting advancement toward such futuristic AI-powered innovations.
Several common AI applications, such as image captioning or language translation, can be modeled as sequence learning; that is, predicting the next item in a sequence of data. The Transformer neural-network architecture is a common choice for sequence learning, especially in the NLP domain; for example, the "T" in OpenAI's GPT-3 stands for "Transformer." A Transformer has a self-attention mechanism that allows the network to "remember" previous items in the sequence; however, since self-attention can link each item in the sequence to every other item, the computational and memory complexity of self-attention is \(O(n^2)\), where n is the sequence length. This puts a practical limit on sequence length of around 1,024 items, due to the memory constraints of GPUs.
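To make the quadratic cost concrete, here is a minimal NumPy sketch of naive scaled dot-product self-attention (illustrative only, not FAIR's or any library's implementation; the function and variable names are assumptions). The full n × n score matrix it builds is the source of the \(O(n^2)\) memory and compute.

```python
import numpy as np

def self_attention(Q, K, V):
    """Naive scaled dot-product self-attention over a sequence of n items.

    Q, K, V: arrays of shape (n, d). The score matrix below is (n, n),
    which is why memory and compute grow as O(n^2) in the sequence length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # (n, d)

# At n = 1,024 in float32, each (n, n) score matrix is already ~4 MB,
# multiplied across attention heads, layers, and batch entries.
n, d = 1024, 64
Q = K = V = np.random.randn(n, d).astype(np.float32)
out = self_attention(Q, K, V)
```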
Several researchers have proposed modifications to the attention mechanism to increase the maximum sequence length. In 2019, OpenAI introduced Sparse Transformers, which reduced the attention complexity to \(O(n \sqrt{n})\). Last year, Google open-sourced Performer, which reduced complexity even further to \(O(n)\). Other techniques include Compressive Transformer, developed by Google's DeepMind subsidiary in 2019, and Adaptive Span, also published in 2019 by a FAIR team led by Expire-Span's Sukhbaatar.
A Transformer maintains a sequence of hidden states or "memories," and the output of the model at each time step is computed from a combination of these memories. Expire-Span works by computing a time-to-live (TTL) for each memory. The training loss is updated to penalize longer TTLs, which pushes the model to retain only relevant memories. To prevent overfitting on longer sequences, the memory is randomly shortened during training.
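The sketch below illustrates this idea in NumPy; it is a rough reconstruction rather than FAIR's implementation, and the projection vector w, the maximum span, and the ramp width are assumed hyperparameters. Each memory's hidden state is mapped to a predicted span, a soft mask decays to zero once the memory's age exceeds that span, and a penalty on the summed spans stands in for the TTL term added to the training loss.

```python
import numpy as np

def expire_span_mask(h, t, max_span=1024.0, ramp=128.0, w=None, b=0.0):
    """Illustrative Expire-Span-style soft expiration (not FAIR's code).

    h : (n, d) array of hidden states ("memories") for positions 0..n-1
    t : current time step
    Each memory i gets a predicted time-to-live e_i; its contribution is
    softly masked out once its age (t - i) exceeds e_i.
    """
    n, d = h.shape
    if w is None:                        # stand-in for a learned projection
        w = np.random.randn(d) / np.sqrt(d)
    # Predicted span per memory, squashed into (0, max_span)
    e = max_span / (1.0 + np.exp(-(h @ w + b)))      # shape (n,)
    age = t - np.arange(n)                           # shape (n,)
    # Soft mask: ~1 while age < e_i, then decays linearly to 0 over `ramp` steps
    mask = np.clip(1.0 + (e - age) / ramp, 0.0, 1.0)
    # Training would add a penalty proportional to the predicted spans,
    # pushing the model to keep only memories that are actually useful.
    span_penalty = e.sum() / max_span
    return mask, span_penalty

# Example: 2k memories with 64-dim hidden states, evaluated at step t = 2048
h = np.random.randn(2048, 64)
mask, penalty = expire_span_mask(h, t=2048)
```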
To evaluate the performance of Expire-Span, the team chose three baseline Transformer models (Transformer-XL, Compressive Transformer, and Adaptive Span) and compared model accuracy as well as GPU memory usage and training speed. The models were used for several reinforcement-learning (RL) and NLP tasks. Expire-Span outperformed the baselines on most experiments; for example, on a sequence-copy task, Expire-Span scaled to a 128k sequence length and achieved 52.1% accuracy, compared to Transformer-XL's 26.7% accuracy on a 2k sequence length.
The Expire-Span code is available on GitHub.