Google Research has open-sourced ByT5, a natural language processing (NLP) AI model that operates on raw bytes instead of abstract tokens. Compared to baseline models, ByT5 is more accurate on several benchmark tasks and is more robust to misspellings and noise.
The system and several experiments were described in a paper published on arXiv. ByT5 is based on the multilingual T5 model (mT5), but instead of mapping words to token IDs in a dictionary, it operates directly on the byte-level representation of text. This eliminates several problems with token-based models, which cannot handle words that are not in the original training data. Although byte-level processing results in slower inference times compared to the baseline, ByT5 is much more robust to noise than mT5: on text with random casing changes, its accuracy degrades by only 1%, compared to 25% for mT5. ByT5 also outperforms mT5 on generative tasks as well as on many tasks in the XTREME multilingual NLP benchmark.
Deep-learning models for NLP tasks originally represented text strings as a sequence of word tokens, each of which is given a unique numeric identifier (ID)---usually an index into a fixed list of vocabulary words. While the token IDs are a compact and efficient representation, these models cannot process words that are not in the vocabulary; this includes words not seen in the training data as well as misspellings of known words. All out-of-vocabulary words are mapped to a single "unknown" token, which makes the model unable to distinguish among any of these words.
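To make the limitation concrete, the toy example below (with an invented vocabulary) shows how a fixed word-level vocabulary collapses every out-of-vocabulary word, including simple misspellings, to the same "unknown" ID:

```python
# Toy word-level tokenizer: every word maps to an ID in a fixed vocabulary,
# and anything unseen collapses to a single "unknown" ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def encode(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(encode("the cat sat on the mat"))    # [1, 2, 3, 4, 1, 5]
print(encode("the caat sat on the matt"))  # misspellings become [1, 0, 3, 4, 1, 0]
```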
More recently, models have attempted to address this by using sub-word tokenizers, such as SentencePiece, which split words into smaller units. While this can handle unseen words by decomposing them into known sub-units, these models are still affected by misspellings and morphological changes (e.g., "run" vs. "ran"). They also cannot handle unseen characters, such as those from other languages. This has led researchers to explore token-free models, such as Google's CANINE model, a version of BERT that operates directly on character sequences rather than tokens. One downside of these token-free models is that the sequences fed into the later layers of the model are much longer than in tokenized ones, resulting in larger and slower models, since model size and computation cost both increase with the maximum supported sequence length.
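The sketch below illustrates the sub-word idea with a toy greedy longest-match splitter; it is not the actual SentencePiece algorithm, only an illustration of how unseen words decompose into known sub-units while typos and unseen characters still cause trouble:

```python
# Toy greedy longest-match sub-word splitter (not the real SentencePiece
# algorithm), using a small invented sub-word inventory.
subwords = {"run", "ning", "ran", "r", "a", "n", "i", "g", "u"}

def split_subwords(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append("<unk>")          # unseen character, e.g. another script
            i += 1
    return pieces

print(split_subwords("running"))   # ['run', 'ning'] -- decomposed into known units
print(split_subwords("runnning"))  # ['run', 'n', 'ning'] -- a typo changes the pieces
print(split_subwords("跑"))         # ['<unk>'] -- unseen characters still fail
```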
To implement ByT5, the researchers modified the mT5 architecture in three ways. First, instead of using SentencePiece to tokenize the input text, the UTF-8 bytes are fed into the model directly; two special tokens are reserved for padding and end-of-sentence, and any byte sequences in the model's output that are not valid UTF-8 are dropped when decoding. Next, the pre-training span-corruption task was modified to mask spans averaging 20 bytes, compared with the 3-token average spans masked by mT5. Finally, instead of using the same number of Transformer blocks for the encoder and decoder, ByT5's encoder is 3x as deep as the decoder.
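A minimal sketch of this byte-level encoding follows; the two-ID offset for the padding and end-of-sentence tokens is an assumption for illustration, not necessarily the exact vocabulary layout of the released code:

```python
# Sketch of byte-level encoding: UTF-8 bytes are used directly as input IDs,
# shifted past a few reserved special-token IDs. The offset of 2
# (0 = padding, 1 = end-of-sentence) is an assumption for illustration.
PAD_ID, EOS_ID, OFFSET = 0, 1, 2

def byte_encode(text):
    ids = [b + OFFSET for b in text.encode("utf-8")]
    return ids + [EOS_ID]

def byte_decode(ids):
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    # Invalid UTF-8 sequences produced by the model are simply dropped.
    return raw.decode("utf-8", errors="ignore")

ids = byte_encode("naïve")   # 'ï' becomes two UTF-8 bytes, so 6 bytes plus EOS
print(ids)                   # [112, 99, 197, 177, 120, 103, 1]
print(byte_decode(ids))      # 'naïve'
```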
The team conducted several experiments to compare the performance of ByT5 to mT5. Each architecture was used to create five models of varying size---denoted Small, Base, Large, XL, and XXL---and evaluated on several benchmarks. On the GLUE and SuperGLUE language-understanding benchmarks, ByT5 was more accurate than mT5 at the Small and Base sizes, but mT5 was better at the larger sizes. On the generative benchmarks XSum and TweetQA, ByT5 outperformed mT5 at all model sizes. On the XTREME in-language tasks, where models are fine-tuned on "gold" data in all target languages, ByT5 outperformed mT5 across all tasks and model sizes, although results were mixed on the other XTREME tasks. However, ByT5 was slower, taking 33% more wall-clock time to complete pre-training, with inference up to 2x slower for classification and 7x slower for generation.
In a discussion on Twitter, a user asked how this new model compared to "old school" character-level models. Paper co-author Noah Constant replied:
Conceptually very similar. Some differences are (1) using bytes vs. characters, (2) using encoder-decoder vs. decoder-only, (3) using span-corruption to learn bidirectional effects vs. pure LM objective, (4) evaluating on transfer learning tasks rather than just LM perplexity.
The ByT5 code and pre-trained model files are available on GitHub.
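For readers who want to experiment, the checkpoints can also be loaded through the Hugging Face Transformers library; the "google/byt5-small" model name below is an assumption about how the released checkpoints are published there:

```python
# Hedged example: loading a ByT5 checkpoint via Hugging Face Transformers,
# assuming the released checkpoints are mirrored there as "google/byt5-small".
# The pre-trained checkpoint is not fine-tuned, so this only demonstrates that
# inputs are plain UTF-8 bytes rather than sub-word tokens.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

inputs = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
print(inputs["input_ids"])   # one ID per UTF-8 byte, plus an end-of-sentence ID

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```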