Meta Open-Sources Byte Latent Transformer LLM with Improved Scalability

Meta open-sourced the Byte Latent Transformer (BLT), an LLM architecture that replaces the tokenizer with a learned, dynamic scheme for grouping raw bytes into patches. This allows BLT models to match the performance of Llama 3 models while using up to 50% fewer inference FLOPS.

Most LLMs map text bytes onto a fixed set of tokens, which has several drawbacks, including the famous strawberry problem. By contrast, BLT dynamically groups bytes into patches. It uses a small language model to compute the entropy of the next byte in a sequence and starts a new patch when the entropy increases; essentially, the small model predicts the end of a word, a relatively easy task compared to generating the next word in a sequence (a simplified sketch of this patching scheme follows the quote below). Because BLT works directly on bytes, it is more robust to noisy inputs such as text with spelling mistakes. Increasing the average patch size reduces the FLOPS needed for inference, leaving room for a larger, better-performing model within the same compute budget. According to Meta,

BLT unlocks a new dimension for scaling, allowing simultaneous increases in model and patch size within a fixed inference budget. This new paradigm becomes advantageous for compute regimes commonly encountered in practical settings. While directly engaging with raw byte data, BLT also improves the model’s ability to handle the long-tail of data, offering significant improvements in robustness to noisy inputs and a deeper understanding of sub-word structures. Overall, these results position BLT as a promising alternative to traditional tokenization-based approaches, providing a scalable and robust framework for more efficient and adaptable language models.
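The patching idea can be illustrated with a short Python sketch. This is a simplified, hypothetical version, not Meta's implementation: the helper names, the toy next-byte model, and the global entropy threshold are assumptions standing in for BLT's learned small language model and its entropy rule.

```python
import math
from typing import Callable, List, Sequence

def next_byte_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def toy_next_byte_probs(prefix: bytes) -> List[float]:
    """Toy stand-in for BLT's small byte-level LM: pretends the next byte is
    hard to predict right after whitespace, i.e. at the start of a new word."""
    if prefix and prefix[-1] in b" .,\n":
        return [1 / 256] * 256                 # flat distribution: 8 bits of entropy
    probs = [0.001] * 256
    probs[ord("e")] = 1.0 - 0.001 * 255        # peaked distribution: low entropy
    return probs

def entropy_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes], Sequence[float]],
    threshold: float = 4.0,                    # illustrative value, not from the paper
) -> List[bytes]:
    """Group bytes into patches, starting a new patch whenever the small
    model's next-byte entropy crosses the threshold."""
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(next_byte_probs(data[:i])) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

print(entropy_patches(b"byte latent transformer", toy_next_byte_probs))
# [b'byte ', b'latent ', b'transformer']
```

Each patch is then embedded and processed as a single step by the large latent transformer, so longer patches mean fewer expensive forward passes per byte of input.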

Most LLMs, like Llama, operate on a fixed vocabulary of tokens, with sequences of input bytes mapped onto tokens using heuristics. Tokenization is needed because training an LLM directly on raw bytes would require too much computation, but it has disadvantages. Besides struggling to count individual letters in words, tokenized models can have trouble handling multiple languages and understanding mistyped words.
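The contrast between the token view and the byte view is easy to see in a few lines of Python. This is only an illustration: it assumes the third-party tiktoken package, whose BPE vocabulary is OpenAI's rather than Llama's, as a stand-in for token-based vocabularies in general.

```python
import tiktoken  # pip install tiktoken; any BPE tokenizer makes the same point

text = "strawberry"

# Byte-level view (what BLT consumes): every character is an explicit symbol,
# so questions like "how many r's?" are trivial.
byte_seq = text.encode("utf-8")
print(list(byte_seq))             # one integer per character
print(byte_seq.count(ord("r")))   # 3

# Token-level view (what most LLMs consume): the model only sees opaque token
# IDs, and the individual letters are hidden inside multi-character tokens.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(text)
print(ids)                                              # token IDs
print([enc.decode_single_token_bytes(t) for t in ids])  # multi-byte chunks
```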

Meta did a series of experiments evaluating BLT against token-based models. They found that while a fixed inference compute budget determines a token-based model's size, increasing the patch size lets a BLT model grow larger within the same budget and therefore achieve better accuracy. They also found that BLT models outperformed Llama 3 on character-level tasks, such as handling noisy input or translating low-resource languages. However, when the researchers tried converting a Llama 3 model to BLT, instead of training a new model end-to-end, they found that it had a "significant" drop in performance on several LLM benchmarks.
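A rough back-of-envelope calculation shows why larger patches leave room for a larger model at a fixed inference budget. The numbers below are illustrative assumptions, not figures from Meta's paper, and the cost of BLT's small byte-level encoder and decoder is ignored; the only approximation used is the common rule that a forward pass costs about 2 FLOPS per parameter.

```python
# Fixed inference budget, expressed in FLOPs per byte of input text.
# 4e9 FLOPs/byte is roughly what an 8B-parameter token model spends if one
# token covers ~4 bytes (2 * 8e9 params / 4 bytes); purely illustrative.
BUDGET_FLOPS_PER_BYTE = 4e9

def max_params(patch_size_bytes: float) -> float:
    """Largest latent-transformer size affordable at the fixed budget when one
    forward pass (~2 * params FLOPs) is amortized over a whole patch."""
    return BUDGET_FLOPS_PER_BYTE * patch_size_bytes / 2

for patch in (4, 6, 8):
    print(f"patch size {patch} bytes -> ~{max_params(patch) / 1e9:.0f}B params")
# patch size 4 bytes -> ~8B params
# patch size 6 bytes -> ~12B params
# patch size 8 bytes -> ~16B params
```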

In a discussion about BLT on Reddit, several users pointed out how BLT could help models solve the "strawberry problem." Another user wrote:

[BLT] is 100% the way to go. Also makes multimodality easy since you can just represent any data or file in bytes, and there exist A LOT of files. One problem is that 2 MB would need a context size of 2 million, so the memory and compute requirements are not quite met yet.

The BLT training and inference code is available on GitHub.
