Researchers from Microsoft's Natural Language Computing (NLC) group announced BEiT-3, the latest version of their Bidirectional Encoder representation from Image Transformers (BEiT) model: a 1.9B-parameter vision-language AI model. BEiT-3 models images as another language and achieves state-of-the-art performance on a wide range of downstream tasks.
The model and experiments were described in a paper published on arXiv. The key idea in BEiT-3 is to model images as another language (which the authors call "Imglish"); this allows the model to be pretrained using only the masked language modeling (MLM) objective, and the training process can therefore be scaled up more easily. This unified architecture allows BEiT-3 to support a wide range of downstream tasks: in evaluation experiments, the model set new state-of-the-art performance records on several benchmarks, including semantic segmentation, cross-modal retrieval, and visual question answering. According to the Microsoft team:
BEIT-3 is simple and effective, and is a promising direction for scaling up multimodal foundation models. For future work, we are working on pretraining multilingual BEIT-3 and including more modalities (e.g., audio) in BEIT-3 to facilitate the cross-lingual and cross-modality transfer, and advance the big convergence of large-scale pretraining across tasks, languages, and modalities.
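To make the "Imglish" idea described above concrete, the sketch below shows, in simplified and hypothetical form, how discrete visual tokens and text tokens can share a single sequence and a single masked-modeling objective. The vocabulary sizes, masking ratio, and helper names are illustrative assumptions, not BEiT-3's actual implementation.

```python
import torch

# Hypothetical illustration of the "Imglish" idea: once image patches are
# mapped to discrete visual tokens (BEiT uses an image tokenizer for this),
# text tokens and visual tokens can be treated as one sequence and
# pretrained with a single masked-modeling objective.

TEXT_VOCAB = 50_000      # assumed text vocabulary size
VISUAL_VOCAB = 8_192     # assumed visual-token codebook size
MASK_ID = TEXT_VOCAB + VISUAL_VOCAB  # shared [MASK] token id

def mask_tokens(tokens: torch.Tensor, mask_prob: float = 0.4):
    """Randomly replace a fraction of tokens with [MASK]; return inputs and labels."""
    mask = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = MASK_ID
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))  # -100 = ignore
    return inputs, labels

# Toy example: a short text span followed by "Imglish" visual tokens from one image.
text_tokens = torch.randint(0, TEXT_VOCAB, (16,))
visual_tokens = torch.randint(TEXT_VOCAB, TEXT_VOCAB + VISUAL_VOCAB, (196,))  # 14x14 patches
sequence = torch.cat([text_tokens, visual_tokens])

inputs, labels = mask_tokens(sequence)
# A single model is then trained to predict `labels` at the masked positions,
# regardless of whether those positions hold text or visual tokens.
```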
Since its original publication in 2017, the Transformer model has become the preferred architecture for many natural language processing (NLP) tasks. This led to many researchers adopting the Transformer for vision tasks, and then combining both NLP and vision in a single model. However, these multi-modal architectures often maintain separate encoder modules for the different inputs and require multiple pretraining objectives beyond the standard MLM objective.
By contrast, BEiT-3 uses the Multiway Transformer architecture, which shares a single self-attention module across image and text data. The self-attention output is then routed to a modality-specific feedforward "expert" module. Because pretraining uses only the MLM objective, rather than contrastive objectives that typically require very large batches, the model can be trained with a smaller batch size, which reduces the amount of GPU memory required to process a training batch.
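As an illustration of that routing, here is a minimal PyTorch sketch of a Multiway-style block. For simplicity it routes an entire sequence by modality (in the actual model, routing is per token, and the top layers add a vision-language expert); the class and parameter names are hypothetical, not the BEiT-3 implementation.

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Sketch of a Multiway Transformer block: one self-attention module
    shared across modalities, with separate feed-forward "experts" for
    vision and language tokens. Sizes and names are illustrative only."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality, selected after attention.
        self.experts = nn.ModuleDict({
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "language": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Shared self-attention over the token sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route the result to the modality-specific expert.
        x = x + self.experts[modality](self.norm2(x))
        return x

# Example: image patch tokens use the vision expert, text tokens the
# language expert, while both share the same attention weights.
block = MultiwayBlock()
image_tokens = torch.randn(1, 196, 768)
text_tokens = torch.randn(1, 32, 768)
img_out = block(image_tokens, modality="vision")
txt_out = block(text_tokens, modality="language")
```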
BEiT-3 was pretrained on several publicly available text and image datasets, including ImageNet, COCO, and the contents of Wikipedia; overall, the data contained 160GB of text-only documents, 14M images, and 21M image-text pairs. The team evaluated the model on several vision and vision-language benchmark tasks: semantic segmentation on ADE20K; object detection, instance segmentation, image captioning, and retrieval on COCO; retrieval on Flickr30K; and visual question answering on VQAv2. BEiT-3 outperformed previous models on most of the tasks; a full list of results is available on Papers with Code.
Large vision-language models are an active research topic. InfoQ covered Google's Vision Transformer (ViT) in 2021 and DeepMind's Flamingo model earlier this year. In a Twitter discussion about BEiT-3, Google researcher and ViT co-author Lucas Beyer praised the BEiT-3 team, saying:
Impressive work, congrats! Happy to see someone else successfully training a ViT-G...and such good results on only 35M images does make the masked modeling task appealing!
Microsoft recently released the code and model weights for BEiT v2 on GitHub, and co-author Li Dong said on Twitter that BEiT-3 should be open-sourced soon.