Many large language models (LLMs), both closed and open source, have become available recently, which has in turn led to combined models known as multimodal LLMs (MLLMs). Yet few, if any, of them disclose the design choices that went into building them, say Apple researchers, who distilled principles and lessons for designing state-of-the-art (SOTA) multimodal LLMs.
Multimodal large language models are built by combining a large language model and a vision foundation model into a single model. MLLMs, which according to Apple researchers "are emerging as the next frontier in foundation models", take image and text inputs and generate text in a way that outperforms the foundation models they build upon.
Apple researchers focused on two aspects of the process that leads to the creation of MLLMs: decisions about the model architecture and choices for pre-training data.
On the first front, they found that image resolution, visual encoder loss and capacity, and visual encoder pre-training data are the three most important design aspects. By contrast, architectural decisions about how visual data is fed into the LLM do not seem to affect the resulting model's performance.
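One way to see why image resolution carries so much weight is to count how many visual tokens an encoder hands to the LLM. The sketch below is a rough illustration only, assuming a ViT-style encoder that cuts a square image into fixed-size patches; the patch size of 14 and the function name `num_image_tokens` are assumptions for the example, not details taken from the article.

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Count the patch tokens a ViT-style encoder yields for a square image.

    Assumes (hypothetically) non-overlapping square patches and no pooling.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # Doubling the resolution quadruples the number of visual tokens.
    for res in (224, 336, 448):
        print(res, num_image_tokens(res))
```

Under these assumptions the token count grows with the square of the resolution, so a higher-resolution input gives the language model far more visual detail to attend to, at a correspondingly higher compute cost.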
Regarding pre-training, the researchers analyzed three different types of data (image-caption, interleaved image-text, and text-only) and measured their effect in few-shot, zero-shot, and text-only settings. In a zero-shot setting, a model is expected to recognize and classify objects or concepts without having previously seen any examples of them. In a few-shot setting, the model instead has to make accurate predictions from only a very small number of labeled examples.
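The difference between the zero-shot and few-shot settings can be sketched as a difference in what the model is shown before the actual question. The example below is a minimal illustration, assuming the labeled examples are supplied in the prompt; the `Example` class, the `build_prompt` helper, and the `<image:...>` placeholder syntax are all hypothetical and do not reflect MM1's real input format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    image: str      # placeholder reference to an image, e.g. a file name
    question: str
    answer: str


def build_prompt(query_image: str, query_question: str,
                 shots: Sequence[Example] = ()) -> str:
    """Build a text prompt; with no shots this is the zero-shot case."""
    lines = []
    # Each labeled example becomes one solved question in the prompt.
    for ex in shots:
        lines.append(f"<image:{ex.image}> Q: {ex.question} A: {ex.answer}")
    # The final line is the unsolved query the model must complete.
    lines.append(f"<image:{query_image}> Q: {query_question} A:")
    return "\n".join(lines)


if __name__ == "__main__":
    shots = [
        Example("cat.jpg", "What animal is this?", "A cat"),
        Example("dog.jpg", "What animal is this?", "A dog"),
    ]
    print(build_prompt("bird.jpg", "What animal is this?"))         # zero-shot
    print(build_prompt("bird.jpg", "What animal is this?", shots))  # few-shot
```

The query is identical in both calls; only the number of solved examples that precede it changes.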
The outcome was that interleaved and text-only training data are key for few-shot and text-only performance, while caption data is key for zero-shot performance.
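Because each data type helps a different setting, a pre-training run has to decide how much of each to draw. A minimal sketch of weighted sampling across the three data types follows; the weights in `MIXTURE` are illustrative placeholders, not figures reported in the article, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import Dict, List

# Illustrative placeholder weights only; not values from the article.
MIXTURE: Dict[str, float] = {
    "image-caption": 0.4,
    "interleaved image-text": 0.4,
    "text-only": 0.2,
}


def sample_sources(mixture: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Pick the data source for each of n training examples by weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_sources(MIXTURE, 10_000)
    # Observed counts should roughly track the configured weights.
    print(Counter(draws))
```

Shifting weight toward caption data or toward interleaved and text-only data is the kind of knob the researchers' findings speak to.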
To validate their findings, the researchers built a family of models, dubbed MM1, that outperforms current state-of-the-art models, including Emu2, Flamingo, and IDEFICS. Benchmarking was done on captioning, where the model provides a descriptive caption of an image, and on visual question answering, where the model answers questions about an image and helps understand its content.
"Thanks to large-scale multimodal pre-training [...] MM1 enjoys appealing properties such as in-context predictions, multi-image and chain-of-thought reasoning. MM1 also enables strong few-shot learning capability after instruction tuning. These strong results demonstrate that the presented recipe for building MLLMs translates the design principles to a competitive model at scale."
As the researchers explain in their paper, to reach these levels of performance with MM1, they investigated different image encoders as well as ways of connecting them to LLMs; different types of data and how to weight them; and how to train the MLLM, including its hyperparameters. Their results include insights into the importance of image resolution, model size, and training data composition, which they hope can provide a solid foundation for the community to build stronger models across multiple architectures and data strategies.