Researchers at Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) have open-sourced 4M-21, a single any-to-any AI model that can handle 21 input and output modalities. 4M-21 performs well "out of the box" on several vision benchmarks and is available under the Apache 2.0 license.
4M-21 is a 3B-parameter Transformer-based encoder-decoder model. All 21 input modalities are mapped to discrete tokens using modality-specific tokenizers, and the model can generate any output modality given any input modality. The model was trained on around 500 million samples of multimodal data, including COYO and C4. Out of the box, 4M-21 can perform a wide range of tasks, including steerable image generation and image retrieval. On vision benchmarks including semantic segmentation and depth estimation, it outperformed comparable baseline models. According to Apple:
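The token-based, any-to-any design can be pictured with a short sketch. The following Python snippet is purely illustrative: the modality names, vocabulary sizes, offset scheme, and model dimensions are assumptions for explanation, not the actual 4M-21 implementation.

```python
# Hypothetical sketch of an any-to-any token interface: every modality is mapped
# to discrete tokens, and one encoder-decoder predicts tokens of the target modality.
import torch
import torch.nn as nn

# Each modality gets its own token vocabulary; ids are offset into one shared
# vocabulary so a single embedding table can be used (an assumed design choice).
MODALITY_VOCAB_SIZES = {"caption": 30_000, "rgb": 16_384, "depth": 16_384, "segmentation": 16_384}

def vocab_offsets(sizes):
    offsets, total = {}, 0
    for name, size in sizes.items():
        offsets[name] = total
        total += size
    return offsets, total

OFFSETS, TOTAL_VOCAB = vocab_offsets(MODALITY_VOCAB_SIZES)

class AnyToAnyModel(nn.Module):
    """Toy encoder-decoder over discrete tokens from arbitrary input modalities."""
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(TOTAL_VOCAB, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, TOTAL_VOCAB)

    def forward(self, src_tokens, tgt_tokens):
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.head(out)  # logits over the shared token vocabulary

# Example: caption tokens in, image (RGB) tokens out; ids are random placeholders.
model = AnyToAnyModel()
caption_ids = torch.randint(0, MODALITY_VOCAB_SIZES["caption"], (1, 16)) + OFFSETS["caption"]
rgb_ids = torch.randint(0, MODALITY_VOCAB_SIZES["rgb"], (1, 32)) + OFFSETS["rgb"]
logits = model(caption_ids, rgb_ids)
print(logits.shape)  # (1, 32, TOTAL_VOCAB)
```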
The resulting model demonstrates the possibility of training a single model on a large number of diverse modalities/tasks without any degradation in performance and significantly expands the out-of-the-box capabilities compared to existing models. Adding all these modalities enables new potential for multimodal interaction, such as retrieval from and across multiple modalities, or highly steerable generation of any of the training modalities, all by a single model.
4M-21 builds on Apple's earlier model, Massively Multimodal Masked Modeling (4M), which handled only seven modalities. The new model triples the number of modalities, which include text and pixel data, as well as "multiple types of image, semantic and geometric metadata." Each modality has a dedicated tokenizer; text modalities use a WordPiece tokenizer, while image modalities use variational auto-encoders (VAEs). The model is trained using a single objective: "a per-token classification problem using the cross-entropy loss."
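The quoted objective amounts to predicting the id of each target token and scoring the prediction with cross-entropy, as in standard language modeling. The snippet below is a minimal sketch of that idea; the shapes, vocabulary size, and the ignore-index convention for non-target positions are assumptions, not details taken from the 4M-21 code.

```python
# Minimal sketch of a per-token classification objective with cross-entropy loss.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 32, 79_152
logits = torch.randn(batch, seq_len, vocab)          # model predictions for each position
targets = torch.randint(0, vocab, (batch, seq_len))  # ground-truth token ids
targets[:, :8] = -100                                # positions not selected as targets are ignored

loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1), ignore_index=-100)
print(loss.item())
```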
By allowing inputs with multiple modalities and chaining operations, 4M-21 supports fine-grained image editing and generation. For example, providing a text caption input will prompt the model to generate the described image. Users can control details of the generated image by including geometric input such as bounding boxes, segmentation maps, or human poses along with the caption. The model can also perform image retrieval based on different inputs, for example finding images that match a given caption or semantic segmentation map.
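Cross-modal retrieval of this kind is commonly implemented by embedding the query, whatever its modality, and ranking stored image embeddings by similarity. The sketch below illustrates that general pattern; the function names and embedding dimensions are hypothetical stand-ins, not the 4M-21 interfaces.

```python
# Hypothetical sketch of cross-modal retrieval: embed a query from any modality
# and rank a gallery of image embeddings by cosine similarity.
import torch
import torch.nn.functional as F

def retrieve(query_embedding: torch.Tensor, gallery_embeddings: torch.Tensor, top_k: int = 5):
    """Return indices of the top_k gallery items most similar to the query."""
    query = F.normalize(query_embedding, dim=-1)
    gallery = F.normalize(gallery_embeddings, dim=-1)
    scores = gallery @ query            # cosine similarity against every gallery item
    return scores.topk(top_k).indices

# Placeholder embeddings: in practice these would come from the model's encoder,
# whether the query is a caption, a segmentation map, or another image.
query = torch.randn(512)
gallery = torch.randn(10_000, 512)
print(retrieve(query, gallery))
```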
Research team member Amir Zamir posted about the work in a thread on X. One user asked Zamir why the model does not support audio modalities. Zamir replied, "It’s a matter of data," and suggested that their method should work with audio. He also wrote:
IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence.
Andrew Ng's AI newsletter The Batch also covered 4M-21, saying:
The limits of this capability aren’t clear, but it opens the door to fine control over the model’s output. The authors explain how they extracted the various modalities; presumably users can do the same to prompt the model for the output they desire. For instance, a user could request an image by entering not only a prompt but also a color palette, edges, depth map extracted from another image, and receive output that integrates those elements.
The code and model weights for 4M-21 are available on GitHub.