NVIDIA Unveils NVLM 1.0: Open-Source Multimodal LLM with Improved Text and Vision Capabilities

NVIDIA unveiled NVLM 1.0, an open-source multimodal large language model (LLM) that performs well on both vision-language and text-only tasks. Unusually among current models, NVLM 1.0 improves on text-based tasks after multimodal training rather than regressing. The model weights are now available on Hugging Face, and the training code is set to be released shortly.

NVLM 1.0 has been evaluated against proprietary and open-access multimodal models and performs well on both vision-language and text-only tasks. In particular, the NVLM-1.0-D 72B model shows an average 4.3-point improvement in accuracy on math and coding tasks after multimodal training. This contrasts with models like InternVL2-Llama3-76B, which lose performance in text-only tasks following multimodal training. The text improvements seen in NVLM suggest that its architecture manages multimodal data effectively without undermining its original language abilities.


Source: https://nvlm-project.github.io/

The NVLM-1.0-D 72B model is not limited to text; it handles a wide range of multimodal tasks, including object localization, reasoning, OCR (optical character recognition), and coding based on visual inputs. The model can interpret complex scenarios, such as understanding visual humor or answering location-sensitive questions about images. Its ability to perform mathematical reasoning from handwritten pseudocode, along with its handling of other multimodal inputs, highlights the breadth of tasks it can manage.

User Imjustmisunderstood reflected on NVLM’s potential for deeper understanding:

Extending tokenization to more ‘senses’ exponentially increases dimensionality. I would be fascinated to see whether the latent space recognizes the common temporal dimension in different modalities.

This touches on the broader implications of working with multimodal data, suggesting that models like NVLM could offer new ways of connecting different types of information.

Overall, NVLM 1.0 received very positive feedback from the community. For example, Luênya dos Santos shared the following thoughts:

NVIDIA’s NVLM-D-72B is a huge leap in AI innovation. NVIDIA's decision to open-source the model is a game-changer, giving smaller teams access to cutting-edge technology and pushing the boundaries of AI development! Really exciting news.

John McDonald added:

By making the model weights publicly available and promising to release the training code, Nvidia breaks from the trend of keeping advanced AI systems closed.

NVLM 1.0 is available to the AI community as an open-source model, with weights accessible through Hugging Face. The training code will be released soon, allowing further exploration of the model’s capabilities.
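For readers who want to experiment, the weights can be pulled directly from Hugging Face. Below is a minimal sketch, assuming the model is published under the repo id nvidia/NVLM-D-72B and loads through the standard transformers AutoModel/AutoTokenizer API with trust_remote_code enabled; the chat() helper mirrors the InternVL-style interface the model is derived from, and the exact names and options should be verified against the model card:

    # Minimal sketch: loading NVLM-D-72B from Hugging Face (assumed repo id).
    import torch
    from transformers import AutoModel, AutoTokenizer

    repo_id = "nvidia/NVLM-D-72B"  # assumption: verify against the model card

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # 72B parameters; bf16 halves memory vs fp32
        low_cpu_mem_usage=True,
        device_map="auto",           # shard the model across available GPUs
        trust_remote_code=True,      # the repo ships custom modeling code
    ).eval()

    # Text-only query; the custom model class exposes a chat() helper
    # (InternVL-style signature -- an assumption; check the model card).
    question = "Explain the difference between OCR and object localization."
    generation_config = dict(max_new_tokens=128, do_sample=False)
    response = model.chat(tokenizer, None, question, generation_config)
    print(response)

Passing None in place of pixel values runs a text-only query; supplying a preprocessed image tensor instead would exercise the vision-language path.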
