Nexa AI unveiled Omnivision, a compact vision-language model tailored for edge devices. By cutting the number of image tokens from 729 to 81, Omnivision lowers latency and computational requirements while maintaining strong performance in tasks such as visual question answering and image captioning. The model’s architecture integrates a Qwen2.5-0.5B language backbone, a SigLIP-400M vision encoder, and an optimized projection layer to process multimodal inputs.
Omnivision’s architecture is designed for efficient multimodal processing and is built around three core components. The Qwen2.5-0.5B model serves as the backbone for processing text inputs, while the SigLIP-400M vision encoder generates image embeddings from input images. The encoder operates at 384×384 resolution with a 14×14 patch size, producing 729 patch embeddings per image. A projection layer, implemented as a multi-layer perceptron (MLP), then maps these embeddings into the language model’s token space so that visual and textual inputs can be processed together.
[Image: Omnivision architecture overview. Source: Nexa AI Blog]
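To make the data flow concrete, below is a minimal sketch of how the three components could fit together, assuming a common vision-language wiring in which compressed image embeddings are concatenated with text embeddings before entering the language model. The stand-in modules (`StubSigLIP`, `Projector`) and the 3×3 patch-grouping scheme are illustrative assumptions, not Nexa AI’s published implementation; only the shapes follow the components’ standard configurations (384×384 input at 14×14 patches gives a 27×27 grid of 729 SigLIP tokens of width 1152, and Qwen2.5-0.5B uses 896-dimensional token embeddings).

```python
import torch
import torch.nn as nn

class StubSigLIP(nn.Module):
    """Stand-in for SigLIP-400M: image -> [batch, 729, 1152] patch embeddings."""
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return torch.randn(images.shape[0], 729, 1152)

class Projector(nn.Module):
    """Folds each 3x3 patch neighborhood into one wider token, then projects it
    into the language model's embedding space (729 tokens -> 81 tokens)."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 896):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * 9, llm_dim),  # 9 patches folded per token
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, d = x.shape                    # [batch, 729, 1152]
        x = x.view(b, 27, 27, d)             # restore the 27x27 patch grid
        x = x.view(b, 9, 3, 9, 3, d)         # carve the grid into 3x3 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 81, 9 * d)
        return self.mlp(x)                   # [batch, 81, 896]

images = torch.randn(1, 3, 384, 384)               # one 384x384 RGB image
image_embeds = Projector()(StubSigLIP()(images))   # [1, 81, 896]
text_embeds = torch.randn(1, 16, 896)              # stand-in prompt embeddings
llm_input = torch.cat([image_embeds, text_embeds], dim=1)
print(llm_input.shape)                             # torch.Size([1, 97, 896])
```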
A key innovation is the ninefold reduction in image tokens, from 729 to 81, which cuts processing requirements without compromising accuracy. For example, Omnivision can caption a high-resolution image in under two seconds on a MacBook M4 Pro while using less than 1 GB of RAM. To improve accuracy and reliability, the model is trained with Direct Preference Optimization (DPO) on high-quality datasets, which minimizes hallucinations and makes its predictions more trustworthy.
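As rough, illustrative arithmetic (not Nexa AI’s reported measurements), the token cut shrinks the image portion of the input sequence ninefold, and because self-attention cost grows roughly quadratically with sequence length, the savings on attention among image tokens can be far larger:

```python
# Back-of-the-envelope effect of the token reduction; illustrative only.
full_tokens, reduced_tokens = 729, 81
print(full_tokens / reduced_tokens)        # 9.0x fewer image tokens to process
# Self-attention over n tokens costs O(n^2), so attention among the image
# tokens themselves drops by roughly:
print(full_tokens**2 / reduced_tokens**2)  # 81.0x
```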
The model's training pipeline is structured in three stages. Pretraining aligns visual and textual inputs to establish foundational capabilities. Supervised fine-tuning then enhances the model's ability to interpret context and generate relevant responses. Finally, DPO refines decision-making by minimizing inaccuracies and improving precision in context-specific outputs.
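For reference, the DPO stage optimizes the standard preference objective introduced by Rafailov et al. (2023). The sketch below shows that generic loss with dummy inputs; it reflects the published DPO formulation, not Nexa AI's training code.

```python
import torch
import torch.nn.functional as F

# Generic DPO loss: given summed log-probabilities of a preferred ("chosen")
# and a dispreferred ("rejected") response under the policy being trained and
# a frozen reference model, push the policy to widen the preference margin
# relative to the reference. beta is the usual DPO temperature.
def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Dummy log-probs for a batch of four preference pairs:
batch = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*batch))
```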
Omnivision outperformed its predecessor, nanoLLAVA, in benchmark evaluations across datasets including ScienceQA, MM-VET, and POPE. It achieved notable improvements, including 71.0% accuracy on the ScienceQA test set and 93.3% on the POPE benchmark, demonstrating its reliability in complex reasoning tasks.
[Image: benchmark results versus nanoLLAVA. Source: Nexa AI Blog]
Currently, Omnivision focuses on visual question answering and image captioning. However, Nexa AI revealed plans to expand the model’s capabilities to support optical character recognition (OCR). In a recent Reddit discussion, user AzLy shared:
Currently, OCR is not one of this model's intended uses. It is mainly for visual question answering and image captioning. However, supporting better OCR is our next step.
Omnivision can be deployed locally using the Nexa-SDK, an open-source framework that supports a wide range of multimodal tasks. The model is still in early development, and the team is actively gathering feedback from users to guide future improvements.
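For those who want to try it locally, a run might look like the sketch below. The import path, class name, inference method, and model identifier are all assumptions modeled on Nexa-SDK’s general usage pattern for its vision-language models; consult the nexa-sdk README for the actual API before running.

```python
# Hypothetical local inference with the Nexa-SDK Python bindings. Every name
# below (import path, class, method, model identifier) is an assumption based
# on the SDK's general usage pattern; verify against the nexa-sdk docs.
from nexa.gguf import NexaVLMInference  # assumed import path

vlm = NexaVLMInference(model_path="omnivision")                # assumed identifier
answer = vlm.inference("What is in this image?", "photo.jpg")  # assumed method
print(answer)
```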