LLaVA-CoT Shows How to Achieve Structured, Autonomous Reasoning in Vision Language Models

Researchers from several Chinese institutions fine-tuned Llama-3.2-11B-Vision-Instruct to improve its ability to solve multimodal reasoning problems by going beyond direct-response and chain-of-thought (CoT) approaches to reason step by step in a structured way. Named LLaVA-CoT, the new model outperforms its base model and proves better than larger models, including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct, on a number of benchmarks.

According to the researchers, one reason why vision language models (VLMs) often hallucinate or produce errors is the lack of systematic and structured reasoning:

Specifically, by systematic, we mean that the model does not generate a direct reasoning chain but instead engages in multistage reasoning. Structured, on the other hand, refers to the model’s ability to clearly identify the reasoning stage it is in and understand the primary task to be addressed at each stage.

The approach taken by the authors consists of designing LLaVA-CoT so it reasons through four stages: a summary, where the model summarizes the current task; a caption, describing the relevant parts of the image; reasoning, where the model analyzes the question; and a conclusion, which provides the final response based on the reasoning stage. In other words, the model first organizes the problem and all known information, then carries out a detailed thought process, and finally derives a conclusion.
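This kind of staged output is straightforward to post-process. The following Python sketch shows how a response delimited into four stages could be parsed; the <SUMMARY>/<CAPTION>/<REASONING>/<CONCLUSION> tag names are assumptions made for illustration and may differ from the model's actual delimiters.

```python
import re

# Illustrative stage delimiters for a LLaVA-CoT-style structured response.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a structured response into its four reasoning stages."""
    stages = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        stages[stage.lower()] = match.group(1).strip() if match else None
    return stages

example = (
    "<SUMMARY>The task asks how many apples are in the picture.</SUMMARY>"
    "<CAPTION>The image shows a bowl containing red and green apples.</CAPTION>"
    "<REASONING>Counting the visible apples: three red and two green, five in total.</REASONING>"
    "<CONCLUSION>There are five apples.</CONCLUSION>"
)
print(parse_staged_response(example)["conclusion"])  # -> "There are five apples."
```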

To make this possible, the researchers constructed a dedicated dataset, LLaVA-o1-100k, by using GPT-4o to generate responses stage by stage. The custom dataset includes data from both general-purpose visual question answering (VQA) datasets and science-targeted VQA datasets. They then used the generated dataset to perform a full-parameter, supervised fine-tuning of Llama-3.2-11B-Vision-Instruct.
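As a rough illustration of how such staged training data could be produced, the sketch below calls GPT-4o through the OpenAI API and asks it to answer a single VQA example in four tagged stages. The prompt wording and the generate_staged_answer helper are assumptions for illustration, not the exact pipeline used to build LLaVA-o1-100k.

```python
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

# Hypothetical instruction asking GPT-4o to answer stage by stage, loosely
# following the four-stage structure described in the paper.
STAGE_PROMPT = (
    "Answer the question about the image in four tagged stages: "
    "<SUMMARY> restate the task </SUMMARY>, "
    "<CAPTION> describe the relevant image content </CAPTION>, "
    "<REASONING> reason step by step </REASONING>, "
    "<CONCLUSION> give the final answer </CONCLUSION>."
)

def generate_staged_answer(question: str, image_url: str) -> str:
    """Ask GPT-4o for a stage-by-stage answer to one VQA example."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{STAGE_PROMPT}\n\nQuestion: {question}"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```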

Additionally, LLaVA-CoT uses a novel approach to efficient inference-time scaling. Instead of applying beam search at the sentence level, it applies it at the stage level, generating multiple candidate results for each stage. The most promising candidate is then selected to continue the generation process at the next stage. According to the authors, inference-time scaling makes it possible for the model to arrive at a concrete answer during the reasoning process and retain it for the final stage. Without it, the model might have to guess at the final stage, possibly leading to incorrect results.

Stage-level beam search, which is made possible by the structured output design of [LLaVA-CoT], is an effective and powerful approach for inference time scaling.
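The following is a minimal sketch of how stage-level beam search could work: at each stage the model samples several candidate continuations, the best one is kept, and only that candidate is carried forward into the next stage. The sample_stage and score_candidate functions are hypothetical stand-ins for model generation and candidate selection (the authors describe selecting among candidates rather than scoring them with a fixed metric).

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def sample_stage(context: str, stage: str) -> str:
    """Placeholder for sampling one candidate continuation for a stage."""
    return f"<{stage.upper()}>candidate {random.randint(0, 999)}</{stage.upper()}>"

def score_candidate(context: str, candidate: str) -> float:
    """Placeholder for ranking candidates (e.g., via model-based comparison)."""
    return random.random()

def stage_level_beam_search(question: str, beam_width: int = 4) -> str:
    """Generate a structured answer stage by stage, keeping the best candidate."""
    context = question
    for stage in STAGES:
        candidates = [sample_stage(context, stage) for _ in range(beam_width)]
        best = max(candidates, key=lambda c: score_candidate(context, c))
        context += "\n" + best  # only the winning candidate is carried forward
    return context

print(stage_level_beam_search("How many apples are in the picture?"))
```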

To assess their approach, the researchers compared LLaVA-CoT's performance to both its base model and other models. They found that LLaVA-CoT provides notable improvements over its base model across general VQA, mathematical reasoning, scientific VQA, and hallucination-control tasks. Additionally, LLaVA-CoT appears to outperform many open-source models of similar or even larger size, such as InternVL2-8B, Ovis1.5-Gemma2-9B, MiniCPM-V2.6-8B, Llama-3.2-90B-Vision-Instruct, and VILA-1.5-40B, as well as closed-source models such as GPT-4o-mini and Gemini-1.5-pro.

LLaVA-CoT is available on Hugging Face, while the LLaVA-o1-100k dataset will be made public in the future, the authors say. A web app is also available that lets users upload an image and start chatting about it.
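Since LLaVA-CoT is a fine-tune of Llama-3.2-11B-Vision-Instruct, it could presumably be run locally with the Hugging Face transformers library along the lines of the sketch below. The MODEL_ID is a placeholder, not the actual repository name, and the exact loading options may differ for the published checkpoint.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "your-org/llava-cot-11b"  # placeholder; use the actual Hugging Face repo id

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many apples are in the picture?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```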
