Researchers from Microsoft, the University of Wisconsin–Madison, and Columbia University have open-sourced Large Language and Vision Assistant (LLaVA). LLaVA combines a CLIP image encoder with a LLaMA language decoder and is fine-tuned on a synthetic instruction-following dataset; it achieved state-of-the-art accuracy on the ScienceQA benchmark.
The researchers used GPT-4 to generate the instruction-following dataset, which contains virtual conversations between a human user and an AI assistant about the content of images. This dataset was used to fine-tune the LLaVA model, which consists of two foundation models: CLIP for vision and LLaMA for language, with an additional network layer to tie the two together. The team also used GPT-4 to evaluate LLaVA's responses in experiments, by asking it to rate LLaVA's output on a scale of 1 to 10. When further fine-tuned on the ScienceQA training dataset, LLaVA achieved an accuracy of 92.53%, a new record for the benchmark. According to the researchers,
This paper demonstrates the effectiveness of visual instruction tuning using language-only GPT-4. We have presented an automatic pipeline to create language-image instruction-following data, based on which we train LLaVA, a multimodal model to follow human intent to complete visual tasks. It achieves [an] excellent visual chat experience when fine-tuned on multimodal chat data.
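The GPT-4-based evaluation mentioned above amounts to prompting a text-only GPT-4 with a question and LLaVA's answer and asking for a numeric score. Below is a minimal sketch of that judging pattern, assuming the OpenAI Python client (openai>=1.0); the prompt wording and the judge_response helper are illustrative and not the exact evaluation setup used by the LLaVA authors.

```python
# Minimal sketch of GPT-4-as-judge scoring, assuming the OpenAI Python client
# (openai>=1.0). The prompt wording and helper name are illustrative, not the
# exact evaluation prompt used in the LLaVA paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(question: str, model_answer: str) -> int:
    """Ask GPT-4 to rate a model's answer on a scale of 1 to 10."""
    prompt = (
        "Rate the assistant's answer to the following question on a scale of "
        "1 to 10, where 10 is fully correct and helpful. Reply with a single "
        "integer.\n\n"
        f"Question: {question}\n"
        f"Assistant answer: {model_answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(reply.choices[0].message.content.strip())
```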
The technique of fine-tuning large language models (LLMs) with instruction-following datasets has led to gains in performance, as demonstrated by ChatGPT, and has prompted researchers to explore this technique with smaller LLMs. InfoQ recently reported on LLaMA, which has only 7B parameters compared to GPT-3's 175B but can outperform GPT-3 on many tasks. The next step in the development of AI assistants has been adding the ability to handle image data, as shown by the release of GPT-4 and Visual ChatGPT.
The LLaVA team's goal was to train a model end-to-end with visual instruction tuning. To do this, the researchers started with images drawn from the COCO dataset. Because the images are annotated with captions and object bounding boxes, the team fed this data into a text-only GPT-4 along with prompts asking GPT-4 to output instruction-following data, including: imagined conversations between a person and an assistant, questions about the details of the image content, and questions requiring reasoning about the image content. Overall, the generated dataset contains 158K samples.
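As a rough illustration of this pipeline, the sketch below builds a text-only prompt from an image's captions and bounding boxes and asks GPT-4 to write a conversation about the image. It assumes the OpenAI Python client, and the prompt text, function name, and box format are illustrative rather than the released LLaVA prompts.

```python
# Sketch of the text-only prompting idea: GPT-4 sees only an image's captions
# and object bounding boxes and is asked to produce an instruction-following
# conversation about the image. Prompt text and names are illustrative.
from openai import OpenAI

client = OpenAI()

def generate_conversation(captions: list[str], boxes: list[dict]) -> str:
    """captions: COCO captions; boxes: e.g. {"label": "dog", "bbox": [x, y, w, h]}."""
    context = "\n".join(captions) + "\n" + "\n".join(
        f'{b["label"]}: {b["bbox"]}' for b in boxes
    )
    prompt = (
        "You can see an image described only by the captions and object "
        "bounding boxes below. Write a multi-turn conversation between a "
        "curious human and an AI assistant about this image, as if the "
        "assistant were looking at the image directly.\n\n" + context
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```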
LLaVA Architecture. Image Source: https://arxiv.org/abs/2304.08485
The LLaVA architecture consists of a CLIP foundation model followed by a projection matrix layer that converts images into the word embedding space; textual input is also mapped into the same space. The image and word tokens are then passed to a LLaMA decoder, which produces the output. A pre-training stage first trains only the projection matrix, and a subsequent fine-tuning stage updates both the projection layer and the LLaMA decoder weights; the CLIP weights remain frozen throughout.
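A rough PyTorch sketch of this wiring is shown below. The class name and dimensions (a 1024-dimensional CLIP feature, a 4096-dimensional LLaMA embedding) are illustrative assumptions, and the language model is assumed to accept precomputed input embeddings, as Hugging Face's LLaMA implementation does.

```python
# Sketch of LLaVA-style wiring: frozen CLIP visual features are mapped by a
# single trainable projection matrix into the LLM's word-embedding space and
# prepended to the text tokens. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP ViT, kept frozen
        self.language_model = language_model              # e.g. a LLaMA decoder
        self.projection = nn.Linear(vision_dim, llm_dim)  # the trainable projection matrix

        for p in self.vision_encoder.parameters():        # CLIP weights stay frozen
            p.requires_grad = False

    def forward(self, pixel_values, text_embeds):
        # Frozen patch-level visual features, projected into the LLM's
        # word-embedding space so they can be treated as ordinary tokens.
        with torch.no_grad():
            visual_feats = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        visual_tokens = self.projection(visual_feats)          # (B, N, llm_dim)

        # Prepend the visual tokens to the text-token embeddings and decode.
        # Assumes the language model accepts precomputed embeddings via
        # inputs_embeds, as Hugging Face's LLaMA does.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

In this scheme, pre-training would update only the projection layer, while fine-tuning would unfreeze the LLaMA weights as well, with CLIP frozen in both stages.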
LLaVA co-author Chunyuan Li answered several questions about the work on Twitter. When some users compared LLaVA to MiniGPT-4, Li pointed out that LLaVA could reproduce image-based results from the GPT-4 paper, which MiniGPT-4 could not. He also said:
LLaVA has rigorous quantitative results, including the level of similarity with Visual Chat and GPT-4, the SoTA accuracy on Science QA, and ablation studies on data iteration and model design. Mini GPT-4, on the other hand, lacks quantitative results....Last, it should be clarified [that] the focus of this line of work is data-centric, not model-centric. As the differences in models are diminishing, data quality has a greater impact on results. We released our multi-modal instruction following data, to replicate Multimodal GPT-4.
The LLaVA source code is available on GitHub, and an interactive demo is available on the project site. The LLaVA training data and model weights are available on Hugging Face. The model uses delta weights on top of LLaMA and "should not be used outside of research purposes."
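The delta-weight arrangement means the released files are not a usable model on their own: conceptually, a user who already has the base LLaMA weights reconstructs LLaVA by adding the deltas to the matching base parameters, roughly as in the sketch below. The file names are placeholders, and the LLaVA repository provides its own script for this step.

```python
# Conceptual sketch of merging delta weights with base LLaMA weights.
# File names are placeholders; the LLaVA repo ships its own apply-delta tooling.
import torch

base = torch.load("llama_base_state_dict.pt")    # original LLaMA parameters
delta = torch.load("llava_delta_state_dict.pt")  # released LLaVA deltas

merged = {}
for name, tensor in delta.items():
    if name in base and base[name].shape == tensor.shape:
        merged[name] = base[name] + tensor  # shared parameters: base + delta
    else:
        # Parameters new to LLaVA (e.g. the projection layer) or resized
        # tensors are taken from the delta file directly in this simplified sketch.
        merged[name] = tensor

torch.save(merged, "llava_state_dict.pt")
```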