Salesforce Research recently open-sourced LAnguage-VISion (LAVIS), a unified library for deep-learning language-vision research. LAVIS supports more than 10 language-vision tasks on 20 public datasets and includes pre-trained model weights for over 30 fine-tuned models.
The release was announced on the Salesforce Research blog. LAVIS features a modular design that allows for easy integration of new models and provides standard interfaces for model inference. The built-in models, which are trained on public datasets, allow researchers to use LAVIS as a benchmark for evaluating their own original work; the models could also be used as-is in an AI application. LAVIS also includes other tools, such as utilities and GUIs for downloading and browsing common public training datasets. According to the Salesforce team:
[We built LAVIS] to make accessible the emerging language-vision intelligence and capabilities to a wider audience, promote their practical adoptions, and reduce repetitive efforts in future development.
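For readers who want a quick inventory of what the library ships, the short sketch below lists the bundled architectures and their checkpoint types; it assumes the model_zoo helper described in the project README.

from lavis.models import model_zoo
# print a table of supported architectures and model types, e.g. blip_caption / base_coco
print(model_zoo)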
Multimodal deep-learning models, especially those that perform combined language-vision tasks, are an active research area. InfoQ has covered several state-of-the-art models such as OpenAI's CLIP and DeepMind's Flamingo. However, the Salesforce researchers note that training and evaluating such models can be difficult for new practitioners, because of inconsistencies "across models, datasets and task evaluations." There have been other efforts to create toolkits similar to LAVIS, notably Microsoft's UniLM and Meta's MMF and TorchMultimodal.
LAVIS supports language-vision tasks in seven different categories: end-to-end pre-training, multimodal retrieval, captioning, visual question answering, multimodal classification, visual dialogue, and multimodal feature extraction. These tasks are performed by fine-tuned models based on four different foundation models, which include OpenAI's CLIP as well as three models developed by Salesforce: ALign BEfore Fuse (ALBEF), Bootstrapping Language-Image Pretraining (BLIP), and ALign and PROmpt (ALPRO).
Image Source: https://blog.salesforceairesearch.com/lavis-language-vision-library/
The figure above shows the high-level architecture of LAVIS. In addition to models, the library exposes pre-processors for text and image inputs, which are applied before data is passed to a model. The code below shows an example of using LAVIS to generate a caption for an input image:
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load the image to caption; the file path here is a placeholder for any RGB image
raw_image = Image.open("merlion.png").convert("RGB")
# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
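The other task categories listed above follow the same load, preprocess, and infer pattern. The snippet below is a sketch of visual question answering with a BLIP model fine-tuned on VQAv2, modeled on the library's documented examples; exact argument names may vary between releases.

# load a BLIP VQA model with checkpoints fine-tuned on the VQAv2 dataset
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
# preprocess the image and the question text with the matching processors
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"]("Which city is this photo taken in?")
# generate a free-form answer; the result is a list with one answer string per input
model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")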
Besides pre-trained models, LAVIS includes several tools for researchers interested in developing new models. These include scripts for downloading training and test datasets, code for training and evaluating the included models, and benchmark results on the test datasets. The documentation includes tutorials on how to add new modules to the library, including datasets, pre-processors, models, and tasks.
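As a hedged illustration of the dataset tooling, the snippet below builds one of the bundled dataset loaders; the load_dataset helper and the "coco_caption" name follow the library's documentation, and the call assumes the underlying COCO images have already been downloaded with the provided scripts.

from lavis.datasets.builders import load_dataset
# build the MS COCO captioning dataset; the result holds "train", "val" and "test" splits
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())
# inspect one annotated training example (image, caption text, and image id)
print(coco_dataset["train"][0])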
Salesforce is actively developing LAVIS, and recently added a new visual question answering model, PNP-VQA. Instead of fine-tuning models, this new framework focuses on zero-shot learning. A pre-trained image model generates "question-guided" captions, which are passed to a pre-trained language model as the context for answering the questions. According to Salesforce, "With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2."
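The new model plugs into the same interface as the examples above. The sketch below is an assumption: the "pnp_vqa" name, the model_type value, and the predict_answers arguments are modeled on LAVIS conventions rather than confirmed against the release, so they may differ in practice.

# hypothetical usage sketch; model name and arguments are assumptions based on LAVIS conventions
model, vis_processors, txt_processors = load_model_and_preprocess(name="pnp_vqa", model_type="base", is_eval=True, device=device)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"]("What is spewing water into the air?")
# the image model first produces question-guided captions, which the language model
# then uses as context to answer the question, with no task-specific fine-tuning
model.predict_answers(samples={"image": image, "text_input": [question]}, inference_method="generate")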
The LAVIS source code is available on GitHub, as are several Jupyter notebooks demonstrating its use in a variety of language-vision tasks. A web-based demo of the GUI is said to be "coming soon."