Researchers from Adobe and the University of North Carolina (UNC) have open-sourced CLIP-S, an image-captioning AI model that produces fine-grained descriptions of images. In human evaluations comparing its output with captions generated by other models, judges preferred the captions generated by CLIP-S a majority of the time.
The model and experiments were described in a paper submitted to the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). CLIP-S uses a Transformer model to generate captions given an input image. During training, the model uses CLIP to determine how well the generated caption describes the image; this score is used as a reward signal for reinforcement learning (RL). To improve the grammar of the generated captions, the team fine-tuned CLIP with negative caption examples, which were generated by randomly modifying reference captions. To address the shortcomings of existing image-captioning evaluation methods, the team also developed a new benchmark dataset, FineCapEval, which includes more fine-grained image captions describing image backgrounds and relations between objects. According to the research team,
The reference captions of public datasets often describe only the most prominent objects in the images. This makes models trained to maximize textual similarity with reference captions tend to generate less distinctive captions that ignore the fine detailed aspects of an image that distinguishes it from others.
Many image captioning models are trained on datasets consisting of input images and reference captions; the training objective measures the similarity of the generated caption to the reference caption, using metrics such as BLEU. However, this often results in models that generate generic captions that describe only the prominent objects in the image, ignoring fine details that make the image distinctive.
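As a rough illustration of that conventional objective, the snippet below scores a candidate caption against reference captions with BLEU; NLTK is used here purely for illustration, while published models typically rely on the standard COCO captioning evaluation tooling.

```python
# Scoring a generated caption against reference captions with BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a group of planes parked at an airport terminal".split(),
    "several airplanes sit on the tarmac near a terminal".split(),
]
candidate = "planes parked at an airport".split()

# Higher BLEU rewards n-gram overlap with the references, which tends to
# favor generic captions that mention only the most prominent objects.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```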
To address this problem, the researchers chose OpenAI's CLIP model to measure how accurately a generated caption describes its image. CLIP measures the similarity between an image and a text string; the more closely the text describes the image, the higher the similarity. The team used this similarity score as a reward function, CLIP-S, for RL training of their captioning model.
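The sketch below is an illustration, not the authors' code: it computes a CLIP image-text similarity that could serve as a caption reward, using Hugging Face's CLIP implementation as a stand-in for the paper's setup; the image file name is a placeholder.

```python
# Minimal sketch of a CLIP-based caption reward (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

# "plane.jpg" is a placeholder path for any test image.
image = Image.open("plane.jpg")
print(clip_reward(image, "several rows of planes parked outside a terminal"))
```

In a typical RL setup for captioning, such as self-critical sequence training, the captioner samples a caption, receives this similarity as its reward, and is updated with a policy-gradient step.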
However, the team found that this model often generated grammatically incorrect captions, for example by repeating words: "several rows of planes parked outside a terminal window area with fog outside a terminal window motion position area motion." Their solution was to fine-tune the text-encoder portion of CLIP on negative examples containing randomly repeated, inserted, or shuffled tokens. They also introduced a two-layer perceptron classifier head that detects whether a sentence is grammatically correct, trained jointly with the text-encoder fine-tuning.
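The paper defines the exact corruption operations and how they are sampled; the sketch below only illustrates the general idea of turning a reference caption into a grammatically degraded negative example.

```python
# Illustrative negative-caption generation by corrupting a reference caption.
import random

def make_negative(caption: str, rng: random.Random) -> str:
    tokens = caption.split()
    op = rng.choice(["repeat", "insert", "shuffle"])
    if op == "repeat":
        # Repeat a random token several times at a random position.
        tok = rng.choice(tokens)
        pos = rng.randrange(len(tokens) + 1)
        tokens[pos:pos] = [tok] * rng.randint(2, 4)
    elif op == "insert":
        # Insert tokens drawn from elsewhere in the caption.
        pos = rng.randrange(len(tokens) + 1)
        tokens[pos:pos] = rng.choices(tokens, k=2)
    else:
        # Shuffle the word order.
        rng.shuffle(tokens)
    return " ".join(tokens)

rng = random.Random(0)
print(make_negative("several rows of planes parked outside a terminal", rng))
```

Corrupted captions like these give the fine-tuned text encoder, and the grammar classifier head, explicit examples of degenerate text that should receive low scores.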
The team also created FineCapEval, a new benchmark dataset for evaluating fine-grained image-captioning models. The dataset contains 1,000 images: 500 from the MS COCO test split and 500 from the Conceptual Captions validation split. For each image, five human workers wrote descriptions of four aspects: the image background; the objects in the image, including their shape and color; the relationships among the objects, such as spatial relationships; and an overall detailed caption covering the first three aspects. In total, the dataset contains 5,000 captions for each of the four criteria.
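The released dataset defines its own file format and field names; the record layout below is purely hypothetical and only mirrors the four annotation criteria described above.

```python
# Hypothetical record structure mirroring FineCapEval's four criteria.
from dataclasses import dataclass
from typing import List

@dataclass
class FineCapEvalExample:
    image_id: str          # image from MS COCO test or Conceptual Captions val split
    background: List[str]  # five descriptions of the image background
    objects: List[str]     # objects in the image, including shape and color
    relations: List[str]   # relationships among objects, e.g. spatial relations
    overall: List[str]     # detailed captions covering all three aspects
```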
To evaluate their model, the team compared its captions with those from several baseline models on the COCO dataset. Although a baseline model outperformed CLIP-S on text-based metrics such as BLEU, CLIP-S scored higher on image-text metrics and on text-to-image retrieval metrics. It also "significantly" outperformed the baselines on the team's new FineCapEval benchmark. Finally, human judges "strongly" preferred captions generated by CLIP-S over those generated by the baseline models.
Multimodal image-text AI models are an active research topic. InfoQ recently reported on DeepMind's Flamingo model, which exhibits state-of-the-art few-shot learning capability on several image-text tasks, including image captioning. Last year InfoQ reported on Google's ALIGN model and on Alibaba's M6 model, both of which can perform a variety of image-text tasks.
The CLIP-S code and the FineCapEval dataset are available on GitHub.