Researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab have open-sourced InstructPix2Pix, a deep-learning model that follows human instructions to edit images. InstructPix2Pix was trained on synthetic data and outperforms a baseline AI image-editing model.
The BAIR team presented their work at the recent IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023. They first generated a synthetic training dataset whose examples pair two images with an editing instruction for converting the first image into the second. This dataset was used to train an image-generation diffusion model. The result is a model that accepts a source image and a text instruction describing the desired edit; for example, given an image of a person riding a horse and the prompt "Have her ride a dragon," it will output the original image with the horse replaced by a dragon. According to the BAIR researchers:
Despite being trained entirely on synthetic examples, our model achieves zero-shot generalization to both arbitrary real images and natural human-written instructions. Our model enables intuitive image editing that can follow human instructions to perform a diverse collection of edits: replacing objects, changing the style of an image, changing the setting, the artistic medium, among others.
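The released model can be run through the Hugging Face diffusers library. The following is a minimal sketch, assuming the StableDiffusionInstructPix2PixPipeline class and the timbrooks/instruct-pix2pix checkpoint; the file names and parameter values are illustrative rather than prescribed by the authors.

```python
# Minimal sketch: editing an image with InstructPix2Pix via diffusers.
# File names below are hypothetical; parameter values are illustrative.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load the released checkpoint (model id assumed: timbrooks/instruct-pix2pix)
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Source image plus a natural-language editing instruction
image = load_image("person_riding_horse.png")  # hypothetical local file
edited = pipe(
    "Have her ride a dragon",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # higher values stay closer to the input image
).images[0]
edited.save("person_riding_dragon.png")
```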
Earlier efforts in AI for image editing have often been based on style transfer, and popular text-to-image generation models such as DALL-E and Stable Diffusion also support image-to-image style transfer operations; however, targeted editing with these models is challenging. More recently, InfoQ covered Microsoft's Visual ChatGPT, which can invoke external tools for editing images, given a textual description of the desired edit.
To train InstructPix2Pix, BAIR first created a synthetic dataset. To do this, the team fine-tuned GPT-3 on a small dataset of human-written examples, each consisting of an input caption, an editing instruction, and a desired output caption. This fine-tuned model was then given a large dataset of input image captions, from which it generated over 450k editing instructions and output captions. The team then applied the Prompt-to-Prompt technique with a pre-trained text-to-image diffusion model to turn each caption pair into a pair of similar images.
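The paper does not package this pipeline as a single script; the Python sketch below only illustrates the structure of the data-generation loop described above. The two generation functions and the caption file are hypothetical placeholders, not the authors' code.

```python
# Schematic sketch of the synthetic-data pipeline; the generation
# functions and the caption file are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class EditExample:
    input_caption: str   # e.g. "photograph of a girl riding a horse"
    instruction: str     # e.g. "have her ride a dragon"
    output_caption: str  # e.g. "photograph of a girl riding a dragon"

def generate_edit(input_caption: str) -> EditExample:
    """Hypothetical call to the fine-tuned GPT-3 model that proposes an
    instruction and an edited caption for a given input caption."""
    raise NotImplementedError

def generate_image_pair(example: EditExample):
    """Hypothetical call to a text-to-image diffusion model using
    Prompt-to-Prompt, producing a before/after image pair."""
    raise NotImplementedError

dataset = []
for line in open("input_captions.txt"):          # assumed caption list
    example = generate_edit(line.strip())
    before, after = generate_image_pair(example)
    dataset.append((before, after, example.instruction))  # training triple
```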
InstructPix2Pix Architecture. Image Source: https://arxiv.org/abs/2211.09800
Given this dataset, the researchers trained InstructPix2Pix, which is based on Stable Diffusion. To evaluate its performance, the team compared its output with a baseline model, SDEdit. They evaluated the tradeoff between two metrics: consistency, the cosine similarity between the CLIP embeddings of the input image and the edited image; and directional similarity, which measures how well the change between the input and edited captions agrees with the change between the input and edited images in CLIP space. In experiments, for a given value of directional similarity, InstructPix2Pix produced more consistent images than SDEdit did.
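Both metrics can be approximated with an off-the-shelf CLIP model. The sketch below assumes the Hugging Face transformers CLIP implementation and the openai/clip-vit-large-patch14 checkpoint; it is an illustration of the metrics, not the authors' evaluation code.

```python
# Sketch of the two CLIP-based metrics; checkpoint choice and variable
# names are assumptions for illustration.
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(img):
    inputs = processor(images=img, return_tensors="pt")
    return F.normalize(model.get_image_features(**inputs), dim=-1)

def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    return F.normalize(model.get_text_features(**inputs), dim=-1)

def consistency(source_img, edited_img):
    # cosine similarity between the CLIP embeddings of the two images
    return F.cosine_similarity(embed_image(source_img),
                               embed_image(edited_img)).item()

def directional_similarity(source_img, edited_img,
                           source_caption, edited_caption):
    # cosine similarity between the change in image embeddings and the
    # change in caption embeddings
    d_img = embed_image(edited_img) - embed_image(source_img)
    d_txt = embed_text(edited_caption) - embed_text(source_caption)
    return F.cosine_similarity(d_img, d_txt).item()
```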
In his deep-learning newsletter The Batch, AI researcher Andrew Ng commented on InstructPix2Pix:
This work simplifies — and provides more coherent results when — revising both generated and human-made images. Clever use of pre-existing models enabled the authors to train their model on a new task using a relatively small number of human-labeled examples.
The InstructPix2Pix code is available on GitHub. The model and a web-based demo are available on Hugging Face.