OpenAI has trained a 12B-parameter AI model based on GPT-3 that can generate images from textual descriptions. The descriptions can specify many independent attributes, including the position of objects and image perspective, and the model can also synthesize combinations of objects that do not exist in the real world.
Researcher Aditya Ramesh gave an overview of the system and its capabilities in a recent blog post. The model is based on the Transformer architecture used in GPT-3; unlike GPT-3, however, the model input includes image pixels as well as text. It is able to produce realistic-looking images based on short captions that specify multiple objects, their colors, textures, and respective positions, and other contextual details such as lighting or camera angle. The model also exhibits behavior that its designers did not anticipate, including the ability to perform image-to-image transfer tasks such as style transfer. OpenAI named the model "DALL·E," a mashup of Pixar's robot WALL·E and the artist Salvador Dalí, perhaps because of its ability to produce images from surreal combinations of objects; for example, "an armchair in the shape of an avocado."
Source: https://openai.com/blog/dall-e/
Many popular deep-learning models for image generation use a generative adversarial network (GAN) architecture. In 2018, researchers at NVIDIA created the StyleGAN model, which generates photorealistic images of human faces; this led to the creation of a popular website that serves high-resolution photos of people who do not exist, as well as a number of variations. In 2020, OpenAI released Image GPT (iGPT), a Transformer-based model that operates on sequences of pixels instead of sequences of text. OpenAI found that, just as GPT models for text could generate realistic samples of natural language, iGPT could "generate coherent image completions and samples," given an input of initial pixels.
OpenAI also recently released CLIP, another deep-learning model that combines GPT's natural language capabilities with computer vision. CLIP is pre-trained on a dataset of images paired with text scraped from the internet and can perform several different visual classification tasks via zero-shot transfer learning. For example, CLIP can match the performance of the original ResNet50 model on the ImageNet benchmark, without being trained on any of the ImageNet images. CLIP also performs well on the ImageNet-Adversarial benchmark, scoring 77% accuracy; by contrast, ResNet50 achieves only 2.7%.
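Zero-shot classification with CLIP works by comparing an image against a set of candidate captions and picking the best match. The sketch below follows the usage example in OpenAI's open-sourced CLIP repository; it assumes the clip package is installed, and the image filename and candidate labels are illustrative placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative inputs: a local image file and a few candidate captions.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of an armchair"]).to(device)

with torch.no_grad():
    # Similarity scores between the image and each caption.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # the highest probability is the predicted label
```

Because the candidate captions are supplied at inference time, the same pre-trained model can be pointed at new classification tasks without any task-specific training.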
DALL·E is a Transformer model that is given an input consisting of 256 text tokens and 1024 image tokens. The model contains 64 self-attention layers with a total of 12B parameters. DALL·E generates output images autoregressively, and OpenAI uses CLIP to rank the quality of the generated images. While OpenAI's blog includes several sample images and the ability to interactively generate new images by changing some of the words in the input description, the company has not published the complete details of the system, nor released the code or a pre-trained model. The blog notes that the company intends to provide more details about the model architecture and training, and plans to analyze "the longer term ethical challenges implied by this technology."
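Since the code has not been released, the generation procedure can only be sketched conceptually: the model repeatedly predicts the next image token conditioned on the 256-token text prefix and the image tokens sampled so far, and CLIP then reranks the decoded candidates. The Python sketch below is an illustration of that loop under those assumptions; the model interface and helper names are hypothetical, not OpenAI's implementation.

```python
import torch

TEXT_LEN, IMAGE_LEN = 256, 1024  # input layout described in the blog post

def generate_candidates(model, text_tokens, num_candidates=8, temperature=1.0):
    """Autoregressively sample image tokens conditioned on a text prefix.

    `model` is a hypothetical Transformer that maps a token sequence to
    next-token logits; this is a conceptual sketch, not OpenAI's code.
    """
    candidates = []
    for _ in range(num_candidates):
        tokens = list(text_tokens)                      # 256 text tokens
        for _ in range(IMAGE_LEN):                      # sample 1024 image tokens
            logits = model(torch.tensor([tokens]))[0, -1]
            probs = torch.softmax(logits / temperature, dim=-1)
            tokens.append(torch.multinomial(probs, 1).item())
        candidates.append(tokens[TEXT_LEN:])            # keep only the image tokens
    return candidates

# Each candidate would then be decoded to pixels and scored with CLIP against
# the original caption, keeping the highest-ranked images.
```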
Other prominent AI research organizations have also recently applied Transformer models to computer vision. In 2019, Microsoft published a paper on UNiversal Image-TExt Representation Learning (UNITER), which is based on a Transformer architecture and achieves state-of-the-art performance on several vision-and-language tasks, including visual question answering (VQA) and image-text retrieval. In 2020, the Allen Institute for AI published a paper on X-LXMERT, a model that performs both VQA and image generation.
OpenAI's code and models for iGPT and CLIP are available on GitHub. Although DALL·E has not been released, AI researchers at EleutherAI have open-sourced their code for a similar system.