Meta AI recently announced Make-A-Video, a text-to-video generation AI model. Make-A-Video is trained using publicly available image-text pairs and video-only data and achieves state-of-the-art performance on the UCF-101 video-generation benchmark.
The model and a set of experiments were described in a paper published on arXiv. Unlike some other text-to-video (T2V) models, Make-A-Video does not require a dataset of text-video pairs. Instead, it builds on existing text-image pair models, which generate single-frame images from a text description. The generated images are then expanded in both the spatial and temporal dimensions by an additional series of neural network layers. Make-A-Video outperformed other T2V models both on automatic benchmarks and in evaluations by human judges. According to Meta,
We want to be thoughtful about how we build new generative AI systems like this. Make-A-Video uses publicly available datasets, which adds an extra level of transparency to the research. We are openly sharing this generative AI research and results with the community for their feedback, and will continue to use our responsible AI framework to refine and evolve our approach to this emerging technology.
InfoQ has covered the release of several text-to-image (T2I) AI models, including DALL-E, Imagen, and Stable Diffusion. Given a text description, these models can generate photorealistic images of the scene described in the text input. They are trained on large datasets of text-image pairs, usually scraped from the internet. The obvious extension of this idea to video generation is to collect a large dataset of text-video pairs for training.
The Meta researchers instead used a pre-trained encoder based on CLIP, which they call the prior, as the base of their model. The prior converts the input text into an image embedding, and a decoder then converts that embedding into a sequence of 16 video frames at 64x64 pixel resolution. The team then used unsupervised learning on video data without text labels to train additional networks that upsample the generated frames to a higher frame rate and resolution.
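The paper does not include reference code, but the overall generation pipeline can be pictured roughly as in the following Python sketch. The function names, the stand-in modules, and the upsampling factors are illustrative assumptions; only the 16-frame, 64x64 decoder output is taken from the paper, and nothing here reflects Meta's actual implementation.

```python
# Hypothetical sketch of a Make-A-Video-style pipeline: text -> image embedding
# -> low-resolution video -> frame interpolation -> spatial super-resolution.
# All stages are placeholder stubs producing tensors of plausible shapes.
import torch
import torch.nn.functional as F


def text_to_image_embedding(prompt: str, dim: int = 768) -> torch.Tensor:
    """Stand-in for the CLIP-based prior: maps a text prompt to an image embedding."""
    return torch.randn(1, dim)  # placeholder embedding, not a real encoder


def decode_low_res_video(image_emb: torch.Tensor, frames: int = 16, size: int = 64) -> torch.Tensor:
    """Stand-in for the decoder: produces 16 frames at 64x64 pixels from the embedding."""
    batch = image_emb.shape[0]
    return torch.rand(batch, frames, 3, size, size)  # (B, T, C, H, W), values in [0, 1]


def interpolate_frames(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stand-in for the frame-interpolation network: increases the frame rate."""
    b, t, c, h, w = video.shape
    # Fold pixels into channels so the time axis can be linearly upsampled.
    flat = video.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, t)
    up = F.interpolate(flat, size=t * factor, mode="linear", align_corners=False)
    return up.reshape(b, c, h, w, t * factor).permute(0, 4, 1, 2, 3)


def super_resolve(video: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Stand-in for the spatial super-resolution network: increases pixel resolution."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    up = F.interpolate(frames, scale_factor=scale, mode="bilinear", align_corners=False)
    return up.reshape(b, t, c, h * scale, w * scale)


if __name__ == "__main__":
    emb = text_to_image_embedding("a dog wearing a superhero cape flying through the sky")
    low_res = decode_low_res_video(emb)   # (1, 16, 3, 64, 64)
    smooth = interpolate_frames(low_res)  # (1, 64, 3, 64, 64)
    final = super_resolve(smooth)         # (1, 64, 3, 256, 256)
    print(final.shape)
```

The key design point the sketch illustrates is that only the first two stages depend on text, which is why the frame-interpolation and super-resolution stages can be trained on unlabeled video alone.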
To evaluate the model, the researchers measured its zero-shot performance on two video-understanding benchmarks: UCF-101 and MSR-VTT. On both benchmarks, Make-A-Video outperformed Tsinghua University's CogVideo. When fine-tuned on UCF-101, Make-A-Video set a new state-of-the-art result. The team also asked human judges to compare the outputs of Make-A-Video and CogVideo, using prompts from the DrawBench benchmark as well as a new benchmark set Meta developed for its experiments. The judges rated Make-A-Video's output as having more realistic motion a majority of the time.
As with the recent T2I models, Make-A-Video has been greeted with a mix of amazement and apprehension. In a Hacker News thread about the model, one user wrote:
As an owner of a video production studio, this kind of tech is blowing my mind and makes me equally excited and scared. I can see how we could incorporate such tools in our workflows, and at the same time I'm worried it'll be used to spam the internet with thousands and thousands of soulless generated videos, making it even harder to look through the noise.
Several other research organizations have developed similar T2V systems recently, many of them based on T2I models. Earlier this year, Tsinghua University released CogVideo, claimed to be the "first open-source large-scale pretrained text-to-video model." More recently, Google released two T2V systems: Imagen Video, which is based on the Imagen T2I model, and Phenaki, which can produce videos lasting several minutes and can handle prompts that change over time.