Stability AI released the code and model weights for Stable Video Diffusion (SVD), a video generation AI model. When given an input image as context, the model can generate 25 video frames at a resolution of 576x1024 pixels.
The model is based on Stability's Stable Diffusion text-to-image generation model, with additional video pre-training followed by fine-tuning on a high-quality curated dataset. To perform this additional training, Stability collected a dataset called the Large Video Dataset (LVD), which contains 580M video clips representing 212 years of runtime. While the initial model release only supports image-to-video generation, Stability AI claims it can be adapted for multiple video generation tasks, including text-to-video and multi-view (i.e., 3D object) generation; the company has also opened a waitlist for access to a web-based text-to-video interface. The model license allows use for research purposes only:
While we eagerly update our models with the latest advancements and work to incorporate your feedback, we emphasize that this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release.
Stability AI's general strategy for building SVD was to collect and annotate a large dataset of videos. Starting with raw video, the team first removed motion inconsistencies such as "cuts," as well as videos with no motion at all. They then generated three synthetic captions for each clip, using an image-only captioning model, a video captioning model, and an LLM that combined the two. They also used CLIP embeddings of selected frames to compute aesthetic scores for the video samples.
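The curation pipeline itself is not part of this release, but the CLIP-based scoring step can be sketched in a few lines. The snippet below is a minimal illustration, not Stability's implementation: it embeds a clip's first, middle, and last frames with an off-the-shelf CLIP model and scores them with a placeholder (untrained) linear head standing in for a trained aesthetic predictor; the captioning models and cut-detection tooling are omitted.

```python
# Minimal illustration of CLIP-based frame scoring (not Stability's pipeline).
import torch
from torch import nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
aesthetic_head = nn.Linear(clip.config.projection_dim, 1)  # untrained placeholder

def score_clip(frames: list[Image.Image]) -> float:
    """Return a mean score for a clip from its first, middle, and last frames."""
    key_frames = [frames[0], frames[len(frames) // 2], frames[-1]]
    inputs = processor(images=key_frames, return_tensors="pt")
    with torch.no_grad():
        embeddings = clip.get_image_features(**inputs)  # shape: (3, projection_dim)
        scores = aesthetic_head(embeddings)             # shape: (3, 1)
    return scores.mean().item()
```

In the described pipeline, scores like these are used as filters to keep only high-quality clips for the fine-tuning datasets.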
After training a base video diffusion model on the large dataset, the researchers used smaller curated datasets to fine-tune task-specific models for text-to-video, image-to-video, frame interpolation, and multi-view generation. They also trained LoRA camera-control blocks for the image-to-video model. When evaluated by human judges, the output of the image-to-video model was preferred over that generated by the state-of-the-art commercial products GEN-2 and PikaLabs. The multi-view generation model outperformed the state-of-the-art models Zero123 and SyncDreamer.
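The camera-control blocks follow the general LoRA pattern: a small trainable low-rank update is added alongside a frozen pretrained layer, and only the update is trained. The PyTorch sketch below illustrates that pattern generically; the layer sizes, rank, and scaling are arbitrary and are not taken from the SVD code.

```python
# Generic LoRA adapter around a frozen linear layer (illustration of the
# technique only; not the SVD camera-control implementation).
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank update.
        return self.base(x) + self.scale * self.up(self.down(x))

# Wrap a projection layer; only the adapter's parameters remain trainable.
proj = LoRALinear(nn.Linear(320, 320), rank=8)
trainable = [p for p in proj.parameters() if p.requires_grad]
```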
Emad Mostaque, Stability AI's CEO, wrote about the model's current and future capabilities on X:
It [has] not only camera control via LoRA, you can do explosions & all sorts of effects...We will have blocking, staging, mise en scene, cinematography & all other elements of scene creation & brand new ones...
In a discussion about SVD on Hacker News, one user pointed out shortcomings of this approach:
[A]lthough I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that. However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc. And the scene shall be sent into Blender and you'll click on a button and have an actual rendering made by Blender, with correct lighting.
The Stable Video Diffusion code is available on GitHub, and the model weights are available on Hugging Face.
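For readers who want to try the released weights, the Hugging Face diffusers library provides a StableVideoDiffusionPipeline. A minimal image-to-video example might look roughly like the following; the model ID, frame count, and recommended settings should be checked against the current model card and documentation.

```python
# Rough image-to-video example using the released weights via diffusers
# (verify model IDs and recommended settings on the model card).
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # 25-frame variant
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The conditioning image; the model targets 1024x576 output.
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```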