Sora is OpenAI's new generative AI model for creating videos from text prompts. Currently in preview, the model can create photorealistic videos up to 60 seconds long, leveraging its understanding of how things exist in the real world and combining multiple shots without disrupting characters or visual style.
We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.
According to OpenAI, Sora can build highly detailed scenes, including complex camera motion and multiple characters. From a technical point of view, Sora is a diffusion model. Its starting point is a video that looks like static noise, which is then gradually transformed into the final result by removing the noise step by step.
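The step-by-step noise removal described above can be illustrated with a toy sketch. This is not Sora's code: the `predict_noise` function is a hypothetical stand-in for a trained network (here it simply cheats by knowing the clean signal), the "video" is a short list of numbers, and the linear update schedule is an assumption chosen for simplicity.

```python
import random


def predict_noise(noisy, clean):
    """Hypothetical stand-in for a trained denoising network.

    A real model estimates the noise from the noisy input and the text prompt;
    this toy version computes it directly from the known clean signal.
    """
    return [n - c for n, c in zip(noisy, clean)]


def denoise(clean, steps=10, seed=0):
    """Start from pure Gaussian noise and remove a fraction of it per step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in clean]  # the "static noise" starting point
    history = [list(x)]
    for remaining in range(steps, 0, -1):
        noise = predict_noise(x, clean)
        # Remove 1/remaining of the estimated noise, so the last step removes all of it.
        x = [xi - ni / remaining for xi, ni in zip(x, noise)]
        history.append(list(x))
    return x, history


def error(a, b):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


target = [0.0, 0.25, 0.5, 0.75, 1.0]  # toy "clean video"
result, history = denoise(target)
for step, state in enumerate(history):
    print(step, round(error(state, target), 4))
```

Each printed line shows the remaining distance from the clean signal, which shrinks at every step until the noise is gone.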
We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.
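To make the idea of patches more concrete, here is a minimal sketch that cuts a video into fixed-size spacetime blocks and flattens each into a token-like vector. The article does not describe how Sora extracts patches, so the function names, the patch sizes, and the use of raw pixel values are all assumptions made for illustration.

```python
def patchify(video, pt, ph, pw):
    """Split video[t][y][x] into flattened patches of pt frames x ph rows x pw columns."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be multiples of the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


def make_video(frames, height, width):
    """Build a toy single-channel video whose pixel values encode their position."""
    return [
        [[t * height * width + y * width + x for x in range(width)] for y in range(height)]
        for t in range(frames)
    ]


short_square_clip = make_video(2, 4, 4)  # 2 frames of 4x4 pixels
long_wide_clip = make_video(4, 2, 6)     # 4 frames of 2x6 pixels

square_patches = patchify(short_square_clip, 2, 2, 2)
wide_patches = patchify(long_wide_clip, 2, 2, 2)
print(len(square_patches), len(square_patches[0]))
print(len(wide_patches), len(wide_patches[0]))
```

Both clips yield patches of the same size but sequences of different lengths, which is one way inputs with different durations and aspect ratios could be fed to the same transformer.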
OpenAI highlights a challenging problem it solved in Sora: keeping a subject consistent even when it temporarily goes out of view, and preserving the visual style. Sora achieves this by operating on many frames at a time, which gives the model some ability to know what will happen in advance and plan for it.
OpenAI showed several impressive videos created using Sora, including historical footage of California during the gold rush, a stylish woman walking down a Tokyo street, and golden retrievers playing in the snow. OpenAI admits, however, that some generated videos show physically implausible motion, for example a man walking on a conveyor belt in the wrong direction, or sand that morphs into a chair and moves in counter-intuitive ways.
The new model is not yet open to the general public, as OpenAI is working to improve its safety. This entails, for example, rejecting text prompts that request extreme violence, sexual content, or hateful imagery, or that infringe on third-party IP or celebrity privacy rights. To this end, OpenAI says it is working with experts in areas like misinformation, hateful content, and bias to test the limits of the model.
Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
OpenAI also plans to apply to Sora the safety methods it built for DALL·E 3, and to add C2PA metadata so that AI-generated videos can be identified.
Sora is not the first text-to-video generation AI model to enter the market. Other solutions include Runway, Pika, Stability AI, and Google Lumiere.
As several commentators on Hacker News pointed out, the demo videos produced by OpenAI are "most certainly cherry-picked" to show the model at its best, and results could be very different when trying to create a video from a very specific idea. Additionally, videos created by early adopters appear to be of lower quality and detail. This does not detract, though, from Sora's impressiveness or from the momentum it can generate in the text-to-video generation arena.