Stability AI recently released Stable Video 3D (SV3D), an AI model that can generate 3D-mesh object models from a single 2D image. SV3D is based on the Stable Video Diffusion model and produces state-of-the-art results on 3D object generation benchmarks.
SV3D addresses the problem of Novel View Synthesis (NVS): generating unseen views of an object from one or more 2D images of it, for example, generating a view of the back of an object given an image of its front. Stability AI leveraged their existing Stable Video Diffusion model, whose camera-control abilities allow it to generate orbital videos, in which the camera circles the object of interest. This model was fine-tuned on a dataset rendered from 3D objects in the Objaverse dataset. When evaluated on the GSO and OmniObject3D benchmarks, SV3D outperformed baseline models and achieved new state-of-the-art performance. According to Stability AI:
Stable Video 3D introduces significant advancements in 3D generation, particularly in novel view synthesis. Unlike previous approaches that often grapple with limited perspectives and inconsistencies in outputs, Stable Video 3D is able to deliver coherent views from any given angle with proficient generalization. This capability not only enhances pose-controllability, but also ensures consistent object appearance across multiple views, further improving critical aspects of realistic and accurate 3D generations.
InfoQ covered SV3D's underlying technology, Stable Video Diffusion (SVD), when it was released in 2023. Stability AI also released an earlier attempt at 3D generation in late 2023: Stable Zero123, which was based on their Stable Diffusion 1.5 text-to-image model. That effort was inspired by the open-source Zero123 3D generation model created by the Allen Institute for AI (AI2) as part of the Objaverse project.
One shortcoming of the Zero123 and Stable Zero123 approach is that those models "are not designed to be multi-view consistent," since they generate novel views only one frame at a time and thus lack the "most critical requirement" of 3D generation. By contrast, the SVD model is explicitly trained to generate consistent multi-frame videos. An additional advantage is its camera control, which can produce more than simple orbital videos. According to Stability AI, "To the best of our knowledge, SV3D is the first video diffusion-based framework for controllable multi-view synthesis at 576x576 resolution."
To train SV3D, Stability AI took the objects in Objaverse and rendered 21 frames of each object from different camera angles. They trained three versions of SV3D: one trained on static orbits only, one on dynamic orbits only, and a third on both static and dynamic orbits. The third model achieved better evaluation metrics than the other two.
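The difference between the two orbit types can be sketched as camera trajectories: a static orbit keeps the camera at a fixed elevation while its azimuth sweeps a full circle, whereas a dynamic orbit also varies the elevation as the camera moves. The sketch below is illustrative only; the sinusoidal elevation schedule and parameter values are assumptions, not SV3D's exact rendering setup.

```python
import math

def orbit_trajectory(num_frames=21, base_elevation=10.0, elev_amplitude=0.0):
    """Generate (azimuth, elevation) camera poses in degrees for one orbit.

    A static orbit keeps elevation fixed (elev_amplitude=0); a dynamic
    orbit varies elevation as the camera circles the object. The sinusoidal
    variation here is an illustrative assumption, not SV3D's exact scheme.
    """
    poses = []
    for i in range(num_frames):
        azimuth = 360.0 * i / num_frames  # evenly spaced around the object
        elevation = base_elevation + elev_amplitude * math.sin(2 * math.pi * i / num_frames)
        poses.append((azimuth, elevation))
    return poses

static_orbit = orbit_trajectory()                      # fixed elevation, 21 frames
dynamic_orbit = orbit_trajectory(elev_amplitude=15.0)  # elevation varies per frame
```

Training on both trajectory types exposes the model to a wider range of camera motions, which is consistent with the combined model scoring best in the evaluations.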
In a Hacker News thread about SV3D, users discussed possible applications of the model. One user wrote:
If the animations shown are representative, then the mesh output may very well be good enough to use in a 3d printer. Looking forward to experimenting with this.
The SV3D model weights are available on Hugging Face for non-commercial use only. The model is available for commercial use via Stability AI's membership program.