Harmonai, the audio research lab of Stability AI, has released Stable Audio, a diffusion model for text-controlled audio generation. Stable Audio is trained on 19,500 hours of audio data and can generate 44.1 kHz audio in real time using a single NVIDIA A100 GPU.
Like Stability AI's image generation model Stable Diffusion, Stable Audio takes as input a user's text prompt describing the desired output, and a U-Net-based diffusion model forms the core of the system. Besides the text prompt, users can also specify the desired output length in seconds. The model can generate the sound of single instruments, a full ensemble, or more ambient sounds such as crowd noise. According to Stability AI:
Stable Audio represents the cutting-edge audio generation research by Stability AI’s generative audio research lab, Harmonai. We continue to improve our model architectures, datasets, and training procedures to improve output quality, controllability, inference speed, and output length.
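The user-facing interface is therefore small: a text prompt plus a duration in seconds go in, and a stereo waveform comes out. The sketch below is purely illustrative and does not reflect an actual Stable Audio API; the function name, signature, and placeholder body are hypothetical, chosen only to make the inputs and output concrete.

```python
import numpy as np

SAMPLE_RATE = 44_100  # Stable Audio outputs 44.1 kHz audio


def generate_audio(prompt: str, seconds: float) -> np.ndarray:
    """Hypothetical front door to a text-to-audio model.

    Returns a stereo waveform shaped (2, seconds * SAMPLE_RATE).
    The body is a placeholder; the real model runs a latent
    diffusion process conditioned on the prompt and duration.
    """
    num_samples = int(seconds * SAMPLE_RATE)
    return np.zeros((2, num_samples), dtype=np.float32)  # silent stand-in


# The kind of request the hosted service accepts:
clip = generate_audio("ambient crowd noise in a stadium", seconds=30)
print(clip.shape)  # (2, 1323000)
```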
Recent advances in generative AI for text and images have also spurred the development of music-generating models. OpenAI's MuseNet, which is based on GPT-2, generates a sequence of MIDI notes that can be converted to sound using MIDI synthesizer software. This year, InfoQ covered Google's MusicLM and Meta's MusicGen models, which operate similarly to autoregressive language models but output "audio tokens" instead of text tokens. In 2022, several diffusion-based music generation models appeared, including Dance Diffusion, an earlier effort by Harmonai; and Riffusion, a project that uses a fine-tuned version of Stable Diffusion to generate spectrogram images which are converted to sound using classic digital signal processing techniques.
Stable Audio uses a pre-trained model called CLAP to map the user's text prompt into an embedding space shared with musical features, similar to the way OpenAI's CLIP is used in Stable Diffusion. These feature vectors, along with embeddings for the desired output length and a noise vector, are fed into the 907M-parameter denoising U-Net, which is based on a system called Moûsai. The U-Net outputs a latent-space representation of the generated sound, which is then converted to audio by a variational autoencoder (VAE) called Descript Audio Codec.
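A minimal sketch of that data flow, with every component stubbed out (the CLAP encoder, U-Net, and Descript Audio Codec weights are not publicly available), might look like the following. The shapes, step count, and Euler-style update are illustrative assumptions, not the actual Moûsai-based sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_CHANNELS, LATENT_LENGTH = 64, 1024  # assumed latent shape, for illustration only


def clap_text_embedding(prompt: str) -> np.ndarray:
    """Stand-in for the CLAP text encoder: prompt -> shared text/audio embedding space."""
    return rng.standard_normal(512).astype(np.float32)


def timing_embedding(seconds_total: float) -> np.ndarray:
    """Stand-in for the embedding of the requested output length."""
    return np.full(16, seconds_total, dtype=np.float32)


def denoising_unet(latent: np.ndarray, t: float, cond: np.ndarray) -> np.ndarray:
    """Stand-in for the Moûsai-based denoising U-Net: predicts the noise in `latent`."""
    return 0.1 * latent  # placeholder "noise estimate"


def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the Descript Audio Codec decoder: latent -> 44.1 kHz stereo waveform."""
    return rng.standard_normal((2, LATENT_LENGTH * 64)).astype(np.float32)


def generate(prompt: str, seconds: float, steps: int = 50) -> np.ndarray:
    # Conditioning: text embedding plus output-length embedding.
    cond = np.concatenate([clap_text_embedding(prompt), timing_embedding(seconds)])
    # Start from pure noise in latent space.
    latent = rng.standard_normal((LATENT_CHANNELS, LATENT_LENGTH)).astype(np.float32)
    for i in range(steps):                        # iterative denoising loop
        t = 1.0 - i / steps                       # diffusion "time" from 1 down to 0
        noise_estimate = denoising_unet(latent, t, cond)
        latent = latent - noise_estimate / steps  # simplistic Euler-style update
    return vae_decode(latent)                     # decode the latent to audio


audio = generate("solo cello, slow and melancholic", seconds=30)
print(audio.shape)
```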
Several users on X (formerly Twitter) commented on the release of Stable Audio. Ezra Sandzer-Bell, founder of AudioCipher, linked to a:
Detailed guide on how to use Stable Audio, including text-to-music prompting tips.... We've identified and summarized some of the most important Terms of Service, to help you stay out of trouble.
Stability AI CEO Emad Mostaque wrote:
This is the first commercially licensed music model and platform, amazing work by team. This is still in the experimental phase but expect it to advance rapidly so you can create any audio you can imagine, plus integrate your own data and more.
Although Stable Audio is not currently open-source, Harmonai says it will release "open-source models based on Stable Audio" as well as code for training custom models. The Harmonai GitHub account contains a fork of the Moûsai repository. The Stable Audio website allows users to sign up for a free tier, which gives them up to 20 generations per month with a non-commercial use restriction.