Google researchers have introduced MusicLM, an AI model that can generate high-fidelity music from text. MusicLM generates music at 24 kHz that remains consistent over several minutes by casting conditional music generation as a hierarchical sequence-to-sequence modeling problem.
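To make that hierarchy concrete, here is a minimal sketch of a staged pipeline in the same spirit: coarse tokens capturing long-term structure are generated first, then expanded into fine tokens that a codec decodes to audio. Every name, vocabulary size, and helper below is a hypothetical toy stand-in, not MusicLM's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

SEMANTIC_VOCAB = 1024   # coarse tokens for long-term structure (assumed size)
ACOUSTIC_VOCAB = 4096   # fine tokens for audio detail (assumed size)
SAMPLE_RATE = 24_000    # MusicLM generates audio at 24 kHz

def embed_text(prompt: str) -> np.ndarray:
    """Hypothetical text encoder: maps a prompt to a conditioning vector."""
    local = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return local.normal(size=128)

def semantic_stage(cond: np.ndarray, n_tokens: int) -> np.ndarray:
    """Stage 1: sample coarse 'semantic' tokens given the text conditioning.
    A real model would be an autoregressive transformer; this is a toy prior."""
    return rng.integers(0, SEMANTIC_VOCAB, size=n_tokens)

def acoustic_stage(semantic: np.ndarray, per_token: int = 4) -> np.ndarray:
    """Stage 2: expand each coarse token into several fine 'acoustic' tokens."""
    return rng.integers(0, ACOUSTIC_VOCAB, size=semantic.size * per_token)

def decode_to_audio(acoustic: np.ndarray, samples_per_token: int = 480) -> np.ndarray:
    """Stand-in for a neural codec decoder mapping tokens to a 24 kHz waveform."""
    return rng.normal(scale=0.1, size=acoustic.size * samples_per_token)

cond = embed_text("melodic techno with a driving bassline")
semantic = semantic_stage(cond, n_tokens=250)   # long-horizon structure first
acoustic = acoustic_stage(semantic)             # then fine acoustic detail
audio = decode_to_audio(acoustic)
print(f"{audio.size / SAMPLE_RATE:.1f} seconds of audio at {SAMPLE_RATE} Hz")
```

Separating a slow-moving structural stage from a fine-grained acoustic stage is what lets a hierarchical model stay coherent over minutes while still producing high-fidelity detail.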
According to the research paper, MusicLM was trained on a dataset of 280,000 hours of music to learn to generate coherent songs from complex descriptions. The researchers also claim the model outperforms previous systems in both audio quality and adherence to the text description.
The MusicLM samples include five-minute pieces generated from just one or two words, such as "melodic techno," as well as 30-second samples that sound like complete songs, formed from paragraph-long descriptions that prescribe a genre, vibe, and even specific instruments.
MusicLM is also capable of transforming a sequence of written descriptions into a musical story or narrative, and of building on existing melodies, whether they are whistled, hummed, sung, or played on an instrument.
AI-generated music has a long history and has been credited with composing hit songs and enhancing live performances. In a more recent approach, written prompts are converted into spectrograms, and then into music, using the AI image-generation engine Stable Diffusion.
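As a rough sketch of how that spectrogram route can work in practice, assuming Hugging Face's diffusers library and a Stable Diffusion checkpoint fine-tuned on spectrogram images ("riffusion/riffusion-model-v1" is one publicly available example); the dB scaling and STFT parameters below are assumptions for illustration, not the exact conventions of any particular checkpoint:

```python
import numpy as np
import torch
import torchaudio
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint fine-tuned to emit spectrogram images.
pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")

image = pipe("funk bassline with a jazzy saxophone solo",
             num_inference_steps=50).images[0]

# Treat the image as a magnitude spectrogram: grayscale intensity maps to
# spectral energy. The 80 dB range assumed here is a guess, and we flip the
# rows so low frequencies sit at index 0.
spec_db = np.array(image.convert("L")).astype(np.float32)[::-1, :]
magnitude = torch.from_numpy(10 ** ((spec_db / 255.0 * 80.0 - 80.0) / 20.0))

# Griffin-Lim iteratively estimates the phase the image does not encode.
n_fft = 2 * (magnitude.shape[0] - 1)
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, hop_length=n_fft // 4, power=1.0
)
waveform = griffin_lim(magnitude)
torchaudio.save("clip.wav", waveform.unsqueeze(0), sample_rate=44_100)
```

Because a spectrogram image discards phase, the sketch falls back on Griffin-Lim to estimate it, which is part of why this route tends to sound lo-fi compared with models that generate audio representations directly.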
Unlike text-to-image machine learning, where large datasets are credited with driving recent advances (Stable Diffusion and OpenAI's DALL-E, for instance, have both sparked a surge of interest from the general public), AI music generation is hampered by a scarcity of paired audio and text data. The fact that music is structured along a temporal dimension presents another difficulty: conveying the intent of a music track with simple text is far harder than describing a still image.
As it has been with prior excursions into this form of AI, Google is being more cautious with MusicLM than some of its competitors may be with comparable technology. The paper ends with the statement, "We have no plans to release models at this point."