Presented in a recent paper, Meta's Spirit LM integrates speech and text in a single multimodal model, making it possible to build pipelines that mix spoken and written language. According to Meta, their approach, based on interleaving text and speech tokens, circumvents the inherent limitations of prior solutions that use distinct pipelines for speech and text.
Meta's new model extends a pre-trained, text-only 7B language model (Llama 2) to speech. To this end, the model is continually trained on both text tokens and speech units.
Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus.
According to Meta, Spirit LM combines the semantic abilities you have come to expect from text LLMs with the expressive abilities of speech models. However, as we will explain later on, Spirit LM's text-only performance is currently slightly inferior to Llama 2's.
The usual approach to extending LLMs to support speech input and output, Meta's researchers explain, is to build a pipeline where speech is first transcribed to text using automatic speech recognition (ASR), the text is fed into an LLM, and the LLM's output is finally converted back to speech. This is the approach taken by GPT-4o and by Hume's EVI 2, which also claims to be able to generate emotionally inflected voice. However, Meta's researchers say:
With such pipelines, modeling and generating expressive speech is constrained out of the language model, leading to poor generation from an expressive point of view.
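To make the cascade being criticized concrete, here is a minimal sketch of such a pipeline. The three stage functions are placeholders standing in for arbitrary ASR, LLM, and TTS components, not any specific API:

```python
# Hypothetical sketch of the conventional cascaded approach:
# ASR -> text LLM -> TTS. Each stage is a separate model, so any
# prosody or emotion in the input speech is discarded at the ASR
# step and cannot inform the LLM or the synthesized output.

def transcribe(audio: bytes) -> str:
    """Placeholder ASR stage: speech in, plain text out."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder text-only LLM stage."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder TTS stage: text in, speech out."""
    raise NotImplementedError

def cascaded_assistant(audio_in: bytes) -> bytes:
    text_in = transcribe(audio_in)   # expressive cues are dropped here
    text_out = generate(text_in)     # the LLM sees only the words
    return synthesize(text_out)      # output prosody is generic
```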
Spirit LM is instead trained on a mix of text-only sequences, speech-only sequences, and interleaved speech-text sequences. Speech is converted into tokens representing phonetic units (derived from HuBERT) as well as pitch and style units. Interleaved training sequences are then created by randomly switching between the text and speech modalities at word boundaries.
%IMAGE 1%
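The interleaving scheme can be sketched as follows. It assumes each word of the parallel corpus is available both as text tokens and as aligned speech units; the `[TEXT]` and `[SPEECH]` modality markers follow the paper, while the data layout, helper names, and switching probability below are illustrative assumptions, not Meta's implementation:

```python
import random

def interleave(words, p_switch=0.3, seed=0):
    """Build one interleaved training stream, switching modality at
    word boundaries.

    words: list of (text_tokens, speech_units) pairs, one per word,
           taken from a word-aligned speech-text parallel corpus.
    """
    rng = random.Random(seed)
    modality = rng.choice(["text", "speech"])
    stream = ["[TEXT]" if modality == "text" else "[SPEECH]"]
    for text_tokens, speech_units in words:
        if rng.random() < p_switch:  # flip modality at this word boundary
            modality = "speech" if modality == "text" else "text"
            stream.append("[TEXT]" if modality == "text" else "[SPEECH]")
        stream.extend(text_tokens if modality == "text" else speech_units)
    return stream

# Example: two words, each as text tokens and (made-up) HuBERT units.
corpus = [(["he", "llo"], ["[Hu12]", "[Hu7]", "[Hu99]"]),
          (["world"], ["[Hu3]", "[Hu41]"])]
print(interleave(corpus))
```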
One of the major findings of Meta's research is that Spirit LM can learn new tasks, similarly to text LLMs, and is able to preserve the sentiment of text and speech prompts. The latter claim is based on a new benchmark Meta's researchers introduced, dubbed Speech-Text Sentiment Preservation, which consists of generating a speech or text sequence of tokens from a prompt pre-classified as expressing positive, negative, or neutral sentiment, and checking whether the generated sequence preserves that sentiment.
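The benchmark's logic is simple to sketch: classify the sentiment of each generated continuation and count it as preserved when it matches the prompt's pre-assigned label. The `continue_in` and `classify_sentiment` names below are hypothetical placeholders, not the paper's actual implementation:

```python
# Hedged sketch of the Speech-Text Sentiment Preservation idea:
# generate a continuation in the target modality, classify its
# sentiment, and report the fraction of prompts whose sentiment
# was preserved.

LABELS = {"positive", "negative", "neutral"}

def sentiment_preservation_rate(model, prompts, classify_sentiment,
                                target_modality="speech"):
    """prompts: list of (prompt, label) pairs, label in LABELS."""
    preserved = 0
    for prompt, label in prompts:
        assert label in LABELS
        continuation = model.continue_in(target_modality, prompt)
        if classify_sentiment(continuation) == label:
            preserved += 1
    return preserved / len(prompts)
```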
As mentioned, the researchers themselves note that Spirit LM does not perform as well as the base Llama 2 model on text prompts, a limitation they hope to address by refining the training. Another avenue of improvement for Spirit LM is adopting a larger base model, which could lead to further performance gains.
As a final note, Spirit LM is a foundational model and thus does not include any provisions to make it safe against misuse, such as generating fake news, spam, or impersonations of specific speakers. Likewise, Spirit LM is trained only on English and does not cover the variety of accents and dialects of underrepresented groups.
Spirit LM is available in two versions: the base version uses only speech phonetic units (HuBERT), while the expressive version additionally uses pitch and style units. The model is available on GitHub along with its weights, but its license permits non-commercial usage only.