Meta Spirit LM Integrates Speech and Text in New Multimodal GenAI Model

Presented in a recent paper, Spirit LM integrates speech and text in a single multimodal model, enabling pipelines that freely mix spoken and written language. According to Meta, their novel approach, based on interleaving text and speech tokens, makes it possible to circumvent the inherent limitations of prior solutions that use distinct pipelines for speech and text.

Meta's new model is based on a 7B pre-trained text-only language model (Llama 2) extended to include speech. To this end, the model is continuously trained on both text and speech units.

Speech and text sequences are concatenated into a single stream of tokens, and the model is trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus.
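
To illustrate the idea, the sketch below shows word-level interleaving on a toy example, assuming each word is aligned to a short run of speech tokens; the [TEXT]/[SPEECH] markers, token names, and switching probability are illustrative assumptions rather than Spirit LM's actual format.

import random

def interleave(words, speech_tokens_per_word, switch_prob=0.3, seed=0):
    """Build one training sequence, randomly switching modality at word boundaries."""
    rng = random.Random(seed)
    in_speech = rng.random() < 0.5
    sequence = ["[SPEECH]" if in_speech else "[TEXT]"]
    for word, speech_tokens in zip(words, speech_tokens_per_word):
        if rng.random() < switch_prob:  # possibly change modality at this word boundary
            in_speech = not in_speech
            sequence.append("[SPEECH]" if in_speech else "[TEXT]")
        sequence.extend(speech_tokens if in_speech else [word])
    return sequence

# Toy aligned corpus: each word paired with its (made-up) speech units
words = ["the", "cat", "sat", "down"]
speech = [["Hu12", "Hu7"], ["Hu3"], ["Hu44", "Hu9"], ["Hu21"]]
print(interleave(words, speech))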

According to Meta, Spirit LM combines the semantic abilities you have come to expect from text LLMs with the expressive abilities of speech models. However, as we will explain later on, Spirit LM's text-only performance is currently slightly inferior to Llama 2's.

The usual approach to extending LLMs to support speech input and output, as explained by Meta's researchers, consists of building a pipeline where speech is transcribed to text using automatic speech recognition (ASR), the text is fed into an LLM, and the LLM's output is finally converted back to speech. This is the approach taken by GPT-4o and Hume's EVI 2, which also claims to be able to generate an emotionally inflected voice. However, say Meta's researchers:

With such pipelines, modeling and generating expressive speech is constrained out of the language model, leading to poor generation from an expressive point of view.
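
For context, such a cascaded pipeline can be sketched as follows; asr_transcribe, llm_generate, and tts_synthesize are hypothetical placeholders for whatever ASR, language model, and TTS components a concrete system would wire together.

# Hypothetical cascaded pipeline: each stage is a separate model, so the LLM
# only ever sees and produces plain text.
def asr_transcribe(audio: bytes) -> str:
    ...  # speech-to-text: expressive cues (tone, emphasis, emotion) are dropped here

def llm_generate(prompt: str) -> str:
    ...  # the language model reasons over the transcript alone

def tts_synthesize(text: str) -> bytes:
    ...  # text-to-speech: expressiveness must be reconstructed from text alone

def cascaded_assistant(user_audio: bytes) -> bytes:
    transcript = asr_transcribe(user_audio)
    reply = llm_generate(transcript)
    return tts_synthesize(reply)

Because each hand-off carries only plain text, any prosodic or emotional information in the user's voice is lost before the language model ever sees it, which is the limitation Spirit LM aims to remove.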

Spirit LM is instead trained on a mix of text-only sequences, speech-only sequences, and interleaved sequences. Speech is converted into tokens that represent phonetic units (HuBERT) as well as pitch and style units. This enables the creation of interleaved training sequences by randomly switching between the text and speech modalities at word boundaries.
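
For the expressive version, one might picture the speech token stream roughly as below; the interleaving ratios and token names are assumptions for illustration only, beyond the article's point that phonetic (HuBERT) units are complemented by coarser pitch and style units.

def tokenize_expressive(hubert_units, pitch_units, style_units):
    """Merge phonetic, pitch, and style unit streams into one token sequence,
    emitting the coarser pitch/style tokens at lower (assumed) rates."""
    tokens = []
    for i, hu in enumerate(hubert_units):
        tokens.append(f"[Hu{hu}]")
        if i % 2 == 0 and i // 2 < len(pitch_units):   # assume pitch at half the phonetic rate
            tokens.append(f"[Pi{pitch_units[i // 2]}]")
        if i % 4 == 0 and i // 4 < len(style_units):   # assume style units are even coarser
            tokens.append(f"[St{style_units[i // 4]}]")
    return tokens

print(tokenize_expressive([12, 7, 7, 33], [3, 5], [1]))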

One of the major findings in Meta's research is that Spirit LM can learn new tasks similarly to text LLMs, and that it is able to preserve the sentiment of text and speech prompts. The latter claim is based on a new benchmark introduced by Meta's researchers, dubbed Speech-Text Sentiment Preservation, which consists of generating a speech or text sequence of tokens and checking whether it preserves the sentiment of the prompt, pre-classified as positive, negative, or neutral.
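
The benchmark's core measurement could be sketched roughly as follows, where generate and classify_sentiment are hypothetical stand-ins for the model under test and a sentiment classifier for the output modality.

def preservation_rate(labeled_prompts, generate, classify_sentiment):
    """labeled_prompts: (prompt, sentiment) pairs, sentiment in {"positive", "negative", "neutral"}.
    Returns the fraction of continuations that keep the prompt's sentiment."""
    preserved = 0
    for prompt, gold_sentiment in labeled_prompts:
        continuation = generate(prompt)                      # speech or text continuation
        if classify_sentiment(continuation) == gold_sentiment:
            preserved += 1
    return preserved / len(labeled_prompts)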

As mentioned, and according to the researchers themselves, Spirit LM does not perform as well as the base Llama 2 model on text prompts, a limitation they hope to address by refining the training. Another step in Spirit LM's evolution is adopting a larger base model, which could lead to further performance improvements.

As a final note, Spirit LM is a foundational model and thus does not include any provisions to make it safe against misuse, such as generating fake news or spam, or impersonating specific speakers. Likewise, Spirit LM is trained only on English and does not cover the variety of accents and dialects of underrepresented groups.

Spirit LM is available in two versions. The base version only uses speech phonetic units (HuBERT), while the expressive version uses pitch and style units as well. The model is available on GitHub along with its weights, but its license only permits non-commercial usage.
