University of Chinese Academy of Sciences Open-Sources Multimodal LLM LLaMA-Omni

Researchers at the University of Chinese Academy of Sciences (UCAS) recently open-sourced LLaMA-Omni, an LLM that can operate on both speech and text data. LLaMA-Omni is based on Meta's Llama-3.1-8B-Instruct LLM and outperforms similar baseline models while requiring less training data and compute.

The LLaMA-Omni architecture extends Llama-3 by including a speech encoder at the input and a speech decoder at the output. Compared to other schemes where standalone automatic speech recognition (ASR) and text-to-speech (TTS) modules are used in series with an LLM, this architecture reduces the latency between an input speech prompt and output speech generation. The model is fine-tuned on InstructS2S-200K, a custom dataset created by the UCAS team, which contains 200,000 speech prompts and their expected speech responses. According to the researchers:
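To make the data flow concrete, the snippet below is a minimal, conceptual PyTorch sketch of the pattern described above, not the released implementation: the layer sizes, the downsampling factor, and the randomly initialized stubs standing in for the real Whisper encoder, Llama backbone, and speech-unit vocoder are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Maps speech-encoder features into the LLM's embedding space.

    Conceptual stand-in for the adaptor between the Whisper-style encoder
    and the Llama backbone; the downsampling factor and layer sizes here
    are illustrative, not the paper's exact configuration.
    """
    def __init__(self, enc_dim=1280, llm_dim=4096, downsample=5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                       # feats: (batch, frames, enc_dim)
        b, t, d = feats.shape
        t = t - t % self.downsample                 # drop ragged tail frames
        grouped = feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(grouped)                   # (batch, frames/downsample, llm_dim)

# Randomly initialized stubs standing in for the real components:
# a frozen Whisper encoder, the Llama-3.1-8B-Instruct backbone, and a
# streaming decoder that emits discrete speech units for a vocoder.
speech_encoder = nn.Linear(80, 1280)                # mel frames -> encoder features
adaptor        = SpeechAdaptor()
llm_backbone   = nn.Linear(4096, 4096)              # hidden states in, hidden states out
unit_decoder   = nn.Linear(4096, 1000)              # hidden states -> speech-unit logits

mel = torch.randn(1, 3000, 80)                      # a fake 30-second mel spectrogram
hidden = llm_backbone(adaptor(speech_encoder(mel)))
unit_logits = unit_decoder(hidden)                  # speech units are predicted from the
                                                    # LLM's hidden states, so synthesis can
                                                    # start before the text reply is complete
print(unit_logits.shape)                            # torch.Size([1, 600, 1000])
```

The point of the sketch is the last step: because the speech output is driven by the LLM's hidden states rather than by a finished text transcript, the pipeline avoids a full ASR → LLM → TTS round trip, which is how the architecture keeps latency low compared to a cascaded system.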

Experimental results show that, compared to [baseline] speech-language models, LLaMA-Omni delivers superior responses in both content and style, with a response latency as low as 226ms. Moreover, training LLaMA-Omni requires less than 3 days on 4 GPUs, enabling rapid development of speech interaction models based on the latest LLMs. In the future, we plan to explore enhancing the expressiveness of generated speech responses and improving real-time interaction capabilities.

The research team evaluated LLaMA-Omni's performance on two tasks, speech-to-text instruction-following (S2TIF) and speech-to-speech instruction-following (S2SIF), and compared it to other baseline models, including Qwen2-Audio. The evaluation dataset was a subset of Alpaca-Eval with a total of 199 prompts; the team also fed the text prompts through a TTS system to generate the spoken versions of the prompts.

The team used GPT-4o to automatically score each model's output, judging it on content (whether the output fulfills the user's instruction) and style (whether the output is suited to speech interaction). On the S2TIF task, LLaMA-Omni outperformed the baselines on style, and on the S2SIF task, it outperformed on both content and style.
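This kind of LLM-as-judge scoring can be reproduced with a short script like the one below, assuming the OpenAI Python SDK; the rubric wording, the score_response helper, and the 1-to-5 scale are illustrative assumptions rather than the paper's exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are evaluating a voice assistant's reply.
Instruction: {instruction}
Reply: {reply}

Score the reply on two axes, each from 1 (poor) to 5 (excellent):
- content: does the reply actually fulfill the instruction?
- style: is the reply phrased the way a spoken answer should be
  (concise, conversational, no markdown or long lists)?

Answer as JSON: {{"content": <int>, "style": <int>}}"""

def score_response(instruction: str, reply: str) -> str:
    """Ask GPT-4o to grade one model reply; returns the raw JSON string."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction,
                                                  reply=reply)}],
        temperature=0,
    )
    return completion.choices[0].message.content

print(score_response("Suggest a quick breakfast.",
                     "Try overnight oats: mix oats, milk and fruit the night before."))
```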

In a discussion about LLaMA-Omni on Hacker News, one user pointed out the benefits of an end-to-end model for speech and text versus a cascaded system of standalone components:

Essentially, there's data loss from audio -> text. Sometimes that loss is unimportant, but sometimes it meaningfully improves output quality. However, there are some other potential fringe benefits here: improving the latency of replies, improving speaker diarization, and reacting to pauses better for conversations.

Users on Reddit also commented on the model, especially its use of OpenAI's Whisper model for speech encoding:

[T]heir input approach is similar to how LLaVA added image understanding by training a glue layer for Llama and CLIP. LLaMA-Omni takes whisper as their encoder like LLaVA takes CLIP. Then the embeddings are projected into the feature space of their underlying Llama model. I didn't immediately understand their voice output architecture so I can't comment on that.
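For readers unfamiliar with this "glue layer" pattern, the snippet below shows it in isolation using the Whisper encoder from Hugging Face transformers. The choice of whisper-large-v3, the projector architecture, and its dimensions are illustrative assumptions for this sketch, not LLaMA-Omni's released configuration.

```python
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

# Frozen speech encoder (the audio counterpart of CLIP in the LLaVA analogy).
extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder.eval()

# Trainable glue layer: maps Whisper's feature space (1280 dims for large-v3)
# into the LLM's embedding space (4096 dims for Llama-3.1-8B).
projector = nn.Sequential(nn.Linear(1280, 4096), nn.GELU(), nn.Linear(4096, 4096))

waveform = torch.zeros(16000)  # one second of silence at 16 kHz as a stand-in
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    speech_feats = whisper(inputs.input_features).last_hidden_state

speech_embeds = projector(speech_feats)  # ready to prepend to the Llama prompt embeddings
print(speech_embeds.shape)               # (1, frames, 4096)
```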

The integration of speech I/O into LLMs is a growing trend. Earlier this year, InfoQ covered the release of OpenAI's GPT-4 omni, a version of GPT-4 that is trained end-to-end to handle speech data. InfoQ also covered Alibaba's open-weight Qwen2-Audio, which can handle speech input but only outputs text.

The LLaMA-Omni model files are available on Hugging Face.
