Researchers at the University of Chinese Academy of Sciences (UCAS) recently open-sourced LLaMA-Omni, an LLM that can operate on both speech and text data. LLaMA-Omni is based on Meta's Llama-3.1-8B-Instruct LLM and outperforms similar baseline models while requiring less training data and compute.
The LLaMA-Omni architecture extends Llama-3.1 with a speech encoder at the input and a speech decoder at the output. Compared with cascaded schemes that chain standalone automatic speech recognition (ASR) and text-to-speech (TTS) modules around an LLM, this design reduces the latency between a spoken prompt and the generated speech response. The model is fine-tuned on InstructS2S-200K, a custom dataset created by the UCAS team containing 200,000 speech instructions paired with corresponding speech responses. A rough sketch of this architecture appears after the quote below. According to the researchers:
Experimental results show that, compared to [baseline] speech-language models, LLaMA-Omni delivers superior responses in both content and style, with a response latency as low as 226ms. Moreover, training LLaMA-Omni requires less than 3 days on 4 GPUs, enabling rapid development of speech interaction models based on the latest LLMs. In the future, we plan to explore enhancing the expressiveness of generated speech responses and improving real-time interaction capabilities.
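Conceptually, the speech encoder's features are projected into the LLM's embedding space so the spoken prompt acts as "soft tokens," and a small head on top of the LLM's hidden states predicts discrete units for a vocoder. The PyTorch sketch below is only a rough illustration of that idea, not the authors' implementation; the class, dimensions, and module names are assumptions.

```python
# Conceptual sketch of a direct speech-to-speech LLM (illustrative only,
# not the LLaMA-Omni code); dimensions and module names are assumptions.
import torch
import torch.nn as nn

class SpeechToSpeechLM(nn.Module):
    def __init__(self, speech_encoder, llm, enc_dim=1280, llm_dim=4096, num_units=1000):
        super().__init__()
        self.speech_encoder = speech_encoder  # e.g. a frozen Whisper encoder
        # Adapter projects speech features into the LLM's embedding space
        self.adapter = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # e.g. a Llama-3.1-8B-Instruct backbone
        # Head that maps LLM hidden states to discrete units for a vocoder
        self.unit_head = nn.Linear(llm_dim, num_units)

    def forward(self, speech_input):
        with torch.no_grad():  # keep the pretrained encoder frozen
            feats = self.speech_encoder(speech_input)
        prompt_embeds = self.adapter(feats)  # spoken prompt as "soft tokens"
        out = self.llm(inputs_embeds=prompt_embeds, output_hidden_states=True)
        hidden = out.hidden_states[-1]
        return self.unit_head(hidden)  # unit logits for a unit-based vocoder
```

Generating the speech output directly from the LLM's hidden states, rather than waiting for a complete text reply and running TTS afterward, is what allows the low response latency quoted above.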
The research team evaluated LLaMA-Omni's performance on two tasks, speech-to-text instruction-following (S2TIF) and speech-to-speech instruction-following (S2SIF), comparing it to several baseline models, including Qwen2-Audio. The evaluation dataset was a subset of Alpaca-Eval containing 199 prompts; the team fed these text prompts through a TTS system to produce the spoken versions used as model input.
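As an illustration of that prompt-preparation step, the snippet below converts text prompts into audio files; gTTS is used only as a stand-in engine and is not the TTS system the team used, and the sample prompts and file names are made up.

```python
# Turn text evaluation prompts into spoken prompts (gTTS as a stand-in
# TTS engine; the prompts and file names below are illustrative).
from gtts import gTTS

prompts = [
    "Explain the difference between weather and climate.",
    "Suggest a quick dinner I can cook in twenty minutes.",
]

for i, prompt in enumerate(prompts):
    tts = gTTS(prompt)                  # synthesize the text prompt as speech
    tts.save(f"speech_prompt_{i}.mp3")  # the audio file becomes the model's input
```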
The team used GPT-4o to automatically score each model's output, judging it on content (whether the output fulfills the user's instruction) and style (whether the output is suited to spoken interaction). On the S2TIF task, LLaMA-Omni outperformed the baselines on style, and on the S2SIF task it outperformed them on both content and style.
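This kind of LLM-as-judge scoring can be approximated in a few lines; the prompt wording, score scale, and helper function below are assumptions, not the team's actual evaluation script.

```python
# Minimal LLM-as-judge sketch (assumed prompt and 1-5 scale, not the
# authors' evaluation code); requires an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a voice assistant's reply.
Instruction: {instruction}
Reply: {reply}
Give a 1-5 score for content (does the reply fulfill the instruction?)
and a 1-5 score for style (is it natural, concise spoken language?).
Answer in the form: content=<n> style=<n>"""

def judge(instruction: str, reply: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction, reply=reply),
        }],
    )
    return response.choices[0].message.content
```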
In a discussion about LLaMA-Omni on Hacker News, one user pointed out the benefits of an end-to-end model for speech and text compared with a cascaded system of standalone components:
Essentially, there's data loss from audio -> text. Sometimes that loss is unimportant, but sometimes it meaningfully improves output quality. However, there are some other potential fringe benefits here: improving the latency of replies, improving speaker diarization, and reacting to pauses better for conversations.
Users on Reddit also commented on the model, especially its use of OpenAI's Whisper model for speech encoding:
[T]heir input approach is similar to how LLaVA added image understanding by training a glue layer for Llama and CLIP. LLaMA-Omni takes whisper as their encoder like LLaVA takes CLIP. Then the embeddings are projected into the feature space of their underlying Llama model. I didn't immediately understand their voice output architecture so I can't comment on that.
The integration of speech I/O into LLMs is a growing trend. Earlier this year, InfoQ covered the release of OpenAI's GPT-4 omni (GPT-4o), a version of GPT-4 trained end-to-end to handle speech data. InfoQ also covered Alibaba's open-weight Qwen2-Audio, which accepts speech input but only outputs text.
The LLaMA-Omni model files are available on Hugging Face.
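For readers who want to try the model, the files can be pulled with the huggingface_hub client; the repository id below is an assumption and should be checked against the project's Hugging Face page.

```python
# Download the published model files (repo id assumed; verify on Hugging Face).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("ICTNLP/Llama-3.1-8B-Omni")
print(f"Model files downloaded to {local_dir}")
```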