Meta recently open-sourced Massively Multilingual & Multimodal Machine Translation (SeamlessM4T), a multilingual translation AI that can translate both speech audio and text data across nearly 100 languages. SeamlessM4T is trained on 1 million hours of audio data and outperforms the current state-of-the-art speech-to-text translation model.
SeamlessM4T is a multimodal model that can take either text or audio as input and produce either as output, allowing it to perform automatic speech recognition (ASR), text-to-text translation (T2TT), speech-to-text translation (S2TT), text-to-speech translation (T2ST), and speech-to-speech translation (S2ST). The model is released under the non-commercial CC BY-NC 4.0 license. Meta is also releasing their training dataset, SeamlessAlign, which contains 270,000 hours of aligned speech and text, as well as their code for mining the data from the internet. According to Meta,
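For developers who want to try these tasks, the project's GitHub repository exposes them through a single Translator class. The sketch below follows the usage pattern shown in the repository's README; the model and vocoder card names, task strings, and predict signature are taken from the release-time examples and may change in later versions:

```python
import torch
from seamless_communication.models.inference import Translator

# Load a SeamlessM4T model card plus a vocoder for speech output.
# Card names ("seamlessM4T_large", "vocoder_36langs") follow the README and may differ.
translator = Translator(
    "seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

# T2TT: translate English text into French; only the text output is used here.
translated_text, _, _ = translator.predict(
    "The weather is lovely today.", "t2tt", "fra", src_lang="eng"
)
print(translated_text)

# S2ST: translate a WAV file into Spanish speech; returns text, a waveform, and its sample rate.
translated_text, waveform, sample_rate = translator.predict("input.wav", "s2st", "spa")
```

The other tasks (S2TT, T2ST, and ASR) follow the same pattern, differing only in the task string and whether the input is text or an audio file.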
We believe the work we’re announcing today is a significant step forward....Our single model provides on-demand translations that enable people who speak different languages to communicate more effectively. We significantly improve performance for the low and mid-resource languages we support. These are languages that have smaller digital linguistic footprints....This is only the latest step in our ongoing effort to build AI-powered technology that helps connect people across languages. In the future, we want to explore how this foundational model can enable new communication capabilities—ultimately bringing us closer to a world where everyone can be understood.
Meta's motivation for this research is to build a universal translation system like the Babel fish from The Hitchhiker’s Guide to the Galaxy sci-fi stories. InfoQ has covered several of their previous efforts, including their T2TT model No Language Left Behind (NLLB), which can translate text between 200 languages, and their Massively Multilingual Speech (MMS) model, which supports ASR and text-to-speech synthesis (TTS) in over 1,100 languages. InfoQ also covered other work in the area, such as OpenAI's Whisper, which can transcribe and translate speech audio from 97 different languages; Google's Universal Speech Model (USM), which supports ASR in over 100 languages; and Google's AudioPaLM, which was the previous state-of-the-art model for S2ST.
SeamlessM4T is based on the UnitY neural network architecture, which consists of a pipeline of three components. First, an encoder accepts either speech audio or text input and captures its meaning; the speech sub-component is based on w2v-BERT and the text sub-component on NLLB. Next, a decoder, also based on NLLB, converts that representation into text in the target language. Finally, a text-to-unit decoder converts the target text into discrete acoustic units, which a vocoder then synthesizes into speech audio.
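The following pseudocode sketches how the three stages fit together for a single request; all of the component names are illustrative and do not correspond to Meta's actual implementation:

```python
# Illustrative sketch of a UnitY-style two-pass pipeline; component names are invented.
def translate(source, source_is_speech, tgt_lang,
              speech_encoder, text_encoder, text_decoder, unit_decoder, vocoder):
    # 1. Encode the input into a shared semantic representation:
    #    speech via a w2v-BERT-style encoder, text via an NLLB-style encoder.
    hidden = speech_encoder(source) if source_is_speech else text_encoder(source)

    # 2. Decode that representation into text in the target language (NLLB-style decoder).
    target_text = text_decoder(hidden, tgt_lang)

    # 3. For speech output, predict discrete acoustic units from the text,
    #    then synthesize a waveform with a vocoder. Text-only tasks stop after step 2.
    units = unit_decoder(target_text, hidden)
    waveform = vocoder(units, tgt_lang)

    return target_text, waveform
```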
Meta compared their model's performance both to cascaded approaches, which chain together separate ASR, T2TT, and TTS models, and to single-model systems. The systems were evaluated on the FLEURS and CVSS benchmarks. On FLEURS, SeamlessM4T "sets a new standard for translations into multiple target languages," outperforming AudioPaLM by 20%. SeamlessM4T also outperformed cascaded models; on CVSS, a speech-to-speech translation benchmark, it was "stronger by 58%."
Several users discussed SeamlessM4T on Hacker News. One user shared tips on how to get the model to run locally, and pointed out that it had a context limit of 4096 tokens. Another user asked:
Will there be a whispercpp equivalent? Half the reason I love whisper is how dead simple it is to get running. I will take somewhat lower accuracy for easier operation.
The SeamlessM4T code and models are available on GitHub. An interactive translation demo is available on Hugging Face.