Google Develops Voice Transfer AI for Restoring Voices

A team at Google Research developed a zero-shot voice transfer (VT) model that can be used to customize a text-to-speech (TTS) system with a specific person's voice. This allows speakers who have lost their voice, for example to Parkinson's disease or ALS, to use a TTS device that replicates their original voice. The model also works across languages.

The model supports few-shot and zero-shot operation, requiring only a few seconds of reference speech audio to replicate a voice. This is a key feature for speakers who may not have "banked" several audio samples of their voice before losing it. A speaker encoder takes a spectrogram of the voice audio and produces an embedding vector that represents the voice; the embedding is then passed to the decoder stage of Google's modular TTS system. In experiments, the Google team found that the system can work across languages, producing speech in a language the reference speaker does not speak.
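To make the idea of a voice embedding more concrete, here is a minimal sketch in Python. It is not Google's speaker encoder, which is a trained neural network; it assumes a toy stand-in that mean-pools spectrogram frames and normalizes the result. The names `speaker_embedding` and `cosine_similarity` are hypothetical. The point it illustrates is that reference clips of any length map to a vector of the same fixed size.

```python
import math
from typing import List

# A spectrogram is modeled here as a list of frames, each a list of
# per-frequency-bin magnitudes. This layout is an assumption for illustration.
Spectrogram = List[List[float]]


def speaker_embedding(spectrogram: Spectrogram) -> List[float]:
    """Toy stand-in for a learned speaker encoder (assumption, not Google's model).

    Averages the frames over time, then scales the result to unit length,
    so clips of any duration yield a vector of the same size.
    """
    if not spectrogram:
        raise ValueError("spectrogram must contain at least one frame")
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    pooled = [sum(frame[b] for frame in spectrogram) / n_frames for b in range(n_bins)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0  # avoid division by zero
    return [x / norm for x in pooled]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product of two unit-length embeddings, which equals their cosine."""
    return sum(x * y for x, y in zip(a, b))


short_clip = [[3.0, 4.0]]
long_clip = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
same_voice = cosine_similarity(speaker_embedding(short_clip), speaker_embedding(long_clip))
```

In a real system the pooling step would be replaced by a trained network, but the interface is the same: a spectrogram goes in and a fixed-size vector comes out for the decoder to use.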

Speech therapist Richard Cave wrote about the work on X, saying:

Here is a stunning example of where synthetic approximation of natural speech is going - and such wonderful use cases! Exciting times.

The new VT model is based on a Google-developed TTS system trained on multilingual "found" data, which includes text-only data, speech-text paired data, and untranscribed speech data. This system can perform TTS in over 100 languages. It uses a text encoder to convert text data to a sequence of tokens. The tokens are then passed to a duration predictor, which creates a different sequence that matches the expected duration of the output audio. Finally, this sequence is passed to a decoder that applies acoustic features; the voice transfer is done by this decoder.
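The three stages described above can be sketched as a small Python pipeline. This is an illustrative outline only: every function here (`encode_text`, `predict_durations`, `expand_to_frames`, `decode`, `synthesize`) is a hypothetical placeholder, the per-character tokens and fixed durations are assumptions, and the real stages are trained neural networks. What the sketch shows is the data flow: tokens are expanded to a frame-level sequence, and the speaker embedding enters only at the decoder.

```python
from typing import List


def encode_text(text: str) -> List[int]:
    """Hypothetical text encoder: one token ID per character (its code point)."""
    return [ord(ch) for ch in text]


def predict_durations(tokens: List[int]) -> List[int]:
    """Hypothetical duration predictor: 1 frame for a space, 3 for anything else.

    A real predictor is learned; these fixed values are an assumption.
    """
    return [1 if token == ord(" ") else 3 for token in tokens]


def expand_to_frames(tokens: List[int], durations: List[int]) -> List[int]:
    """Repeat each token for its predicted number of frames."""
    frames: List[int] = []
    for token, n_frames in zip(tokens, durations):
        frames.extend([token] * n_frames)
    return frames


def decode(frames: List[int], speaker_embedding: List[float]) -> List[List[float]]:
    """Placeholder decoder: attaches the speaker embedding to every frame.

    A real decoder would emit acoustic features; this one only shows where
    the voice conditioning is applied.
    """
    return [[float(token)] + list(speaker_embedding) for token in frames]


def synthesize(text: str, speaker_embedding: List[float]) -> List[List[float]]:
    """Run the stages in order: text encoder, duration predictor, decoder."""
    tokens = encode_text(text)
    durations = predict_durations(tokens)
    frames = expand_to_frames(tokens, durations)
    return decode(frames, speaker_embedding)


features = synthesize("hi a", [0.6, 0.8])
```

Because the voice is injected only in the last stage, the same text encoder and duration predictor can in principle serve any speaker, which is consistent with the modular design the article describes.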

Voice Transfer Model Architecture. Image Source: Google Research Blog

Google performed experiments in which human judges were given pairs of audio samples, one from a real human speaker ("reference" speech) and one generated by the VT model. The judges were asked to decide whether the samples were from the same speaker, and 76% of the time the judges thought they were. The team ran a similar experiment in which the judges were native speakers of a language other than English. The audio pairs included reference speech in English and generated speech in the judge's native language. The judges thought the speakers were the same 73% of the time.

AI-enabled voice transfer is an active research topic, and InfoQ has covered several VT systems recently. In 2023, InfoQ wrote about Microsoft's VALL-E, which can replicate a voice after three seconds of audio recording; Meta's Voicebox, which can produce speech in six languages, as well as edit and remove noise from speech recordings; and Google's AudioPaLM, which can perform TTS, automated speech recognition (ASR), and speech-to-speech translation (S2ST) with voice transfer. Earlier this year, InfoQ covered Amazon's BASE TTS, which supports voice cloning.

The ability of AI models to clone voices raises concerns about misuse. In the case of Google's new VT model, the researchers added audio watermarking to the output: "imperceptible information within the synthesized audio waveform" that can be detected by software.
