A team at Google Research developed a zero-shot voice transfer (VT) model that can be used to customize a text-to-speech (TTS) system with a specific person's voice. This allows speakers who have lost their voice, for example from Parkinson's disease or ALS, to use a TTS device that replicates their original voice. The model also works across languages.
The model supports few-shot and zero-shot operation, requiring only a few seconds of reference speech audio to replicate a voice. This is a key feature for speakers who may not have "banked" several audio samples of their voice before losing it. A speaker encoder converts a spectrogram of the reference audio into an embedding vector that represents the voice; the embedding is then passed to the decoder stage of Google's modular TTS system. In experiments, the Google team found that the system works across languages, producing speech in a language the reference speaker does not speak.
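The flow from reference audio to voice embedding can be pictured in a few lines of code. The Python sketch below is a hedged illustration of that idea only; the `SpeakerEncoder` class, the toy `spectrogram` helper, and all dimensions are assumptions, not Google's implementation:

```python
import numpy as np

# Minimal, hypothetical sketch of the zero-shot flow described above:
# a few seconds of reference audio -> spectrogram -> speaker embedding.
# Names, dimensions, and toy internals are illustrative assumptions,
# not Google's published code.

N_MELS = 80      # spectrogram frequency bins
EMBED_DIM = 256  # size of the voice embedding vector

def spectrogram(waveform: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """Toy magnitude spectrogram: windowed |FFT| truncated to N_MELS bins."""
    frames = [waveform[i:i + frame] for i in range(0, len(waveform) - frame, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return mags[:, :N_MELS]  # stand-in for a real mel filterbank

class SpeakerEncoder:
    """Maps a spectrogram to a fixed-size voice embedding (placeholder weights)."""
    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((N_MELS, EMBED_DIM)) * 0.01

    def embed(self, spec: np.ndarray) -> np.ndarray:
        pooled = spec.mean(axis=0)        # average over time frames
        vec = pooled @ self.proj          # project to embedding space
        return vec / np.linalg.norm(vec)  # unit-normalize the voice vector

# Zero-shot usage: only a few seconds of reference audio are needed.
reference_audio = np.random.randn(3 * 16000)  # ~3 s at 16 kHz (dummy signal)
voice_embedding = SpeakerEncoder().embed(spectrogram(reference_audio))
print(voice_embedding.shape)  # (256,) -- passed on to the TTS decoder stage
```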
Speech therapist Richard Cave wrote about the work on X, saying:
Here is a stunning example of where synthetic approximation of natural speech is going - and such wonderful use cases! Exciting times.
The new VT model is based on a TTS system Google developed that is trained on multilingual "found" data: text-only data, speech-text paired data, and untranscribed speech. This system can perform TTS in over 100 languages. The system uses a text encoder to convert input text into a sequence of tokens. The tokens are then passed to a duration predictor, which produces a new sequence matching the expected duration of the output audio. Finally, this sequence is passed to a decoder that applies acoustic features; the voice transfer happens in this decoder.
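Under the same assumptions as the earlier sketch, the end-to-end pipeline might be wired together as follows. Again, this is illustrative only; the `TextEncoder`, `DurationPredictor`, and `Decoder` classes and their toy internals are hypothetical stand-ins for Google's modules:

```python
import numpy as np

# Hedged sketch of the modular pipeline described above: text encoder ->
# duration predictor -> decoder. Class names, dimensions, and toy internals
# are assumptions for illustration, not Google's published API.

TOKEN_DIM = 64  # toy token/embedding width

class TextEncoder:
    """Converts input text into a sequence of token representations."""
    def encode(self, text: str) -> np.ndarray:
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(text), TOKEN_DIM))  # one vector per character

class DurationPredictor:
    """Expands the token sequence to match the expected output-audio duration."""
    def expand(self, tokens: np.ndarray) -> np.ndarray:
        frames_per_token = 5  # toy stand-in for a learned duration model
        return np.repeat(tokens, frames_per_token, axis=0)

class Decoder:
    """Applies acoustic features; this is the stage where voice transfer happens."""
    def synthesize(self, frames: np.ndarray, voice_embedding: np.ndarray) -> np.ndarray:
        # Conditioning on the speaker embedding steers the output voice.
        return frames + voice_embedding  # stand-in for predicted acoustic features

def tts(text: str, voice_embedding: np.ndarray) -> np.ndarray:
    tokens = TextEncoder().encode(text)          # text -> token sequence
    frames = DurationPredictor().expand(tokens)  # tokens -> duration-aligned frames
    return Decoder().synthesize(frames, voice_embedding)

features = tts("Hello, world.", np.zeros(TOKEN_DIM))  # embedding from the speaker encoder
print(features.shape)  # (65, 64): 13 characters x 5 frames each
```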
Voice Transfer Model Architecture. Image Source: Google Research Blog
Google performed experiments in which human judges were given pairs of audio samples, one from a real human speaker ("reference" speech) and one generated by the VT model. The judges were asked to decide whether the samples came from the same speaker, and 76% of the time they thought so. The team ran a similar experiment in which the judges were native speakers of a language other than English; the audio pairs included reference speech in English and generated speech in the judge's native language. In this case, the judges thought the speakers were the same 73% of the time.
AI-enabled voice transfer is an active research topic, and InfoQ has covered several VT systems recently. In 2023, InfoQ wrote about Microsoft's VALL-E, which can replicate a voice from only three seconds of recorded audio; Meta's Voicebox, which can produce speech in six languages as well as edit and remove noise from speech recordings; and Google's AudioPaLM, which can perform TTS, automated speech recognition (ASR), and speech-to-speech translation (S2ST) with voice transfer. Earlier this year, InfoQ covered Amazon's BASE TTS, which supports voice cloning.
The ability of AI models to clone voices raises concerns about misuse. In the case of Google's new VT model, the researchers added audio watermarking to the output: "imperceptible information within the synthesized audio waveform" that can be detected by software.
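Google has not detailed its watermarking scheme, but the general idea can be illustrated with a classic spread-spectrum approach: add a low-amplitude pseudorandom signal keyed by a secret seed, then detect it later by correlation. The sketch below is a generic Python illustration under those assumptions, not Google's method; real schemes shape the watermark psychoacoustically to keep it inaudible:

```python
import numpy as np

# Generic spread-spectrum watermark sketch. Google has not published its
# scheme; this only illustrates the idea of embedding imperceptible,
# software-detectable information in a waveform. KEY and ALPHA are
# hypothetical parameters.

KEY = 42       # shared secret seed known to the detector
ALPHA = 0.005  # watermark amplitude, kept small relative to the signal

def watermark_signal(length: int) -> np.ndarray:
    """Pseudorandom +/-1 pattern reproducible from the secret seed."""
    return np.random.default_rng(KEY).choice([-1.0, 1.0], size=length)

def embed(waveform: np.ndarray) -> np.ndarray:
    """Add the low-amplitude watermark to the synthesized audio."""
    return waveform + ALPHA * watermark_signal(len(waveform))

def detect(waveform: np.ndarray) -> bool:
    """Correlate against the known pattern; watermarked audio scores near ALPHA."""
    score = np.dot(waveform, watermark_signal(len(waveform))) / len(waveform)
    return score > 0.5 * ALPHA

audio = np.random.randn(16000) * 0.1  # 1 s of dummy audio at 16 kHz
print(detect(embed(audio)), detect(audio))  # expected: True False
```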