"Target speech hearing" is a new deep-learning algorithm developed at the University of Washington to allow users to "enroll" a speaker and cancel all environmental noise surrounding their voice.
Currently, the system requires the person wearing the headphones to tap a button while gazing at someone talking, or simply to look at them for three to five seconds. This directs a deep-learning model to learn the speaker's vocal patterns and latch onto them, so it can play the speaker's voice back to the listener even as the listener moves around and stops looking at that person.
A naive approach would require a clean speech example to enroll the target speaker. However, this is not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user-interface problem. The researchers instead present the first enrollment interface in which the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker.
The key to the enrollment step is that the wearer is looking in the direction of the speaker, so the speaker's voice is aligned across the two binaural microphones while interfering speakers are likely not aligned. This noisy example is fed to a neural network that captures the characteristics of the target speaker and extracts the corresponding embedding vector, which a second neural network then uses to extract the target speech from a cacophony of speakers.
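In simplified terms, this can be pictured as a two-stage pipeline: an enrollment network turns the noisy binaural clip into a speaker embedding, and a separation network conditioned on that embedding extracts the target voice from the mixture. The PyTorch sketch below illustrates that structure only; the class names, layer choices, and dimensions are illustrative assumptions and do not reflect the published implementation.

```python
import torch
import torch.nn as nn

class EnrollmentNet(nn.Module):
    """Illustrative stand-in: maps a short, noisy binaural clip to a speaker embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        # 2 input channels = left/right binaural microphones
        self.encoder = nn.Conv1d(2, emb_dim, kernel_size=400, stride=160)
        self.proj = nn.Linear(emb_dim, emb_dim)

    def forward(self, binaural_clip):           # (batch, 2, samples)
        feats = torch.relu(self.encoder(binaural_clip))
        emb = feats.mean(dim=-1)                # pool over time -> one vector per clip
        return self.proj(emb)                   # (batch, emb_dim) speaker embedding

class TargetSpeechExtractor(nn.Module):
    """Illustrative stand-in for the separation network conditioned on the embedding."""
    def __init__(self, emb_dim=256, hidden=256):
        super().__init__()
        self.encoder = nn.Conv1d(2, hidden, kernel_size=16, stride=8)
        self.film = nn.Linear(emb_dim, hidden)  # condition the features on the enrolled speaker
        self.decoder = nn.ConvTranspose1d(hidden, 2, kernel_size=16, stride=8)

    def forward(self, noisy_mix, speaker_emb):  # (batch, 2, samples), (batch, emb_dim)
        h = torch.relu(self.encoder(noisy_mix))
        h = h * self.film(speaker_emb).unsqueeze(-1)   # scale features by the speaker embedding
        return self.decoder(h)                  # binaural estimate of the target speaker

# Enrollment: a few seconds of noisy binaural audio captured while facing the speaker.
enroll_net, extractor = EnrollmentNet(), TargetSpeechExtractor()
enroll_clip = torch.randn(1, 2, 16000 * 4)      # ~4 s stereo clip at 16 kHz (placeholder data)
speaker_emb = enroll_net(enroll_clip)

# Listening: the cached embedding steers extraction even after the wearer looks away.
mixture = torch.randn(1, 2, 16000)              # 1 s of noisy binaural input (placeholder data)
target_only = extractor(mixture, speaker_emb)
```

In such a design the embedding is computed once at enrollment and cached, so only the extraction network needs to run continuously while the wearer listens.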
According to the researchers, this constitutes a significant step forward compared to existing noise-canceling headphones, which can effectively cancel out all sounds but cannot selectively pick out a speaker based on their speech traits.
To make this possible, the team had to solve several problems, including optimizing the state-of-the-art speech separation network TFGridNet to run in real time on embedded CPUs, and finding a training methodology that uses synthetic data to build a system capable of generalizing to real-world, unseen speakers, among other challenges.
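Running in real time on an embedded CPU means each audio block must be processed faster than it arrives. The sketch below shows the general pattern of chunk-wise streaming inference and how a real-time factor might be measured; the placeholder model, chunk size, and sample rate are assumptions for illustration, not the optimized TFGridNet variant used by the researchers.

```python
import time
import torch

# Placeholder for a causal, chunk-wise separation model; the real system runs an
# optimized TFGridNet variant, this stand-in merely has a comparable streaming interface.
model = torch.nn.Conv1d(2, 2, kernel_size=65, stride=1, padding=32)
model.eval()

SAMPLE_RATE = 16000          # assumed sample rate for illustration
CHUNK_SAMPLES = 128          # 8 ms blocks at 16 kHz (assumed block size)

def process_stream(binaural_audio):
    """Run the model one small chunk at a time and report the real-time factor."""
    outputs, elapsed = [], 0.0
    with torch.no_grad():
        for start in range(0, binaural_audio.shape[-1] - CHUNK_SAMPLES + 1, CHUNK_SAMPLES):
            chunk = binaural_audio[..., start:start + CHUNK_SAMPLES]
            t0 = time.perf_counter()
            outputs.append(model(chunk))
            elapsed += time.perf_counter() - t0
    audio_seconds = binaural_audio.shape[-1] / SAMPLE_RATE
    return torch.cat(outputs, dim=-1), elapsed / audio_seconds  # factor < 1.0: faster than real time

stream = torch.randn(1, 2, SAMPLE_RATE * 5)   # 5 seconds of binaural audio (placeholder data)
_, rtf = process_stream(stream)
print(f"real-time factor: {rtf:.3f}")
```

A real streaming model would also carry internal state (such as a look-back buffer) across chunks; that detail is omitted here for brevity.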
Shyam Gollakota, one of the researchers behind both this project and the earlier "semantic hearing" work, highlights that their project differs from current approaches to AI in that it aims to modify people's auditory perception using on-device AI, without relying on cloud-based services.
At the moment, the system can enroll only one speaker at a time. Another limitation is that enrollment succeeds only if no other loud voice is coming from the same direction as the target speaker; if the wearer is not satisfied with the initial result, they can run the enrollment again on the same speaker to improve clarity.
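The "same direction" constraint follows from the binaural alignment idea described earlier: when the wearer faces the target, the target's voice reaches both microphones nearly simultaneously, so another loud voice arriving from the same direction would be just as aligned and therefore ambiguous. The snippet below gives a minimal illustration of checking that alignment via plain cross-correlation of the two channels; it is a conceptual example, not the researchers' enrollment check.

```python
import numpy as np

MAX_ALIGNED_DELAY = 2  # samples; roughly "straight ahead" (assumed threshold for illustration)

def dominant_source_delay(left, right):
    """Estimate the inter-channel delay (in samples) of the dominant sound source
    by cross-correlating the two binaural channels."""
    corr = np.correlate(left, right, mode="full")
    return np.argmax(corr) - (len(right) - 1)  # positive: sound reaches the left ear later

def looks_head_aligned(left, right):
    """Heuristic: the loudest source arrives at both ears nearly simultaneously."""
    return abs(dominant_source_delay(left, right)) <= MAX_ALIGNED_DELAY

# Toy example: a source delayed by 6 samples between the ears is off to one side.
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)            # short burst of wideband noise as a stand-in signal
left, right = src, np.roll(src, 6)
print(looks_head_aligned(left, right))     # False
```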
The team has open-sourced their code and dataset to facilitate future research work to improve target speech hearing.