Researchers from Stanford University have developed a brain-computer interface (BCI) for synthesizing speech from signals captured in a patient's brain and processed by a recurrent neural network (RNN). The prototype system can decode speech at 62 words per minute, 3.4x faster than previous BCI methods.
The system was described in a paper published on bioRxiv. Working with a patient who had lost the ability to speak due to amyotrophic lateral sclerosis (ALS), the team used microelectrodes implanted in the patient's brain to capture the neural activity generated when the patient attempted to speak. These signals were passed to an RNN, specifically a gated recurrent unit (GRU) model, which was trained to decode the neural signals into phonemes for speech synthesis. When trained on a limited vocabulary of 50 words, the system achieved a 9.1% error rate; on a 125,000-word vocabulary, the error rate was 23.8%. According to the researchers:
[We] demonstrated a speech BCI that can decode unconstrained sentences from a large vocabulary at a speed of 62 words per minute, the first time that a BCI has far exceeded the communication rates that alternative technologies can provide for people with paralysis...Our demonstration is a proof of concept that decoding attempted speaking movements from intracortical recordings is a promising approach, but it is not yet a complete, clinically viable system.
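To make the decoding pipeline described above more concrete, the sketch below shows how a GRU-based decoder of this kind might be wired up: a recurrent layer consumes a time series of neural features from the electrode arrays, and a linear readout produces per-timestep phoneme scores. The channel count, hidden size, and phoneme inventory here are illustrative assumptions, not the values used in the Stanford study.

```python
# Minimal sketch of a GRU-based neural-signal-to-phoneme decoder (PyTorch).
# All sizes below are illustrative assumptions, not the study's parameters.
import torch
import torch.nn as nn

N_CHANNELS = 256     # binned neural-feature channels from the microelectrode arrays (assumed)
N_PHONEMES = 39 + 1  # phoneme inventory plus a "blank" token for CTC-style decoding (assumed)

class PhonemeDecoder(nn.Module):
    def __init__(self, hidden_size: int = 512, num_layers: int = 3):
        super().__init__()
        # The GRU consumes a time series of neural features and produces
        # a hidden state at every timestep summarizing recent activity.
        self.gru = nn.GRU(N_CHANNELS, hidden_size, num_layers, batch_first=True)
        # A linear readout maps each hidden state to phoneme logits.
        self.readout = nn.Linear(hidden_size, N_PHONEMES)

    def forward(self, neural_features: torch.Tensor) -> torch.Tensor:
        # neural_features: (batch, time, channels)
        hidden, _ = self.gru(neural_features)
        return self.readout(hidden)  # (batch, time, phoneme logits)

# Example: decode a 5-second window of activity binned at 50 Hz (250 timesteps).
model = PhonemeDecoder()
logits = model(torch.randn(1, 250, N_CHANNELS))
print(logits.shape)  # torch.Size([1, 250, 40])
```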
The use of deep learning models to interpret human brain activity is an active research area, and InfoQ has covered several BCI projects involving assistive devices. Many of these use sensors that are implanted in a patient's brain, because these provide the best signal quality; in 2019 InfoQ covered a system developed by Meta which uses such signals to allow users to "type" by imagining themselves speaking. InfoQ has also covered systems that use external or "wearable" sensors, such as the one developed by Georgia Tech in 2021, which allows users to control a video game by imagining activity.
The Stanford system uses four microelectrode arrays implanted in the patient's ventral premotor cortex and Broca's area. To collect training data for the RNN, the patient was given a few hundred sentences each day, which she "mouthed," or pantomimed speaking; the resulting neural signals were captured by the microelectrodes. Overall, the team collected 10,850 sentences. Using "custom machine learning methods" from the speech recognition domain, the researchers trained the RNN to output a sequence of phonemes.
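The preprint describes these only as custom methods adapted from speech recognition; one standard technique for training a network to emit phoneme sequences from unsegmented time series is the connectionist temporal classification (CTC) loss. The following is a hypothetical training step using CTC with the decoder sketched earlier, not the study's actual code.

```python
# Hypothetical training step using CTC loss, a common speech-recognition
# technique for aligning unsegmented input frames to phoneme sequences.
# This is an assumed illustration, not the Stanford team's training code.
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss = nn.CTCLoss(blank=0)  # index 0 reserved for the CTC blank token

def training_step(model, optimizer, neural_features, phoneme_targets,
                  input_lengths, target_lengths):
    """One gradient step: neural features in, phoneme label sequences as targets."""
    optimizer.zero_grad()
    logits = model(neural_features)                     # (batch, time, phonemes)
    log_probs = F.log_softmax(logits, dim=-1)
    # nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
    loss = ctc_loss(log_probs.transpose(0, 1), phoneme_targets,
                    input_lengths, target_lengths)
    loss.backward()
    optimizer.step()
    return loss.item()
```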
To evaluate the system, the team had the patient mouth sentences that were never used in training; the test sentences included some restricted to the 50-word vocabulary as well as some drawn from the 125,000-word one. The researchers also experimented with adding a language model to the decoder, which improved the error rate from 23.8% to 17.4%, and with reducing the time between training and testing the RNN, to lessen the effect of day-to-day changes in neural activity. They concluded that the system could see "substantial gains in performance" with further work on language modeling and more robust decoding techniques.
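Error rates for speech decoders are typically reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the decoded sentence into the reference sentence, divided by the reference length. A straightforward implementation of the metric looks like this:

```python
# Word error rate: edit distance between decoded and reference word sequences,
# normalized by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quack brown box"))  # 0.5
```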
Lead researcher Frank Willett posted about the work on Twitter and answered several questions. In response to a question about whether the RNN predicted the next word that would be spoken, Willett replied:
No next word prediction - the language model simply outputs the best explanation of all RNN outputs produced so far.
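In other words, the language model rescores complete candidate sentences rather than predicting upcoming words: the final output is simply the hypothesis with the best combined decoder-plus-language-model score. The simplified function below illustrates the idea; the scoring functions and weighting are assumptions, not the study's implementation.

```python
# Illustrative rescoring: keep the candidate sentence whose combined
# phoneme-decoder score and language-model score is highest.
# rnn_score, lm_score, and lm_weight are hypothetical stand-ins.
def best_hypothesis(candidates, rnn_score, lm_score, lm_weight=0.5):
    """Return the candidate sentence with the highest combined score."""
    return max(candidates,
               key=lambda sent: rnn_score(sent) + lm_weight * lm_score(sent))
```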
Willett also said that the team would publish their code and data after the work is "published in a peer-reviewed journal."