A high-performance speech neuroprosthesis

Engineering and Technology


F. R. Willett, E. M. Kunz, et al.

This study presents a high-performance speech-to-text brain-computer interface that substantially improves decoding accuracy and speed for a participant with ALS, achieving a 9.1% word error rate on a 50-word vocabulary and 23.8% on a 125,000-word vocabulary at 62 words per minute. Francis R. Willett and colleagues also characterize how speech articulators are represented in cortex, informing the design of future speech neuroprostheses.

Introduction
The study addresses whether high-resolution intracortical recordings can enable accurate, rapid decoding of attempted speech—sufficient for unconstrained, large-vocabulary communication in people with severe dysarthria due to paralysis. The broader context is that prior speech BCIs have shown promise but have not achieved low word error rates on large vocabularies or near-conversational speeds. To investigate cortical organization of speech production and enable decoding, the team recorded from four microelectrode arrays implanted in a participant with bulbar-onset ALS who retains limited orofacial movement and vocalization but cannot produce intelligible speech. Arrays were placed in ventral premotor cortex (area 6) and primary motor cortex (area 4). Tasks involved attempted single orofacial movements, phonemes, words, and then sentences, with the goal of characterizing neural tuning and building a real-time speech-to-text decoder.
Literature Review
The paper situates its contribution against earlier demonstrations of speech BCIs that could decode limited vocabularies or constrained sentences, but with insufficient accuracy for general communication. It notes that prior communication BCIs based on hand movement achieved 8–18 words per minute in people with paralysis, and that the previous speed record for speech BCIs was substantially lower than the 62 words per minute achieved here. The authors also reference prior work implicating primary motor cortex in speech production and existing safety and longevity studies of intracortical microelectrode arrays, while emphasizing the gap in large-vocabulary, low-error-rate speech decoding at conversational speeds.
Methodology
Participants and implants: One participant (T12) with bulbar-onset ALS was enrolled in a BrainGate pilot clinical trial. Four 96-channel microelectrode arrays were implanted: two in ventral premotor cortex (area 6) and two in primary motor cortex (area 4). Implant locations were guided by the Human Connectome Project multimodal cortical parcellation and MR-derived anatomy. Neural signals included multi-channel threshold crossings and spike-band power.

Tasks and data collection: The participant performed instructed tasks involving attempted orofacial movements, single phonemes, and words to assess tuning. For real-time speech decoding, she attempted to speak sentences prompted on a monitor. At the start of each RNN evaluation day, 260–480 sentences (approximately 21.4 ± 3.7 minutes of data) were collected at the participant's pace. Over days, the training dataset accumulated to 10,850 sentences. Daily data collection and RNN training took ~140 minutes including breaks.

Initial decoding analyses: Population activity from area 6 showed strong separability across movement categories. A naive Bayes classifier applied to trial-wise population activity decoded orofacial movements with high accuracy (among 33 conditions), 39 phonemes at ~62% accuracy, and 50 words at ~43% accuracy. Area 6 contributed the most informative signals, so subsequent analyses focused on this region.

Neural decoding architecture: Real-time decoding used a recurrent neural network (RNN) to convert a time sequence of neural features (aligned and smoothed per electrode at 80 ms time steps) into phoneme probability distributions, including probabilities for a silence token and a CTC blank. The RNN was a five-layer gated recurrent unit (GRU) network trained with connectionist temporal classification (CTC) loss using TensorFlow 2.0.
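The CTC formulation described above can be illustrated with a minimal greedy decoder: the RNN emits a per-frame distribution over phonemes plus a blank token, and collapsing consecutive repeats and dropping blanks yields the phoneme sequence. This is a simplified sketch, not the authors' decoder (which combined phoneme posteriors with a language model); the toy labels are assumptions for illustration.

```python
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """Collapse frame-wise CTC output into a label sequence.

    logits: (T, C) array of per-frame scores; column `blank` is the CTC blank.
    Takes the argmax per frame, merges consecutive repeats, removes blanks.
    """
    best = np.argmax(logits, axis=1)          # most likely label per frame
    collapsed = []
    prev = blank
    for label in best:
        if label != prev and label != blank:  # drop repeats and blanks
            collapsed.append(int(label))
        prev = label
    return collapsed

# Toy example: 3 classes (0 = blank, 1 and 2 = hypothetical phonemes), 6 frames
frames = np.array([
    [0.1, 0.8, 0.1],    # phoneme 1
    [0.1, 0.8, 0.1],    # phoneme 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # phoneme 2
    [0.1, 0.1, 0.8],    # phoneme 2 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
])
print(ctc_greedy_decode(frames))  # prints [1, 2]
```

Note that the blank token is what lets CTC represent genuinely repeated phonemes: a repeat separated by a blank survives the collapse, while adjacent duplicates are merged.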
Continuous learning strategies accounted for non-stationarities: (1) day-specific input layers accommodated across-day neural changes, and (2) rolling feature adaptation addressed within-day drift. The phoneme posterior sequence was combined with task-appropriate language models to infer the most probable word sequence. Two language models were used: a large-vocabulary model (125,000 words) and a small 50-word model. The large-vocabulary decoder was implemented in Kaldi. Evaluation sentences were held out from training and never duplicated.

Online evaluation procedure: For each trial, an instructed delay cued preparation, followed by an automatic 'go' trigger that started neural decoding. As the participant attempted to speak, decoded words were displayed in real time; the participant pressed a button to finalize the output upon completion. Two speaking modes were evaluated: vocalized attempts and silent (non-vocalized) attempts.

Representational analyses: To examine preserved articulatory representations, the team derived 'saliency vectors' (neural activity patterns that maximized RNN phoneme probabilities) and compared them to articulatory measurements from electromagnetic articulography (EMA) in able-bodied speakers. Analyses assessed consonant structure (e.g., place of articulation) and the two-dimensional vowel space (height and front–back), including low-dimensional embeddings to compare neural and articulatory geometries.

Offline and robustness analyses: Error rates were computed using a bootstrap percentile method (10,000 resamples over trials). Offline reprocessing tested improved language models, proximal train–test splits (to mitigate within-day non-stationarity), and varying amounts of same-day training data to assess daily retraining needs. Additional analyses examined gradual neural changes over time and the feasibility of unsupervised decoder updates.
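The rolling feature adaptation for within-day drift can be sketched as re-standardizing each electrode's features against a trailing window of recent data, so slow baseline changes are removed before decoding. This is an illustrative numpy sketch under assumed parameters (window size, synthetic drift), not the authors' exact adaptation procedure.

```python
import numpy as np

def rolling_zscore(features, window=100, eps=1e-6):
    """Adaptively z-score neural features against a trailing window.

    features: (T, E) array, T time steps by E electrodes.
    Each step is standardized by the mean/std of the preceding
    `window` steps, suppressing slow drift in baseline activity.
    """
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        ref = features[max(0, t - window):t] if t > 0 else features[:1]
        mu = ref.mean(axis=0)
        sd = ref.std(axis=0) + eps
        out[t] = (features[t] - mu) / sd
    return out

# Synthetic demo: 4 electrodes with a slow additive baseline drift
rng = np.random.default_rng(0)
drift = np.linspace(0, 5, 500)[:, None]
x = rng.normal(size=(500, 4)) + drift
z = rolling_zscore(x, window=100)
print(abs(z[200:].mean()) < abs(x[200:].mean()))  # prints True: drift largely removed
```

In a real-time setting the trailing statistics would be updated incrementally rather than recomputed per step, but the effect is the same: the decoder sees features whose scale and offset track the current session rather than the calibration data.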
Key Findings
- Neural tuning and separability: Area 6 (ventral premotor cortex) exhibited strong, spatially intermixed tuning to orofacial movements, phonemes, and words. Simple classifiers achieved high separability: approximately 92% accuracy for orofacial movements, ~62% for 39 phonemes, and ~43% for 50 words, indicating rich articulatory information in a small cortical region.
- Real-time large-vocabulary decoding: First demonstration of decoding unconstrained sentences from a 125,000-word vocabulary in a person with paralysis, achieving an online mean word error rate (WER) of 23.8% (95% CI: 21.8–25.9) when vocalizing attempted speech; silent attempts yielded 24.7% (22.0–27.4) WER.
- Small-vocabulary performance: With a 50-word vocabulary, online WER was 9.1% (7.2–11.2) when vocalizing and 11.2% (8.3–14.4) when silent.
- Phoneme error rates (RNN output before the language model): Online mean phoneme error rates ranged from ~19.7% to 22.1% across conditions (19.7% [18.6–20.9] for 125k vocal; 20.9% [19.3–22.6] for 125k silent; 21.4% [19.6–23.2] for 50-word vocal; 22.1% [19.9–24.3] for 50-word silent).
- Speed: Decoding operated at 62 words per minute, 3.4 times faster than the previous record and approaching natural conversation rates (~160 wpm).
- Offline improvements: With improved language modeling and a proximal test set, offline mean WER decreased to 17.4% (15.4–19.5) and, with further optimization, to 11.8% (9.8–13.9); the phoneme error rate in one offline condition decreased to 17.0% (15.7–18.3).
- Training burden and robustness: Offline analyses suggested reasonable performance even without same-day retraining (e.g., ~30% WER with no new-day data). Neural activity drifted gradually, indicating the feasibility of unsupervised adaptation to maintain performance.
- Preserved articulatory representations: Neural 'saliency vectors' for consonants correlated with EMA-derived articulatory structure (r ≈ 0.61), especially when ordered by place of articulation; nasal consonants clustered together, and other detailed structure emerged. Vowel representations reflected the canonical two-dimensional articulatory organization (height and front–back). These findings indicate that detailed articulatory coding persists years after the onset of paralysis.
- Spatial considerations: Accurate decoding was possible from a relatively small cortical area (a few millimeters across), owing to spatially intermixed tuning to multiple articulators, informing design principles for future high-channel-count arrays targeting ventral premotor cortex.
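The word error rates reported above follow the standard edit-distance definition (substitutions, insertions, and deletions divided by reference length), and the confidence intervals use a bootstrap percentile method over trials, as described in the Methodology. A minimal sketch with toy sentences (the 10,000 resamples match the paper; everything else is illustrative):

```python
import numpy as np

def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1]

def wer_with_ci(refs, hyps, n_boot=10_000, seed=0):
    """Aggregate WER plus a bootstrap percentile 95% CI over trials."""
    errs = np.array([edit_distance(r, h) for r, h in zip(refs, hyps)])
    lens = np.array([len(r) for r in refs])
    wer = errs.sum() / lens.sum()
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(refs), size=(n_boot, len(refs)))
    boot = errs[idx].sum(axis=1) / lens[idx].sum(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return wer, (lo, hi)

refs = [["hello", "world"], ["how", "are", "you"]]
hyps = [["hello", "word"], ["how", "are", "you"]]
wer, ci = wer_with_ci(refs, hyps)
print(round(wer, 2))  # prints 0.2 (1 substitution over 5 reference words)
```

Resampling whole trials (rather than individual words) preserves the per-sentence error correlations, which is why the paper's intervals are quoted over trials.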
Discussion
The findings demonstrate that high-resolution intracortical recordings from ventral premotor cortex can support accurate, rapid decoding of attempted speech, directly addressing the challenge of enabling unconstrained, large-vocabulary communication for people with paralysis. The combination of an RNN phoneme decoder with strong language models yielded the first successful large-vocabulary (125,000-word) online decoding in this population, at a speed (62 wpm) that substantially surpasses hand-based BCIs and narrows the gap to natural conversation. The observed spatially intermixed tuning to multiple articulators within a small area and the close match between neural and articulatory structures (for consonants and vowels) provide neurophysiological support for focusing implants on ventral premotor cortex. These neural coding features, coupled with modern sequence modeling and adaptive training strategies, allowed substantial error-rate reductions relative to prior state of the art. Offline analyses indicated that improved language models and better handling of non-stationarities can further reduce WER to near 10–12%, underscoring the promise of algorithmic and interface improvements. The results are relevant for the field by showing that conversational-rate communication via speech BCIs is feasible with current intracortical technology and machine learning, while highlighting the importance of channel count, robust adaptive decoding to neural drift, and carefully chosen cortical targets. The work also suggests a path toward generalizable, clinically practical speech neuroprostheses.
Conclusion
This study provides a proof of concept that attempted speech can be decoded from intracortical neural activity into unconstrained text from a large vocabulary at unprecedented speeds (62 wpm) and with substantially improved accuracy (e.g., 23.8% online WER for 125k words; 9.1% for a 50-word set). It further shows that ventral premotor cortex contains rich, spatially intermixed articulatory representations that resemble articulatory features, enabling effective decoding from a small cortical region. Future work should: (1) increase channel counts and optimize array placement to reduce WER further; (2) advance language models and real-time decoding pipelines; (3) develop robust unsupervised or minimally supervised adaptation to day-to-day and within-day neural non-stationarities; (4) validate across more participants, including those with profound orofacial weakness; (5) evaluate long-term stability, safety, and clinical workflows to minimize daily training burdens; and (6) refine surgical targeting to reliably access speech-informative cortex across individuals.
Limitations
- Generalizability: Results are from a single participant; broader validation is required across individuals and etiologies, including those with more profound orofacial weakness.
- Training burden and non-stationarity: Daily collection of 260–480 sentences and retraining were used to maintain performance; although offline analyses suggest feasibility without same-day data, practical systems must reduce calibration time and handle neural drift robustly.
- Error rates for everyday use: An online WER around 24% for 125k words is likely insufficient for daily communication compared with state-of-the-art speech-to-text systems (~4–5%); further improvements are needed.
- Hardware considerations: Longevity, stability, and scalability of intracortical microelectrode arrays require continued validation; targeting variability across individuals may affect reliability.
- Cortical coverage: Arrays sampled limited cortical regions; although strong tuning was observed in ventral premotor cortex, the role of other regions and optimal array configurations remain to be established.