Medicine and Health
A high-performance speech neuroprosthesis
F. R. Willett, E. M. Kunz, et al.
Speech BCIs aim to translate neural activity from attempted speech into text or sound, offering faster communication than motor-based BCIs for people with paralysis. However, prior systems have not achieved high accuracy for unconstrained sentences from large vocabularies, limiting practical utility. Moreover, the single-neuron organization of orofacial movement and speech production in motor cortex remains unresolved. This study investigates the neural representation of speech articulators and production in areas 6v (ventral premotor cortex) and 44 (part of Broca’s area) at single-neuron resolution, and demonstrates real-time speech-to-text decoding of attempted speech from intracortical spiking activity across small (50 words) and large (125,000 words) vocabularies. The central hypotheses are that high-resolution intracortical recordings enable accurate, rapid decoding of phonemes and words from attempted speech, and that articulatory representations remain preserved after paralysis.
Previous speech decoding work has translated neural activity into text or synthesized speech using ECoG and other modalities, but accuracies have been insufficient for unconstrained, large-vocabulary communication (Herff et al., Moses et al., Anumanchipalli et al., Herff et al., Kellis et al., Pei et al., Mugler et al.). Motor-based BCIs enabled typing at 8–18 WPM (Pandarinath et al.) and handwriting-based BCIs achieved 18 WPM (Willett et al.). Natural conversation occurs around 160 WPM (Yuan et al.). Traditional views implicate Broca’s area (area 44) in speech production, but recent evidence questions its direct role in articulation and production (Tate et al., Flinker et al., Gajardo-Vidal et al., Andrews et al.). Functional organization of speech articulation across sensorimotor cortex has been shown using ECoG (Bouchard et al.). Modern speech recognition leverages deep RNNs and large language models (Hinton et al., Graves et al., Xiong et al.), and BCI robustness to neural variability can be improved via unsupervised/adaptive methods (Dyer et al., Farshchian et al., Degenhart et al., Karpowicz et al., Sussillo et al.). Intracortical technologies continue to mature, with safety and high-channel devices under development (Rubin et al., Musk & Neuralink, Sahasrabuddhe et al., Nurmikko, Vázquez-Guardado et al.).
Participant: Single participant (T12), a 67-year-old left-handed woman with bulbar-onset ALS, enrolled in the BrainGate2 trial. She retains limited orofacial movement and vocalization but cannot produce intelligible speech.

Arrays and recording: Four 64-channel microelectrode arrays were implanted (two in area 6v, ventral premotor cortex; two in area 44, part of Broca’s area), localized using the Human Connectome Project multimodal cortical parcellation; resting-state fMRI network mapping indicated ventral 6v’s involvement in language networks. Neural signals were band-pass filtered (250–5000 Hz), threshold crossings were detected at −4.5 × RMS, and spike band power was also extracted.

Instructed-delay tasks: Neural tuning was evaluated for 33 orofacial movements, 39 phonemes, and 50 words. A Gaussian naive Bayes classifier decoded movement categories from 1 s of neural population activity per trial; accuracies and confusion matrices were computed for area 6v vs area 44.

Real-time sentence decoding: An online 5-layer gated recurrent unit (GRU) RNN (TensorFlow 2) emitted phoneme probabilities every 80 ms from temporally binned and smoothed features (threshold crossings and spike band power). The RNN output included phonemes plus silence and blank tokens (CTC training). Phoneme probabilities were combined with a trigram language model (Kaldi) for word-level decoding. Two language models were used: large vocabulary (125,000 words; general English) and small vocabulary (50 words; the Moses et al. word set).

Training: At the start of each day, T12 attempted 260–480 randomly selected Switchboard sentences (41 ± 3.7 min of data; day duration ~140 min including breaks). RNNs were trained on that day’s data plus prior days’ (10,850 total sentences by the final day). Architecture adaptations addressed neural non-stationarity: day-specific input layers compensated for across-day changes, and rolling feature adaptation (e.g., rolling z-scoring) addressed within-day drift.
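The rolling z-scoring used for within-day drift can be sketched as follows; the window size and the exact normalization details here are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def rolling_zscore(features, window=50):
    """Adapt binned neural features to within-day drift by z-scoring
    each channel against a trailing window of recent time bins.

    features: (T, C) array of binned threshold crossings / spike band power.
    window: trailing-bin count for the running mean/std (assumed value).
    """
    adapted = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        lo = max(0, t - window)
        ref = features[lo:t + 1]               # trailing reference window
        mu = ref.mean(axis=0)
        sigma = ref.std(axis=0) + 1e-6         # avoid divide-by-zero on quiet channels
        adapted[t] = (features[t] - mu) / sigma
    return adapted
```

The idea is that each feature is normalized against its own recent history, so a slow drift in a channel's mean firing rate is continuously subtracted out before the features reach the decoder.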
Evaluation: Held-out sentences (never seen in training for the large vocabulary) were decoded online. T12 prepared during an instructed delay, then spoke upon a ‘go’ cue; decoded words appeared in real time, and she finalized outputs via button press. Both vocalized and silent (mouthed) speaking modes were tested across multiple days (5 vocalized, 3 silent). Phoneme error rates were computed on raw RNN output, and word error rates on the combined RNN+LM output.

Additional analyses: The contribution of arrays and electrodes was evaluated offline by ablating electrodes and comparing the ventral vs dorsal 6v arrays. Potential offline improvements were assessed via enhanced language models and proximal test sets (training and testing data collected closer in time) to mitigate within-day non-stationarity.

Articulatory representation analysis: Phoneme-specific neural “saliency” vectors were derived from the trained RNN by maximizing each phoneme’s probability; their representational similarity was compared to electromagnetic articulography (EMA) data from able-bodied speakers (USC-TIMIT, Haskins datasets). Consonants were analyzed by place of articulation; vowels by first/second formant structure (high–low; front–back). Statistical methods included bootstrap CIs, ANOVA for tuning (P < 1×10⁻⁵), Procrustes alignment in low-dimensional representations, and null distributions via label shuffling.
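Word error rate, the primary metric above, is the standard Levenshtein-alignment measure over word tokens; a minimal self-contained implementation (phoneme error rate is the same computation over phoneme tokens):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed by dynamic-programming edit-distance over word tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` scores one substitution plus one deletion against four reference words, giving 0.5.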
Neural encoding and classification: Area 6v exhibited strong, intermixed tuning to orofacial movements, phonemes, and words; area 44 contained little to no production-related information. Using naive Bayes on 1 s of population activity, area 6v achieved 92% accuracy for 33 orofacial movements, 62% for 39 phonemes, and 94% for 50 words; area 44 stayed below 12% across categories. Ventral 6v carried more speech information, especially during the instructed delay; dorsal 6v carried more orofacial movement information.

Speech BCI performance: Online WERs (95% CI): 50-word vocabulary, vocal 9.1% (7.2–11.2) and silent 11.2% (8.3–14.4); 125,000-word vocabulary, vocal 23.8% (21.8–25.9) and silent 24.7% (22.0–27.4). Phoneme error rates: vocal 19.7% (18.6–20.9), silent 20.9% (19.3–22.6). Speaking rate averaged 62 WPM, more than 3× faster than the prior BCI record (18 WPM). The RNN often produced sensible phoneme sequences even without the language model, and phoneme substitution errors aligned with articulatory similarity.

Array contribution: Combining the ventral and dorsal 6v arrays lowered WER (adding the dorsal array to the ventral array reduced offline WER from 32% to 21%). Most electrodes contributed redundantly: removing any single electrode increased phoneme error by <1%, and the most informative electrodes were concentrated in ventral 6v.

Offline ceilings: An improved LM reduced large-vocabulary WER from 23.8% to 17.4%; proximal test sets further reduced WER to 11.8%; phoneme error rate improved to 17.0%.

Design analyses: Vocabulary size: the accuracy advantage of small vocabularies dissipated by ~1,000 words, and intermediate vocabularies offered no sustained improvement. Electrode count: error improved log-linearly with channel count; each doubling of electrodes reduced error by ~43% (a factor of ~0.57). Training set: performance remained reasonable even with no same-day training data (offline WER ~30%), suggesting the feasibility of unsupervised adaptation.
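The reported log-linear electrode scaling (each doubling of channels multiplying error by ~0.57) can be turned into a simple extrapolation. This helper and its use for projecting error at higher channel counts are illustrative, not a calculation from the paper:

```python
import math

def projected_error(base_error, base_channels, channels, factor=0.57):
    """Extrapolate decoding error under a log-linear scaling law:
    each doubling of electrode count multiplies error by `factor`
    (~0.57 per the offline analysis; extrapolation is an assumption)."""
    doublings = math.log2(channels / base_channels)
    return base_error * factor ** doublings
```

Under this (optimistic) extrapolation, an online WER of 23.8% at 256 channels would drop to roughly 13.6% at 512 channels and roughly 7.7% at 1,024 channels, which is why the discussion points to higher channel counts as one route toward clinically acceptable error rates.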
This work demonstrates, for the first time to our knowledge, large-vocabulary decoding of unconstrained sentences from attempted speech using intracortical spiking activity, at 62 WPM—exceeding communication rates of alternative assistive technologies (e.g., eye tracking) and prior BCIs. The findings address the research question by showing that the ventral premotor cortex (area 6v) contains a rich, spatially intermixed articulatory representation sufficient for accurate, rapid decoding from a small cortical area, while area 44 provides minimal articulatory information, aligning with updated views on Broca’s area. The preserved detailed articulatory code for consonants and vowels years after paralysis supports the viability of speech BCIs for individuals unable to speak intelligibly. Although current WERs (~24% online for a 125k-word vocabulary) are above state-of-the-art speech-to-text systems (4–5%), offline analyses indicate that significant improvements are achievable via better language models, increased channel counts, and algorithms robust to neural non-stationarities. Overall, high-resolution intracortical recordings coupled with modern sequence models and language modeling provide a realistic path toward clinically meaningful, conversational-speed communication for people with paralysis.
Intracortical recordings from area 6v enable high-speed, high-accuracy decoding of attempted speech, achieving the first large-vocabulary demonstration of unconstrained sentence decoding at 62 WPM and markedly improved small-vocabulary accuracy. The study reveals an intermixed, strong articulatory representation at single-neuron resolution and a preserved, detailed phoneme code post-paralysis. Future work should focus on reducing daily training demands, implementing unsupervised adaptation to neural drift, improving language models, increasing channel counts and coverage within ventral premotor cortex, validating across more participants (including those with profound orofacial weakness), and refining surgical targeting to reliably capture speech-relevant regions. These directions aim to reduce WERs to clinically acceptable levels and realize practical speech neuroprostheses for rapid communication.
The system is a proof-of-concept, not yet clinically viable. Online large-vocabulary WER (~24%) remains higher than conventional speech-to-text (4–5%), which may limit everyday use. Decoder training currently requires substantial per-day data (260–480 sentences) and time, and performance can be impacted by within-day and across-day neural non-stationarities. Generalizability is unproven: results derive from a single participant with some residual orofacial movement; applicability to individuals with more severe orofacial weakness and across diverse brain anatomies is uncertain. Intracortical microelectrode array technology requires further demonstrations of longevity, reliability, and scalability; surgical targeting of speech-relevant cortex may vary across individuals. Area 44 arrays provided minimal usable production information, constraining signal sources. Ethical, safety, and device maturation considerations remain for widespread clinical adoption.