A streaming brain-to-voice neuroprosthesis to restore naturalistic communication

K. T. Littlejohn, C. J. Cho, et al.

Abstract text is not available in the record. The research was conducted by Kaylo T. Littlejohn, Cheol Jun Cho, Jessie R. Liu, Alexander B. Silva, Bohan Yu, Vanessa R. Anderson, Cady M. Kurtz-Miott, Samantha Brosler, Anshul P. Kashyap, Irina P. Hallinan, Adit Shah, Adelyn Tu-Chan, Karunesh Ganguly, David A. Moses, Edward F. Chang, and Gopala K. Anumanchipalli.
Introduction

This work presents a streaming brain-to-voice neuroprosthesis that decodes attempted speech from electrocorticography (ECoG) into naturalistic audio and text in real time. The study targets individuals with severe paralysis and anarthria, for whom conventional augmentative and alternative communication (AAC) systems are slow and burdensome. The central hypothesis is that stable, information-rich ECoG signals recorded over speech motor and language cortices can support fast, accurate, naturalistic communication via modern sequence-transduction and speech-synthesis models. A related clinical protocol explores the feasibility of long-term ECoG-based interfaces for motor and communication restoration.

Literature Review

Prior studies indicate that noninvasive EEG is limited for high-speed communication by its low signal-to-noise ratio and susceptibility to muscle artifacts, while intracortical spike recordings face long-term stability challenges. ECoG offers a promising middle ground: robust long-term recordability, sufficient SNR, and rich information about articulation and speech processing. Foundational work demonstrated ECoG-based movement and speech decoding, and clinical systems have shown durable implantation for epilepsy monitoring and responsive stimulation. The AAC landscape often relies on gaze tracking, which may be impractical for people with advanced ALS or brainstem stroke. Recent large ASR models (e.g., Whisper) enable automatic evaluation of synthesized speech quality. Modern neural sequence models (RNN-T, CTC, Transformers) support streamable decoding without strict alignment, and GAN vocoders (HiFi-GAN derivatives) provide high-fidelity speech synthesis.

Methodology

Participants and clinical protocol: A Phase I, single-center, early-feasibility clinical study implants PMT subdural cortical electrode arrays (128 or 253 channels) over the dominant speech and sensorimotor cortices, bonded to Blackrock pedestals and interfaced with the Digital NeuroPort system. Phased testing focuses first on signal stability and real-time control optimization, then on communication and assistive-robotics control. Safety oversight includes a DSMB and adverse-event reporting.

Neural decoding framework: A streamable RNN-T architecture predicts either subword text tokens or discrete speech units from neural inputs. The model comprises a transcriber (a modality-specific causal encoder), a causal predictor over previous outputs, and a joiner that combines the two; training maximizes the output likelihood via RNNTLoss. Beam search is adapted for streaming with parameters K=5 (speech) or 20 (text), J=1 (speech) or 10 (text), and M=20 (both). CTC baselines with CNN-Transformer encoders are also trained for full-context comparisons. Minimal training and decoding sketches follow below.

Bimodal pretraining: Predictors and joiners are pretrained on LibriSpeech (960 h) ASR tasks; speech models predict HuBERT discrete units from mel-spectrogram inputs, while text models use subword BPE tokens. Pretrained predictors are frozen during neural-transducer training, and pretrained joiner weights initialize the ECoG models.
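
To make the transducer structure concrete, the minimal PyTorch sketch below wires a transcriber, predictor, and joiner into torchaudio's RNN-T loss; the layer sizes, the toy CNN/GRU encoder, and the TinyTransducer name are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torchaudio

class TinyTransducer(nn.Module):
    def __init__(self, n_channels=253, n_tokens=101, d=256):
        super().__init__()
        # Transcriber: causal encoder over streamed neural features.
        self.conv = nn.Conv1d(n_channels, d, kernel_size=4, stride=2)
        self.encoder = nn.GRU(d, d, num_layers=3, batch_first=True)
        # Predictor: causal model over previously emitted tokens.
        self.embed = nn.Embedding(n_tokens, d)
        self.predictor = nn.GRU(d, d, num_layers=1, batch_first=True)
        # Joiner: combines encoder and predictor states into token logits.
        self.joiner = nn.Linear(d, n_tokens)

    def forward(self, feats, targets):
        # feats: (B, T, n_channels); targets: (B, U) token ids (blank = 0).
        f, _ = self.encoder(self.conv(feats.transpose(1, 2)).transpose(1, 2))
        # Prepend a blank so the predictor has a start-of-sequence input.
        g, _ = self.predictor(self.embed(nn.functional.pad(targets, (1, 0))))
        # Broadcast over the (time, label) lattice: (B, T', U + 1, n_tokens).
        return self.joiner(torch.tanh(f.unsqueeze(2) + g.unsqueeze(1)))

model = TinyTransducer()
feats = torch.randn(2, 100, 253)                       # fake neural features
targets = torch.randint(1, 101, (2, 8), dtype=torch.int32)
logits = model(feats, targets)
loss = torchaudio.functional.rnnt_loss(
    logits, targets,
    logit_lengths=torch.full((2,), logits.shape[1], dtype=torch.int32),
    target_lengths=torch.full((2,), 8, dtype=torch.int32),
    blank=0,
)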

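Streaming inference can likewise be sketched as a greedy (K=1) decode loop, a deliberate simplification of the streaming beam search described above (the real decoder keeps K hypotheses and bounds emissions per step); the stub networks and sizes below are placeholders, not the paper's models.

import torch
import torch.nn as nn

torch.manual_seed(0)
BLANK, N_TOKENS, D = 0, 101, 32
encoder_step = nn.Linear(16, D)          # stub transcriber: one frame -> state
predictor = nn.GRUCell(D, D)             # stub causal predictor over tokens
embed = nn.Embedding(N_TOKENS, D)
joiner = nn.Linear(D, N_TOKENS)          # stub joiner -> token logits

hyp, g = [], torch.zeros(1, D)           # emitted tokens, predictor state
J = 10                                   # cap on non-blank emissions per frame
for frame in torch.randn(50, 16):        # frames arrive one at a time
    f = encoder_step(frame).unsqueeze(0)
    for _ in range(J):
        token = int(joiner(torch.tanh(f + g)).argmax(-1))
        if token == BLANK:               # blank: move on to the next frame
            break
        hyp.append(token)                # non-blank: emit, update predictor
        g = predictor(embed(torch.tensor([token])), g)
print(hyp)
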
Neural transcribers and training:

  • ECoG (real-time streaming): a CNN plus unidirectional LSTM/GRU encoder; augmentation includes random channel dropout (HGA/LFS features) and temporal window jitter. Models were trained with Adam (learning rate 1e-4, β1=0.9, β2=0.999), batch size 64 (1024-word General) or 32 (50-phrase AAC), for hundreds of epochs on an NVIDIA A6000; inference ran on a V100.
  • MEA (competition dataset, silent speech): the streaming transcriber uses Gaussian smoothing plus an unfold operation (kernel size 6, stride 2), linear projections, a 3-layer unidirectional GRU, and a 1D CNN; the full-context variant adds a session layer and a bidirectional GRU.
  • EMG (silent-speech dataset): the streaming transcriber uses four ResBlocks (hidden size 1024), a 4-layer unidirectional GRU, and a 1D CNN; the full-context variant uses a CNN feature extractor (3 ResBlocks) and a 3-layer unidirectional GRU. The RNN-T loss weighting typically favors discrete speech units (e.g., 0.7 speech, 0.3 text).

Speech synthesis: A personalized, streamable HiFi-CAR vocoder (an autoregressive extension of HiFi-GAN), conditioned on 768-dimensional discrete speech-unit embeddings and recent audio context (the previous 512 samples), generates 16 kHz waveforms. A duration predictor restores de-duplicated HuBERT unit sequences from ~39 Hz back to 50 Hz, and a constant factor (×7) simulates a slower speaking rate. The vocoder is trained adversarially with multi-period and multi-scale discriminators, combining reconstruction (mel L1), feature-matching, and adversarial losses following the HiFi-GAN recipe (L_G = L_A + 2·L_F + 45·L_M). Training used LJSpeech (20.3 h) converted to the participant's pre-injury voice, obtained from a denoised 3-second clip via YourTTS voice conversion; HiFi-GAN weights were initialized from LibriTTS-pretrained models, with a MultiStepLR schedule and batch size 32.

Incremental text-to-speech: A VITS model (trained on LJSpeech) synthesizes the full phrase on each decoder update while segmenting and playing back only the newly generated portion, as governed by duration predictors. The participant-selected speaking rate corresponds to a length scaling factor of 1.61.
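
To make the duration-restoration step concrete, the sketch below expands de-duplicated units back to a fixed frame rate using predicted per-unit repeat counts; restore_durations is a hypothetical helper, and the unit ids and durations are made-up examples.

import torch

def restore_durations(units: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # units: (N,) de-duplicated unit ids; durations: (N,) predicted repeat counts.
    counts = durations.round().clamp(min=1).long()  # each unit appears at least once
    return torch.repeat_interleave(units, counts)   # frame-rate unit sequence

units = torch.tensor([17, 4, 98])              # made-up HuBERT unit ids
durations = torch.tensor([2.3, 1.1, 4.8])      # predicted frames per unit at 50 Hz
frames = restore_durations(units, durations)   # [17, 17, 4, 98, 98, 98, 98, 98]
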
Evaluation:

  • Perceptual assessment via Amazon Mechanical Turk: 12 raters per trial transcribe the decoded audio, and the transcript with the median CER is selected; preprocessing removes punctuation and special tokens.
  • Automatic speech recognition: Whisper large-v2 transcribes the synthesized audio; rule-based truncation (at most 9 words, at most 2 repeated characters per word) mitigates rare ASR artifacts, while longer transcripts are allowed for the multimodality generalization analyses. A sketch of the cleanup rules follows.
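
A minimal sketch of this cleanup, assuming the stated limits (at most 9 words; character runs capped at length 2); the clean_asr_transcript helper and its regular expressions are illustrative, not the study's code.

import re

def clean_asr_transcript(text: str, max_words: int = 9) -> str:
    text = re.sub(r"[^\w\s']", "", text.lower())   # drop punctuation/special tokens
    words = [re.sub(r"(.)\1{2,}", r"\1\1", w)      # "hellooooo" -> "helloo"
             for w in text.split()]
    return " ".join(words[:max_words])             # truncate to 9 words

print(clean_asr_transcript("Hello, hellooooo there there there you you you say say"))
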
Multimodality datasets:

  • ECoG (current participant): 9,368 training sentences (excluding the test session), 100 held-out validation trials, and 200 unseen test trials for speech synthesis.
  • MEA: 9,680 samples; training 9,420 (7,560 vocalized, 1,860 silent), validation 100 silent, test 160 silent; features include spike-band power and threshold crossings (128 electrodes, area 6v), z-scored within blocks.
  • EMG: 8,653 samples; after filtering and cleaning, training 7,800 (6,529 voiced, 1,271 silent), validation 196 silent, test 98 silent; signal processing includes band-stop filters at 60 Hz harmonics, a 2 Hz high-pass filter, subsampling to 689.06 Hz, rescaling, and soft de-spiking; training excludes utterances longer than 20,000 samples.

CTC models: CNN-Transformer encoders per modality with dual heads (speech and text) and log-softmax outputs, trained with PyTorch's CTC loss (zero_infinity=True); architectures and schedulers are modality-specific, and no external language model is used in beam search. A dual-head sketch follows.
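
As a hedged illustration of the CTC baseline's structure, the sketch below pairs one shared encoder with separate log-softmax heads for speech units and text; all sizes and the DualHeadCTC name are assumptions, and only the dual heads, log-softmax outputs, and zero_infinity=True come from the text.

import torch
import torch.nn as nn

class DualHeadCTC(nn.Module):
    def __init__(self, d_in=256, d=256, n_units=101, n_chars=40):
        super().__init__()
        self.proj = nn.Linear(d_in, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.speech_head = nn.Linear(d, n_units)   # discrete speech units
        self.text_head = nn.Linear(d, n_chars)     # text tokens

    def forward(self, x):
        h = self.encoder(self.proj(x))
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        return (self.speech_head(h).log_softmax(-1).transpose(0, 1),
                self.text_head(h).log_softmax(-1).transpose(0, 1))

model = DualHeadCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)      # zero_infinity as in the text
sp_logp, tx_logp = model(torch.randn(2, 120, 256)) # fake encoder inputs
targets = torch.randint(1, 101, (2, 10))           # fake speech-unit targets
loss = ctc(sp_logp, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 10))
# A weighted sum of the speech- and text-head losses would train both jointly.
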
Key Findings

Participant's baseline AAC performance (Tobii Dynavox): 8.61 words/min and 34.1 characters/min without autocomplete; 14.2 words/min and 56.4 characters/min with autocomplete; speech-synthesis latency 23.2 ± 3.66 s.

ASR vs. perceptual evaluation (Figure S1; Table S12): No significant differences between Whisper ASR transcripts and crowdsourced perceptual transcripts across phoneme, word, and character error rates (two-tailed Wilcoxon signed-rank tests, Holm-Bonferroni corrected; all p > .05). Observed median error rates using ASR:

  • 50-phrase AAC: PER 9.23% (99% CI [3.74, 16.4]), WER 12.9% (99% CI [5.26, 25.0]), CER 8.80% (99% CI [3.13, 14.3]).
  • 1024-word General: PER 49.9% (99% CI [42.1, 61.6]), WER 62.4% (99% CI [53.6, 75.6]), CER 48.8% (99% CI [40.2, 61.5]).

Real-time decoding above chance (Tables S1, S2): Significant improvements over chance for both text-to-speech and direct speech decoding across PER/WER/CER:
  • 50-phrase AAC: corrected p=3.662e-04 across all metrics and modalities.
  • 1024-word General: corrected p=1.172e-02 across all metrics and modalities.

Decoding speed (Table S3): Streaming words-per-minute significantly exceeds the prior benchmark (Metzger et al., 2023) for both datasets (p≈1.0e-43 for the 50-phrase AAC set; p≈1.5e-32 for the 1024-word General set).

Long-form decoding (Table S4): For the 1024-word General set, bootstrapped median metrics significantly outperform chance for both speech and text across PER/WER/CER (corrected p=1.156e-164).

Multimodality generalization (Table S5): Streaming and full-context RNN-T and CTC models significantly outperform chance across ECoG, MEA, and EMG for PER/WER/CER, with modality-specific corrected p-values (e.g., ECoG CER p=1.717e-05; MEA, all metrics, p=2.747e-04; EMG, all metrics, p=1.758e-02).

Auditory feedback (Tables S6, S7): Median error rates differ only slightly between the auditory-feedback and no-feedback conditions, and the comparisons are not statistically significant (high corrected p-values across PER/WER/CER).

Region ablations (Tables S8, S9): Removing both precentral and postcentral gyrus channels severely degrades performance (e.g., median speech WER 96.12% and text WER 78.68%). Excluding single regions (PrCG, PoCG, TG) causes moderate drops; the all-channels vs. dual-region comparison shows significant deterioration (corrected p=5.8594e-03).

Training-data length (Tables S10, S11): Reducing training data from ~58.4 h to ~8.6 h markedly increases error rates (e.g., median speech PER rises to 70.42% and text PER to 52.99%), with significant differences between the longer and shorter training sets (corrected p=5.8594e-03 across multiple comparisons).
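
For reference, the statistical recipe used throughout (two-tailed Wilcoxon signed-rank tests with Holm-Bonferroni correction across metrics) can be sketched as follows; the paired samples are fabricated placeholders, not study data.

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = []
for metric in ("PER", "WER", "CER"):
    asr = rng.uniform(0.0, 0.3, size=50)              # placeholder per-trial rates
    perceptual = asr + rng.normal(0.0, 0.02, size=50)
    pvals.append(wilcoxon(asr, perceptual, alternative="two-sided").pvalue)
reject, p_corrected, _, _ = multipletests(pvals, alpha=0.05, method="holm")
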
Discussion

These findings support the conclusion that streamable ECoG-based decoding, paired with modern RNN-T sequence transduction and a personalized high-fidelity vocoder, can enable naturalistic real-time communication. Automatic ASR evaluation with Whisper provides reliable, scalable assessment equivalent to perceptual ratings, facilitating standardized benchmarking. The neuroprosthesis generalizes across modalities (ECoG, MEA, EMG) and decoding objectives (speech units and text), highlighting the robustness of the bimodally pretrained predictors and joiners and the effectiveness of modality-specific transcribers. Performance depends strongly on sensorimotor cortical coverage (precentral and postcentral gyri) and adequate training duration, consistent with the neurophysiological role of articulatory motor cortex and the data demands of sequence models. Auditory feedback did not significantly change error rates, suggesting that model-side temporal consistency and beam-search constraints may dominate streaming performance. The clinical protocol demonstrates feasibility for long-term use and outlines safety pathways for expansion to broader indications and assistive-device control.

Conclusion

This supplementary work details a comprehensive, streamable brain-to-voice neuroprosthesis that integrates ECoG decoding with RNN-T/CTC models and a personalized HiFi-CAR vocoder, achieves above-chance performance in real-time speech synthesis and text decoding, and validates ASR-based evaluation. Robustness across modalities and tasks, sensitivity to cortical coverage, and the effects of training-data length provide practical guidance for future clinical deployments. Future research should expand participant cohorts to assess generalizability, optimize sensor placement and channel selection, improve low-data training through transfer learning and self-supervision, refine streaming beam search and language modeling, and integrate closed-loop auditory and articulatory feedback to further improve intelligibility and latency. Continued clinical trials will clarify long-term stability, safety, and user-centered outcomes.

Limitations

The early-feasibility, single-participant focus limits generalizability. Many analyses rely on ASR transcripts, which, despite their equivalence to perceptual ratings, can produce artifacts that require truncation rules. The vocoder is trained on a single-speaker corpus with voice conversion from a short (3 s) pre-injury clip, which may constrain prosody and speaker variability. Region ablations show strong dependence on precentral/postcentral coverage, implying sensitivity to implant location. Performance degrades notably with limited training data. CTC baselines omitted external language models, possibly underestimating achievable text-decoding accuracy. The clinical protocol emphasizes feasibility rather than pre-registered statistical endpoints.
