Medicine and Health
Online speech synthesis using a chronically implanted brain-computer interface in an individual with ALS
M. Angrick, S. Luo, et al.
Using a chronically implanted BCI, researchers achieved online synthesis of intelligible words from electrocorticographic signals in a man with ALS, decoding and synthesizing freely chosen commands from a six-word vocabulary with 80% recognition accuracy while preserving the participant’s voice profile. Research conducted by Miguel Angrick, Shiyu Luo, Qinwan Rabbani, Daniel N. Candrea, Samyak Shah, Griffin W. Milsap, William S. Anderson, Chad R. Gordon, Kathryn R. Rosenblatt, Lora Clawson, Donna C. Tippett, Nicholas Maragakis, Francesco V. Tenore, Matthew S. Fifer, Hynek Hermansky, Nick F. Ramsey, and Nathan E. Crone.
~3 min • Beginner • English
Introduction
A variety of neurological disorders, including amyotrophic lateral sclerosis (ALS), can severely impair speech production and other purposeful movements while sparing cognition, resulting in communication impairments of varying severity, up to and including locked-in syndrome. Affected individuals may rely on augmentative and alternative communication (AAC) technologies to select options on a communication board, but this can be slow and effortful and may require caregiver intervention. Advances in implantable brain-computer interfaces (BCIs) have demonstrated the feasibility of communication through direct brain control, yet many control modalities remain slower and less intuitive than speech-based BCI control. Recent studies have explored decoding attempted speech from brain activity into text or acoustic speech, using models such as linear regression, convolutional and recurrent neural networks, encoder–decoder architectures, and concatenative approaches. Voice activity has been identified in ECoG and stereotactic EEG recordings, and speech has been decoded at the level of phonemes, vowels, words, and articulatory gestures.
Until now, brain-to-speech decoding has primarily been reported in individuals with unimpaired speech (e.g., epilepsy surgery patients), leaving uncertainty about how well it translates to individuals with motor speech impairments such as those caused by ALS. Recent work has demonstrated recovery of text from ECoG or microelectrode recordings in patients with anarthria or dysarthria, as well as real-time synthesis of vowel sounds or sentences in cued settings. In prior work, a plug-and-play system allowed a clinical trial participant living with ALS to issue commands via speech-based BCI control. Non-invasive BCIs for speech decoding (EEG, fNIRS, fMRI, MEG) have been investigated, largely focusing on imagined speech; while they avoid surgical risks and may be adopted more broadly, they currently lack sufficient temporal or spatial resolution, or are not yet feasible for at-home use.
Here, we show that an individual living with ALS and participating in a clinical trial of an implantable BCI was able to produce audible, intelligible words closely resembling his own voice, spoken at his own pace, through online decoding of ECoG signals from cortical regions representing articulation and phonation. The participant had impairments in articulation and phonation; individual words were intelligible but sentences often were not. We focused on a closed vocabulary of 6 keywords used to control a communication board, which the participant could reliably produce. Training data were acquired over 6 weeks and the speech synthesis BCI was deployed in several closed-loop sessions. Because the participant could still produce speech, neural and acoustic signals could be reliably time-aligned to map cortical activity during overt speech to acoustic features. Delayed auditory feedback was provided to mitigate anticipated discordance between actual and synthesized speech with disease progression. This work demonstrates a self-paced BCI translating brain activity directly to acoustic speech preserving voice characteristics, with most synthesized words recognized by human listeners, supporting translation of speech synthesis from neural signals to individuals with neurological speech impairments and further evidencing ECoG stability for speech BCIs.
Literature Review
The paper reviews prior efforts in speech-based BCIs and neural speech decoding: reconstruction of acoustic speech from ECoG using linear regression, CNNs, RNNs, encoder–decoder frameworks, and concatenative approaches; identification of voice activity in ECoG and sEEG; decoding at levels of phonemes, vowels, words, and articulatory gestures. It contrasts invasive ECoG/microelectrode approaches with non-invasive methods (EEG, fNIRS, fMRI, MEG) focusing on imagined speech due to movement artifacts. It cites demonstrations of text recovery from ECoG in anarthric patients, microelectrode-based decoding in ALS, real-time synthesis of vowels, and sentence synthesis in cued settings. Prior plug-and-play speech command decoding for communication board control in ALS is noted. The review highlights the gap in translating brain-to-speech decoding from unimpaired speakers to those with motor speech impairments and emphasizes the potential stability advantages of ECoG for long-term BCIs.
Methodology
Approach: The system consists of three recurrent neural networks that (1) identify and buffer speech-related neural activity, (2) transform sequences of speech-related neural activity into an intermediate acoustic representation, and (3) synthesize the acoustic waveform via a vocoder. ECoG signals were acquired from two 8×8 subdural electrode arrays covering ventral sensorimotor cortex (face/tongue) and upper limb regions. A subset of 64 electrodes showing significant high-gamma changes during overt speech was used. Broadband high-gamma power features (70–170 Hz) were extracted with two 8th-order IIR Butterworth filters (a band-pass at 70–170 Hz and a band-stop attenuating 118–122 Hz), using 50 ms windows with a 10 ms frameshift, and normalized per day using statistics from a syllable-repetition baseline task.
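A minimal offline sketch of this feature-extraction step is shown below, assuming a (samples × channels) ECoG array at 1000 Hz; filter design details are approximated, zero-phase filtering is used for simplicity (the online system must filter causally), and all names are illustrative rather than taken from the authors' code.

```python
# Hedged sketch of high-gamma feature extraction (not the authors' implementation).
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 1000    # ECoG sampling rate (Hz)
WIN = 50     # analysis window (ms)
SHIFT = 10   # frameshift (ms)

# Approximate the two Butterworth filters described in the text:
# band-pass 70-170 Hz and band-stop 118-122 Hz.
bp = butter(8, [70, 170], btype="bandpass", fs=FS, output="sos")
bs = butter(8, [118, 122], btype="bandstop", fs=FS, output="sos")

def high_gamma_features(ecog: np.ndarray) -> np.ndarray:
    """ecog: (samples, channels) -> (frames, channels) log-power features."""
    x = sosfiltfilt(bp, ecog, axis=0)
    x = sosfiltfilt(bs, x, axis=0)
    win, hop = WIN * FS // 1000, SHIFT * FS // 1000
    frames = []
    for start in range(0, x.shape[0] - win + 1, hop):
        seg = x[start:start + win]
        frames.append(np.log(np.mean(seg ** 2, axis=0) + 1e-8))  # windowed power
    return np.asarray(frames)

def normalize_per_day(feats: np.ndarray, baseline_feats: np.ndarray) -> np.ndarray:
    """Z-score features using statistics from the day's syllable-repetition baseline."""
    mu, sd = baseline_feats.mean(axis=0), baseline_feats.std(axis=0) + 1e-8
    return (feats - mu) / sd
```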
Neural Voice Activity Detection (nVAD): A unidirectional RNN (two LSTM layers, 150 units each; linear output layer with two logits) identifies speech onset and segments for buffering, employing temporal context correction and a 0.5 s context window for smoother transitions. Truncated backpropagation through time was used with k1=50 and k2=100 frames (1 s unfolding, updates every 500 ms), dropout 50%, cross-entropy loss, RMSprop optimizer with initial learning rate 0.001, early stopping on validation accuracy. Trained for 27,975 update steps with 311,102 trainable parameters; achieved 93.4% frame-wise accuracy on validation.
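A minimal PyTorch sketch of an nVAD model with this shape (two unidirectional LSTM layers of 150 units, 50% dropout, a two-logit linear head, cross-entropy loss, RMSprop at lr 0.001) follows; the truncated-BPTT schedule (k1=50, k2=100) and early stopping are only hinted at here, and the chunking shown is illustrative.

```python
# Hedged sketch of the nVAD architecture and one training update.
import torch
import torch.nn as nn

class NeuralVAD(nn.Module):
    def __init__(self, n_channels: int = 64, hidden: int = 150):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=2,
                            dropout=0.5, batch_first=True)
        self.out = nn.Linear(hidden, 2)   # speech vs. non-speech logits

    def forward(self, x, state=None):
        # x: (batch, frames, channels); state carries (h, c) between chunks
        h, state = self.lstm(x, state)
        return self.out(h), state

model = NeuralVAD()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative update on a dummy chunk of 100 frames (~1 s of unfolding).
x = torch.randn(1, 100, 64)
y = torch.zeros(1, 100, dtype=torch.long)          # frame-wise speech labels
logits, _ = model(x)
loss = criterion(logits.reshape(-1, 2), y.reshape(-1))
optimizer.zero_grad(); loss.backward(); optimizer.step()
```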
Decoding model: A bidirectional LSTM maps high-gamma features to acoustic parameters: 18 Bark-scale cepstral coefficients plus 2 pitch parameters, the feature set consumed by the LPCNet vocoder. A linear output layer performs the regression. Standard BPTT over extracted speech segments; dropout 50%, RMSprop with initial learning rate 0.001, early stopping. Trained for 14,130 update steps with 378,420 trainable parameters. Estimated acoustic features are converted to waveforms using LPCNet (pretrained weights, C implementation, Cython wrappers). Synthesized speech is played back as delayed auditory feedback after each utterance.
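The following PyTorch sketch illustrates a decoder of this form, regressing 20 acoustic parameters per frame from one buffered speech segment; the number of bidirectional LSTM layers and the hidden size are assumptions, since the text does not state them.

```python
# Hedged sketch of the neural-to-acoustic decoding model (layer sizes assumed).
import torch
import torch.nn as nn

class SpeechDecoder(nn.Module):
    def __init__(self, n_channels: int = 64, hidden: int = 150, n_acoustic: int = 20):
        super().__init__()
        self.blstm = nn.LSTM(n_channels, hidden, num_layers=2, dropout=0.5,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_acoustic)   # linear regression head

    def forward(self, segment):
        # segment: (1, frames, channels) -- one nVAD-buffered utterance
        h, _ = self.blstm(segment)
        return self.out(h)                              # (1, frames, 20)

decoder = SpeechDecoder()
optimizer = torch.optim.RMSprop(decoder.parameters(), lr=1e-3)
features = decoder(torch.randn(1, 120, 64))             # dummy 1.2 s segment
loss = nn.functional.mse_loss(features, torch.randn(1, 120, 20))
loss.backward(); optimizer.step()
# The predicted 18 cepstral + 2 pitch parameters would then be handed to the
# LPCNet vocoder (C implementation, Cython wrappers) to synthesize the waveform.
```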
Closed-loop architecture: Implemented in Python 3.10 using ezmsg (publish/subscribe asynchronous coroutines) in a directed acyclic network. ECoG data streamed from BCI2000 via custom ZeroMQ over TCP/IP in 40 ms packets. Units for feature extraction, nVAD, decoding, vocoder, and audio playback; logging units run in separate processes. PCM bytes piped to SoX for playback. Performance-critical components (ringbuffers, high-gamma calculations) implemented in Cython.
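The sketch below illustrates only the ingest side of such a pipeline, assuming raw ECoG arrives as 40 ms packets (40 samples × 128 channels of float32) on a ZeroMQ SUB socket; the endpoint and wire format are assumptions, not the actual BCI2000-to-ezmsg bridge.

```python
# Hedged sketch of receiving 40 ms ECoG packets over ZeroMQ (pyzmq).
import numpy as np
import zmq

N_CHANNELS = 128
SAMPLES_PER_PACKET = 40            # 40 ms at 1000 Hz

ctx = zmq.Context()
sock = ctx.socket(zmq.SUB)
sock.connect("tcp://localhost:5556")   # hypothetical endpoint
sock.setsockopt(zmq.SUBSCRIBE, b"")    # subscribe to all messages

def packets():
    """Yield (40, 128) float32 arrays as they arrive from the acquisition machine."""
    while True:
        raw = sock.recv()
        yield np.frombuffer(raw, dtype=np.float32).reshape(
            SAMPLES_PER_PACKET, N_CHANNELS)

# Each packet would then flow through feature extraction -> nVAD -> buffering
# -> decoding -> LPCNet -> audio playback units in the ezmsg graph.
```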
Participant: Male native English speaker in his 60s with ALS diagnosed 8 years prior; bulbar and upper extremity impairments; continuous speech mostly unintelligible, individual words intelligible; ALSFRS-R score 26/48 (speech rated 1). Informed consent obtained; implanted July 2022.
Study device and implantation: Two 8×8 subdural grids (PMT Corporation) connected to a percutaneous 128-channel Neuroport pedestal (Blackrock Neurotech). Platinum–iridium disc electrodes (2 mm exposed diameter, 4 mm spacing); total surface area 12.11 cm². Placement over left hemisphere sensorimotor representations for speech and upper extremity movements during awake craniotomy; targeting via structural MRI, fMRI, and intraoperative somatosensory evoked potentials. Two subdural reference wires attached to pedestal; post-op CT co-registered to pre-op MRI for verification.
Data recording: Neural signals sampled at 1000 Hz via NeuroPlex-E headstage to NeuroPort Biopotential Signal Processor; synchronized audio recorded via microphone (SHURE BETA 58A) and Zoom H6; audio split to NSP analog input (synchronized 1 kHz) and 48 kHz high-quality recording; cross-correlation used for alignment; downsampled to 16 kHz for processing. Energy-based VAD estimated boundaries. LPC features computed via LPCNet encoding with 10 ms frameshift; acoustic padding of 16 ms to account for filter shift.
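A simple sketch of the cross-correlation alignment between the two audio copies is shown below, assuming the 48 kHz recording is first resampled to the NSP's 1 kHz rate and the lag is estimated from envelope cross-correlation; function and variable names are illustrative.

```python
# Hedged sketch of aligning the NSP audio copy and the 48 kHz recording.
import numpy as np
from scipy.signal import correlate, resample_poly

def align_offset_s(nsp_audio_1k: np.ndarray, hq_audio_48k: np.ndarray) -> float:
    """Return the lag (s) of the high-quality recording relative to the NSP copy."""
    hq_1k = resample_poly(hq_audio_48k, up=1, down=48)     # 48 kHz -> 1 kHz
    a = np.abs(nsp_audio_1k) - np.mean(np.abs(nsp_audio_1k))
    b = np.abs(hq_1k) - np.mean(np.abs(hq_1k))
    xc = correlate(a, b, mode="full")
    lag = np.argmax(xc) - (len(b) - 1)                      # lag in 1 kHz samples
    return lag / 1000.0
```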
Experiment recordings and task: Each day began with syllable repetition (12 consonant–vowel syllables, each repeated 5 times) for baseline normalization. Training for nVAD and decoding: over 6 weeks, participant read single words on screen from a 6-keyword vocabulary (Left, Right, Up, Down, Enter, Back) with occasional silence trials. Ten repetitions per word per block (60 words), pseudo-random order; word displayed 2 s, intertrial interval 3 s. Offline analysis ensured ≥1 s between end of speech and next trial onset to prevent leakage. Training/validation/test collected over 10 days; online deployment occurred 5.5 months later. During online sessions, synthesized output was played after each spoken word. Dataset sizes: training 1,570 trials (~80 min total; 21.8 min pure speech); validation and test 70 trials each (~3 min total; 0.9 min pure speech). Datasets collected on separate days to avoid leakage of baseline statistics.
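A small sketch of one training block's structure, under the timing stated above (2 s word display, 3 s intertrial interval, 10 pseudo-randomized repetitions per keyword), might look as follows; randomization details beyond the text are assumptions.

```python
# Hedged sketch of one 60-trial training block schedule.
import random

KEYWORDS = ["Left", "Right", "Up", "Down", "Enter", "Back"]
WORD_DISPLAY_S, ITI_S = 2.0, 3.0

def make_block(seed: int = 0):
    rng = random.Random(seed)
    trials = KEYWORDS * 10                  # 10 repetitions per keyword
    rng.shuffle(trials)                     # pseudo-random order
    schedule, t = [], 0.0
    for word in trials:
        schedule.append({"word": word, "onset_s": t, "offset_s": t + WORD_DISPLAY_S})
        t += WORD_DISPLAY_S + ITI_S
    return schedule                         # 60 trials, ~5 min per block
```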
Signal processing and channel selection: CAR within each 64-contact grid (excluding poor-quality channels). Selected 64 channels across motor, premotor, somatosensory cortices including dorsal laryngeal area, based on significant high-gamma responses in syllable task.
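A minimal sketch of per-grid common average referencing with bad-channel exclusion, assuming the two 64-contact grids are stacked into a single 128-channel array, is shown below.

```python
# Hedged sketch of common average referencing (CAR) within each grid.
import numpy as np

def car_per_grid(ecog: np.ndarray, grid_slices, bad_channels=()):
    """ecog: (samples, 128). Subtract each grid's mean over good channels."""
    out = ecog.copy()
    for sl in grid_slices:                  # e.g. slice(0, 64), slice(64, 128)
        idx = [c for c in range(sl.start, sl.stop) if c not in bad_channels]
        out[:, sl] -= ecog[:, idx].mean(axis=1, keepdims=True)
    return out
```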
Contamination analysis: Assessed acoustic contamination using Roussel et al. method comparing neural and acoustic spectrogram correlations; computed contamination index vs permuted distributions. Identified one contaminated channel (46) on three days (D5, D6, D7); corrected by averaging neighboring channels (8-neighbor, excluding bad channel 38). After exclusion, statistical tests indicated no significant contamination (p>0.05).
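The sketch below captures the spirit of this check in Python rather than the referenced Matlab toolbox: correlate a neural channel's spectrogram with the simultaneously recorded audio spectrogram and compare against a permutation null. It assumes both spectrograms share frame times and bin counts, and it simplifies the published method.

```python
# Hedged, simplified sketch of an acoustic-contamination check.
import numpy as np

def contamination_pvalue(neural_spec, audio_spec, n_perm=1000, seed=0):
    """neural_spec, audio_spec: (frames, freq_bins) magnitude spectrograms."""
    rng = np.random.default_rng(seed)

    def mean_corr(a, b):
        # average per-frequency-bin Pearson correlation across bins
        return np.mean([np.corrcoef(a[:, k], b[:, k])[0, 1] for k in range(a.shape[1])])

    observed = mean_corr(neural_spec, audio_spec)
    null = np.array([
        mean_corr(neural_spec[rng.permutation(len(neural_spec))], audio_spec)
        for _ in range(n_perm)
    ])
    pval = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, pval
```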
Listening test: Forced-choice test (BeaqleJS framework) with 21 native English speakers (ages 18–30; 23.8% male, 61.9% female, 14.3% other/prefer not to answer). Participants listened to randomized samples from three closed-loop sessions, including both synthesized and original words, and selected one of the six keywords; they could replay samples before selecting. Two samples were excluded because the nVAD failed to detect speech; some falsely detected speech segments yielding synthesized silence were also excluded.
Statistical analysis: Original vs reconstructed speech spectrograms compared via Pearson correlation over 80 mel-scaled bins (STFT: 50 ms window, 10 ms frameshift, Hanning window; log Gaussianization). Contamination analysis via permutation tests and t-tests with Bonferroni correction (α=0.05/N). SciPy stats (v1.10.1) used; contamination analysis in Matlab (toolbox v12.4).
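A sketch of the spectral-similarity metric is given below, computing Pearson correlations between 80-bin mel spectrograms (50 ms Hann window, 10 ms frameshift); librosa is used here for convenience, and the exact post-processing ("log Gaussianization") is omitted.

```python
# Hedged sketch of the mel-spectrogram Pearson-correlation metric.
import numpy as np
import librosa
from scipy.stats import pearsonr

SR, N_MELS = 16000, 80
WIN, HOP = int(0.050 * SR), int(0.010 * SR)    # 50 ms window, 10 ms frameshift

def mel_spec(y: np.ndarray) -> np.ndarray:
    s = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=WIN, hop_length=HOP,
                                       win_length=WIN, window="hann", n_mels=N_MELS)
    return np.log(s + 1e-8)                     # (80, frames)

def spectral_correlation(original: np.ndarray, synthesized: np.ndarray) -> float:
    a, b = mel_spec(original), mel_spec(synthesized)
    n = min(a.shape[1], b.shape[1])             # trim to common length
    rs = [pearsonr(a[k, :n], b[k, :n])[0] for k in range(N_MELS)]
    return float(np.mean(rs))                   # average over mel bins
```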
Key Findings
• The closed-loop speech-synthesis BCI produced acoustic speech resembling the participant’s voice and pacing, reliably synthesizing isolated keywords from a 6-word vocabulary.
• Human listeners achieved 80% accuracy recognizing synthesized words (chance level 16.7%), indicating high intelligibility on a closed vocabulary; individual listeners' accuracy ranged from 75% to 84%.
• The participant’s natural spoken keywords were recognized at 99.8% accuracy by listeners.
• Confusion predominantly involved “Back” being mistaken for “Left,” potentially due to similar vowel formant structures and the participant’s weak tongue movements; “Up” and “Left” were well distinguished.
• In only ~4% of cases did the decoder synthesize an intelligible but incorrect word; when this occurred, listeners consistently recognized the (incorrect) synthesized word.
• Spectral comparison between original and synthesized speech yielded average correlation of 0.67 (±0.18 SD), suggesting preservation of phoneme/formant-specific information.
• Comparing voice-activity timing between spoken and synthesized output yielded a median Levenshtein distance of 235 ms, indicating an adequate match in timing.
• Three online sessions across three different days (each ~6 min), conducted ~5.5 months after training data collection, produced consistent performance with recalibration but no retraining.
• Saliency analysis revealed that high-gamma changes across motor, premotor, and somatosensory cortices predominantly influenced speech onset detection, with maximal relevance from ~300 to 100 ms prior to predicted speech onset; dorsal laryngeal area contributed only modestly.
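A generic gradient-based attribution sketch for the nVAD model (reusing the NeuralVAD sketch above) is shown below to make this kind of analysis concrete; the authors' specific saliency method may differ.

```python
# Hedged sketch of gradient-based saliency for the nVAD model.
import torch

def vad_saliency(model, x, onset_frame):
    """x: (1, frames, channels) high-gamma features; returns (frames, channels) relevance."""
    x = x.detach().clone().requires_grad_(True)
    logits, _ = model(x)                     # (1, frames, 2), as in the NeuralVAD sketch
    logits[0, onset_frame, 1].backward()     # gradient of the speech-class logit at onset
    return x.grad.abs().squeeze(0)           # |gradient| per time frame and channel
```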
Discussion
The study demonstrates feasibility of online synthesis of intelligible words from ECoG signals in a person with ALS using a closed-loop BCI composed of three RNNs. Deep learning techniques effectively reconstructed acoustic speech from high-gamma activity, achieving 80% intelligibility on a keyword reading task, and preserving individual voice characteristics. The vocabulary was chosen for intuitive command-and-control of a communication board, not for optimal discriminability, which likely contributed to confusions between acoustically similar words (e.g., “Back” vs “Left”). Saliency analysis linked model decisions to the spatiotemporal neural dynamics of speech production, showing reliance on motor, premotor, and somatosensory cortices with pre-onset planning activity.
ECoG offers potential advantages for long-term BCIs due to signal stability; the model trained months prior generalized across days with recalibration but no retraining, supporting stability of the neural-to-acoustic mapping. Delayed auditory feedback was intentionally used to avoid interference with overt speech and to provide clear output in the face of progressive speech deterioration.
While effective for a closed vocabulary, the current bidirectional decoding approach operated at the word level and did not synthesize unseen words or finer-grained units (syllables/phonemes). Nonetheless, prior work suggests that bidirectional architectures can generalize to unseen elements, motivating future work to expand both vocabulary size and unit granularity. The work provides a proof of concept that ECoG-based online speech synthesis can augment communication in ALS, encouraging continued research into preserving or restoring vocal communication in clinical populations.
Conclusion
A chronically implanted ECoG-based BCI enabled an individual with ALS to produce intelligible, voice-preserving synthetic speech in real time for a set of self-paced keywords, with 80% recognition accuracy by human listeners and stable performance across sessions months after training. The system’s neural voice activity detection and bidirectional decoding mapped high-gamma activity to LPC features and synthesized waveforms via LPCNet, capturing timing and spectral characteristics comparable to natural speech. Saliency analysis connected model behavior to established speech motor planning dynamics.
This proof-of-concept supports ECoG’s suitability for stable, speech-based BCIs and highlights the potential for augmentative communication that preserves personal voice identity. Future work should: (1) expand to larger vocabularies and phoneme/syllable-level units to enable synthesis of unseen words and continuous speech; (2) evaluate long-term robustness amid disease progression; (3) refine real-time feedback strategies; and (4) explore alignment strategies using voice banking or generic voices for users with absent or unintelligible speech.
Limitations
• Vocabulary restricted to six keywords; the bidirectional word-level decoder did not generalize to unseen words or finer units (syllables/phonemes).
• Delayed feedback was required due to bidirectional decoding, precluding simultaneous auditory feedback during speech, which may limit certain use cases.
• Approach was trained on overt speech, constraining applicability to individuals who can vocalize; performance with absent or highly unintelligible speech requires alignment strategies (e.g., voice banking) and remains to be validated.
• Confusions between acoustically similar words (e.g., “Back” vs “Left”) reflect limitations in discriminability compounded by the participant’s dysarthria.
• Long-term stability amid disease progression was not explicitly tested beyond a several-month interval; further longitudinal studies are needed.
• Potential acoustic contamination of ECoG was addressed by detection and correction of one channel across specific days, but contamination risks require vigilant monitoring in other settings.