A neural speech decoding framework leveraging deep learning and speech synthesis

Engineering and Technology

X. Chen, R. Wang, et al.

This study introduces a deep learning neural speech-decoding framework that pairs an ECoG decoder with a differentiable speech synthesizer and a speech-to-speech auto-encoder to produce interpretable speech parameters and natural-sounding speech. The approach is highly reproducible across 48 participants, yields high decoding correlations even with causal operations suitable for real-time prostheses, and works with either left- or right-hemisphere electrode coverage. This research was conducted by Xupeng Chen, Ran Wang, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Werner Doyle, Orrin Devinsky, Yao Wang, and Adeen Flinker.

Introduction
Speech loss due to neurological deficits severely impairs communication. Electrocorticography (ECoG) recorded during epilepsy monitoring offers high-resolution cortical signals paired with speech, enabling development of neural speech prostheses. Two core challenges hinder decoding: limited subject-specific neural-speech data for training deep models, and substantial intra-speaker variability in speech rate, intonation, and pitch that complicates model representations. Prior approaches span linear models, which offer interpretability but modest accuracy, and deep networks that map neural activity to diverse intermediate representations or directly to speech. However, many use non-causal architectures that rely on future neural context and auditory feedback signals, limiting real-time applicability. This study proposes a neural speech decoding framework that maps ECoG to a compact, interpretable acoustic parameter space (pitch, voicing, formants, loudness), followed by a differentiable speech synthesizer that generates spectrograms. A speech-to-speech auto-encoder pre-trained on each participant's speech provides guidance targets for the ECoG decoder, addressing data scarcity and enabling subject-specific, natural-sounding output. The research asks whether such a framework can deliver accurate, causal speech decoding across a large cohort, generalize to unseen words, function with low-density electrode grids, and decode from either hemisphere, thereby informing the feasibility of real-time neural prostheses.
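To make the two-stage design concrete, the sketch below wires a toy ECoG decoder to a placeholder synthesizer in PyTorch. The module names, layer sizes, and spectrogram resolution are illustrative assumptions, not the authors' released implementation; only the interface (ECoG in, 18 speech parameters out, a spectrogram from the synthesizer) follows the description above.

```python
import torch
import torch.nn as nn

N_ELECTRODES = 64   # e.g., an 8x8 clinical grid, flattened (assumed)
N_PARAMS = 18       # interpretable speech parameters per frame
N_FREQ = 256        # spectrogram frequency bins (assumed)

class ECoGDecoder(nn.Module):
    """Maps high-gamma ECoG features to frame-wise speech parameters."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_ELECTRODES, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, N_PARAMS, kernel_size=1),
            nn.Sigmoid(),   # normalized parameters, later mapped to physical ranges
        )

    def forward(self, ecog):            # ecog: (batch, electrodes, frames)
        return self.net(ecog)           # (batch, 18, frames)

class Synthesizer(nn.Module):
    """Placeholder for the differentiable synthesizer: parameters -> spectrogram."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv1d(N_PARAMS, N_FREQ, kernel_size=1)   # stand-in mapping

    def forward(self, params):
        return self.proj(params)        # (batch, freq_bins, frames)

decoder, synthesizer = ECoGDecoder(), Synthesizer()
ecog = torch.randn(2, N_ELECTRODES, 125)          # two trials, 1 s at 125 Hz
spectrogram = synthesizer(decoder(ecog))          # decoded spectrogram
```

The actual differentiable synthesizer, described in the Methodology below, replaces the placeholder projection with an explicit voiced/unvoiced vocoder model.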
Literature Review
Earlier neural speech decoding methods used linear models to reconstruct spectrograms from cortical signals, reaching Pearson correlations of around 0.6 or lower while favoring interpretability and requiring relatively little training data. Deep learning approaches introduced convolutional and recurrent networks, differing in their intermediate representations and synthesis quality. Articulatory-space decoders achieved robust decoding but produced less natural-sounding voices. Other methods achieved more natural audio using WaveNet vocoders, GANs, or unit selection, but with limited accuracy. A recent single-participant study used quantized HuBERT features and a pretrained synthesizer to achieve high performance and natural speech, but HuBERT features lack speaker-specific acoustics and therefore require additional speaker adaptation. Moreover, most prior work used non-causal architectures that leverage future neural context and auditory feedback, limiting real-time BCI feasibility. The present work addresses these gaps with a compact, interpretable acoustic intermediate representation, a differentiable synthesizer, and causal ECoG decoder architectures.
Methodology
Framework overview: The pipeline comprises (1) an ECoG decoder that maps cortical high-gamma features to 18 time-varying speech parameters (voice weight, loudness, pitch f0, six formant center frequencies and amplitudes, and the frequency, bandwidth, and amplitude of a broadband unvoiced filter), and (2) a differentiable speech synthesizer that converts these parameters into spectrograms. A separate speech-to-speech auto-encoder (a speech encoder plus the same synthesizer) is pre-trained per participant on speech alone to generate reference parameter trajectories that guide ECoG decoder training.

Participants and tasks: N = 48 native English-speaking neurosurgical epilepsy patients (32 with left-hemisphere and 16 with right-hemisphere coverage) provided synchronized ECoG and speech during five tasks designed to elicit the same 50 target words across modalities: auditory repetition (AR), auditory naming (AN), sentence completion (SC), visual word reading (WR), and picture naming (PN). Each participant produced 400 trials (50 words; varying repetitions per task). Models were trained per participant on 80% of trials and tested on the remaining 20% (50 trials). A stringent word-level 5-fold cross-validation held out 10 unique words per fold, with no overlap in words between training and validation for either auto-encoder pre-training or ECoG decoder training.

ECoG acquisition and preprocessing: ECoG was recorded at 2,048 Hz (downsampled to 512 Hz) with common-average referencing and exclusion of electrodes showing artifacts or epileptiform activity. High-gamma (70–150 Hz) envelopes were extracted with the Hilbert transform and downsampled to 125 Hz. Per-electrode z-scoring used silent pre-stimulus reference segments, and spectral-gating speech denoising was applied to noisy audio. Grids: 43 participants had low-density (LD) 8×8 macro-electrode grids (1 cm spacing); five had hybrid-density (HB) arrays with interleaved micro contacts (5 mm micro/macro spacing) in addition to the macro grid.

Speech synthesizer: Inspired by classical vocoders and implemented differentiably, the synthesizer mixes a voiced component (harmonic excitation at f0 passed through a sum of six time-varying formant filters, each specified by a center frequency and amplitude, with bandwidth governed by a learned global linear relation to frequency) and an unvoiced component (stationary white noise filtered by a broadband filter plus the same set of formant filters, capturing consonant transitions). The mixture is weighted by the voice weight and scaled by loudness, and participant-specific stationary background noise is added. Speaker-dependent prototype filters (piecewise-linear, unimodal) and background spectral profiles are learned (K = 256 frequency bins for female participants; 512 for male participants). The parameterization is compact (18 parameters per frame) and disentangles pitch from vocal-tract filtering.

Speech encoder: A lightweight temporal-convolution and channel-MLP network extracts the 18 parameters from linear and Mel spectrograms. For frequency-like outputs (f0, formant frequencies, broadband center frequency), sigmoid activations map values to [0, 1], which are then de-normalized to physiologically plausible ranges taken from the literature. The encoder and synthesizer are pre-trained per participant via semi-supervised auto-encoding on the participant's own speech, using a multi-scale spectral loss (L1 on linear and Mel spectrograms), a differentiable STOI+ intelligibility loss, and supervised losses on f0 and the first four formant frequencies estimated with Praat.
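The voiced/unvoiced mixing at the heart of the synthesizer can be sketched in the spectral domain as below. This is a simplified, hypothetical re-implementation: the Gaussian filter shapes (stand-ins for the learned piecewise-linear prototype filters), the harmonic width, and the bandwidth-frequency slope are assumptions for illustration. Only the flow mirrors the description above: harmonic excitation at f0 through six formant filters, noise through a broadband filter plus the same formants, mixed by the voice weight and scaled by loudness, driven by 18 parameters per frame.

```python
import torch

def gaussian_filters(freqs, center, bandwidth, amplitude):
    """Unimodal band-filter magnitude responses (Gaussian stand-ins).
    freqs: (F,) Hz grid; center/bandwidth/amplitude: (T, K); returns (T, K, F)."""
    f = freqs.view(1, 1, -1)
    c, b, a = center.unsqueeze(-1), bandwidth.unsqueeze(-1), amplitude.unsqueeze(-1)
    return a * torch.exp(-0.5 * ((f - c) / (b + 1e-6)) ** 2)

def synthesize_spectrogram(p, freqs):
    """p: dict of per-frame parameter trajectories (hypothetical field names),
    18 values per frame in total; freqs: (F,) grid in Hz. Returns (T, F)."""
    # Voiced excitation: comb spectrum with peaks at the first 40 harmonics of f0.
    harm = p["f0"].unsqueeze(-1) * torch.arange(1, 41).float()             # (T, 40)
    excitation = torch.exp(
        -0.5 * ((freqs.view(1, 1, -1) - harm.unsqueeze(-1)) / 30.0) ** 2   # 30 Hz width (assumed)
    ).sum(dim=1)                                                            # (T, F)

    # Six formant filters; bandwidth tied to center frequency by a linear rule (assumed slope).
    bw = 0.06 * p["formant_freq"] + 50.0
    formants = gaussian_filters(freqs, p["formant_freq"], bw, p["formant_amp"]).sum(dim=1)
    voiced = excitation * formants                                          # (T, F)

    # Unvoiced path: flat noise spectrum through a broadband filter plus the formant filters.
    broadband = gaussian_filters(
        freqs, p["noise_freq"].unsqueeze(-1), p["noise_bw"].unsqueeze(-1),
        p["noise_amp"].unsqueeze(-1),
    ).squeeze(1)
    unvoiced = broadband + formants

    # Mix by voice weight, scale by loudness (the learned background-noise term is omitted here).
    alpha = p["voice_weight"].unsqueeze(-1)
    return p["loudness"].unsqueeze(-1) * (alpha * voiced + (1.0 - alpha) * unvoiced)

# Example: 100 frames of a steady vowel-like configuration.
T, freqs = 100, torch.linspace(0.0, 8000.0, 256)
p = {
    "f0": torch.full((T,), 120.0),                  # pitch (1 parameter)
    "formant_freq": torch.tensor([730., 1090., 2440., 3400., 4500., 5500.]).repeat(T, 1),
    "formant_amp": torch.ones(T, 6),                # formant frequencies + amplitudes (6 + 6)
    "noise_freq": torch.full((T,), 4000.0),
    "noise_bw": torch.full((T,), 2000.0),
    "noise_amp": torch.full((T,), 0.1),             # broadband unvoiced filter (3)
    "voice_weight": torch.full((T,), 0.9),          # mostly voiced, vowel-like (1)
    "loudness": torch.ones(T),                      # loudness (1)
}
spectrogram = synthesize_spectrogram(p, freqs)      # (100, 256)
```

Because every operation is differentiable, spectral losses computed on the output can propagate gradients back to the 18 parameters and, in turn, to whichever encoder or decoder produced them.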
ECoG decoders and causality: Three interchangeable architectures were implemented, each with non-causal and causal variants: (a) a 3D ResNet that processes spatiotemporal ECoG tensors with initial temporal convolutions followed by 3D residual blocks, down/up-sampling, and parameter heads; (b) a 3D Swin Transformer with shifted 3D-window self-attention across temporal and spatial dimensions and upsampling via transposed temporal convolutions; and (c) a 3-layer LSTM with a channel MLP. Causal models restrict temporal convolutions or attention to past and current frames (and use a unidirectional LSTM), whereas non-causal models use bidirectional or future-context operations.

Training objectives and optimization: ECoG decoder training combines (1) a spectral reconstruction loss (a modified multi-scale spectral loss), (2) an intelligibility loss (the negative of STOI+), (3) a supervised loss on f0 and the first four formants against Praat estimates, and (4) a reference loss against the auto-encoder's parameter trajectories with parameter-specific weights (e.g., higher weights for formant amplitudes and the broadband frequency). The overall decoder loss is a weighted sum with empirically tuned coefficients (λ1 = 1.2 for STOI+, λ2 = 0.1 for supervision, λ3 = 1 for the reference loss). The Adam optimizer was used with a learning rate of 1e-3 (β1 = 0.9, β2 = 0.999), and models were trained separately per participant.

Evaluation: The primary metric is the Pearson correlation coefficient (PCC) between decoded and ground-truth spectrograms; STOI+ is also reported. A strict word-level cross-validation evaluates generalization to unseen words. An occlusion-based contribution analysis assesses the effect of zeroing individual electrode signals on decoding PCC, projecting contributions to MNI space with density normalization and estimating noise levels by training on shuffled neural-speech pairs.
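Two pieces of this setup lend themselves to a short sketch: the causal restriction on temporal operations and the weighted loss combination. The convolution wrapper below left-pads so that the output at frame t sees only frames up to t; the loss function uses the coefficient values quoted above but collapses the multi-scale spectral loss, the parameter-specific reference weights, and the STOI+ term into simple stand-ins (stoi_plus is an assumed differentiable implementation, not defined here). Both are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalConv(nn.Module):
    """1-D convolution over ECoG frames. causal=True pads on the left only,
    so the output at frame t depends on frames <= t; causal=False pads
    symmetrically and therefore uses future context."""
    def __init__(self, in_ch, out_ch, kernel_size=5, causal=True):
        super().__init__()
        self.kernel_size, self.causal = kernel_size, causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                                    # x: (batch, channels, frames)
        if self.causal:
            x = F.pad(x, (self.kernel_size - 1, 0))          # past/current frames only
        else:
            half = (self.kernel_size - 1) // 2
            x = F.pad(x, (half, self.kernel_size - 1 - half))
        return self.conv(x)

def decoder_loss(pred_spec, true_spec, pred_f0_formants, praat_targets,
                 pred_params, ref_params, stoi_plus,
                 lam_stoi=1.2, lam_sup=0.1, lam_ref=1.0):
    """Weighted sum of the four training terms (coefficients from the text)."""
    spectral = F.l1_loss(pred_spec, true_spec)               # multi-scale in the paper
    intelligibility = -stoi_plus(pred_spec, true_spec)       # maximize STOI+
    supervision = F.l1_loss(pred_f0_formants, praat_targets)
    reference = F.l1_loss(pred_params, ref_params)           # guidance from the auto-encoder
    return spectral + lam_stoi * intelligibility + lam_sup * supervision + lam_ref * reference
```

In the causal variant, the output at frame t can be produced as soon as frame t arrives, which is the property that makes streaming, real-time synthesis possible.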
Key Findings
- Across N = 48 participants, convolutional (3D ResNet) and transformer (3D Swin) decoders achieved high decoding accuracy with both non-causal and causal operations. Mean PCC between original and decoded spectrograms: ResNet 0.806 (non-causal) vs 0.797 (causal), not significantly different (Wilcoxon signed-rank P = 0.093); Swin 0.792 (non-causal) vs 0.798 (causal), not significantly different (P = 0.196). The LSTM underperformed and showed a significant drop when causal (0.745 non-causal vs 0.712 causal; P = 0.009).
- Performance on stringent word-level held-out sets (10 unique words excluded from any training or pre-training per fold) was comparable to trial-based validation, indicating generalization to unseen words.
- Decoded speech parameters closely matched reference trajectories across participants: mean PCCs of approximately 0.781 (voice weight), 0.571 (loudness), 0.889 (pitch f0), 0.812 (formant f1), and 0.883 (formant f2). Accurate recovery of pitch and the lower formants underpins natural, speaker-specific synthesis.
- Right-hemisphere decoding was robust and comparable to left-hemisphere decoding. With causal models, right-hemisphere ResNet PCC was approximately 0.790, similar to the left-hemisphere value, with no significant difference (ResNet t-test P = 0.623; Swin P = 0.968). STOI+ comparisons were likewise not significantly different.
- Electrode density: hybrid-density (HB) and low-density (LD) grids yielded similar PCC performance for both ResNet and Swin. Within HB participants, using all electrodes versus only the LD subset showed no significant PCC difference for 4 of 5 participants (Wilcoxon P = 0.114, 0.003, 0.0773, 0.472, 0.605), suggesting effective decoding with lower spatial sampling.
- Contribution analysis: non-causal models showed elevated dependence on the superior temporal gyrus (auditory cortex), consistent with use of feedback signals; causal models reduced STG reliance and emphasized sensorimotor regions (including ventral sensorimotor cortex), aligning with real-time feasibility (a sketch of the occlusion procedure follows this list).
- Latency: the pipeline can synthesize speech from neural signals with sub-50 ms delays (reported in the supplementary materials), compatible with real-time BCI constraints.
- Open-source availability: the complete decoding framework and code are publicly available, supporting reproducibility and cross-group benchmarking.
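As referenced in the contribution-analysis finding above, a minimal version of the occlusion procedure can be written as follows. It assumes a trained decode_fn (decoder plus synthesizer, a hypothetical stand-in) and flattens spectrograms before computing the Pearson correlation, whereas the paper additionally projects contributions to MNI space with density normalization and estimates noise floors from models trained on shuffled neural-speech pairs.

```python
import numpy as np
from scipy.stats import pearsonr

def spectrogram_pcc(decoded, target):
    """Pearson correlation between decoded and ground-truth spectrograms (flattened)."""
    r, _ = pearsonr(decoded.ravel(), target.ravel())
    return r

def electrode_contributions(decode_fn, ecog, target_spec):
    """Occlusion analysis: zero one electrode at a time and record the PCC drop.
    decode_fn maps an ECoG array of shape (electrodes, frames) to a decoded
    spectrogram; the contribution of electrode e is the baseline PCC minus the
    PCC obtained with that electrode's signal set to zero."""
    baseline = spectrogram_pcc(decode_fn(ecog), target_spec)
    contributions = np.zeros(ecog.shape[0])
    for e in range(ecog.shape[0]):
        occluded = ecog.copy()
        occluded[e] = 0.0                 # silence this electrode
        contributions[e] = baseline - spectrogram_pcc(decode_fn(occluded), target_spec)
    return contributions
```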
Discussion
The study demonstrates that mapping ECoG to a compact, interpretable acoustic parameter space, coupled with a differentiable synthesizer and guidance from a pre-trained speech encoder, enables accurate, natural-sounding speech decoding across a large cohort. Crucially, causal versions of convolutional and transformer decoders perform on par with non-causal counterparts, addressing a central requirement for real-time neural prostheses and reducing reliance on auditory feedback pathways. Robust decoding with low-density clinical grids suggests broader clinical practicality without requiring high-density arrays. The comparable right-hemisphere decoding performance indicates that speech-relevant information sufficient for decoding exists bilaterally, motivating exploration of right-hemisphere implants for patients with left-hemisphere damage. The interpretable acoustic intermediate representation allows subject-specific reconstruction and offers advantages over direct spectrogram mapping or generic learned features (e.g., HuBERT) by preserving speaker identity and enabling effective training with limited paired data through parameter guidance. The occlusion analysis clarifies cortical contributions, with causal models emphasizing sensorimotor regions essential for feedforward speech production. Collectively, these findings address the challenges of data scarcity, real-time causality, and clinical deployment constraints, advancing the feasibility of neural speech BCIs.
Conclusion
This work introduces a neural speech decoding framework that leverages an interpretable acoustic intermediate representation and a differentiable synthesizer to produce natural, subject-specific speech from ECoG signals. It provides strong evidence that causal convolutional and transformer decoders can achieve high decoding accuracy comparable to non-causal models, a critical step toward real-time BCIs. The approach generalizes to unseen words, performs well with low-density electrode grids, and decodes robustly from either hemisphere, suggesting broader clinical applicability, including potential right-hemisphere targets in patients with left-hemisphere damage. The publicly released codebase enables reproducible research and cross-study comparisons. Future directions include adapting training to imagined or mimed speech paired with prior or proxy audio to address the lack of overt speech; extending decoders to non-grid electrode configurations (e.g., strips, sEEG) with appropriate architectures; scaling from word-level to continuous sentence-level decoding; and validating right-hemisphere decoding in patients with left-hemisphere lesions.
Limitations
- Requirement for paired ECoG and speech during training limits applicability to patients without overt speech; training with imagined/mimed speech and proxy or historical audio will require strategies to handle temporal misalignment.
- Current ECoG decoders assume grid-based spatial sampling; generalization to strips and depth electrodes (sEEG) will need new architectures.
- Experiments focused on word-level decoding of a 50-word vocabulary; performance may differ for continuous, large-vocabulary speech.
- Right-hemisphere feasibility, while promising in neurologically intact hemispheres, remains to be validated in patients with left-hemisphere damage.
- Although sub-50 ms delays were demonstrated in analysis, full real-time closed-loop system validation was not the focus of this study.