Computer Science

Decoding speech perception from non-invasive brain recordings

A. Défossez, C. Caucheteux, et al.

Using contrastive learning on non-invasive MEG and EEG recordings from 175 volunteers, this study decodes perceived speech representations—identifying, from 3 seconds of MEG, the correct speech segment among 1,000+ candidates with up to 41% average accuracy and 80% for top participants. The research was conducted by Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli and Jean-Rémi King.

Introduction
The study asks whether natural speech perception can be decoded from non-invasive brain recordings (MEG/EEG) using modern deep learning. Current successes in speech brain-computer interfaces largely rely on invasive recordings that require brain surgery. Non-invasive MEG and EEG are safer and potentially wearable, but the signals are notoriously noisy and variable across sessions and individuals, and prior decoders have typically targeted limited, interpretable features (letters, phonemes, small word sets) with models trained per participant. The authors propose a single architecture trained across a large cohort to align brain activity with deep, self-supervised speech representations, enabling zero-shot identification of words and phrases not seen during training. This approach aims to overcome noise and inter-individual variability and to sidestep assumptions about the exact format of language representations in the brain.
Literature Review
Prior work with invasive electrophysiology has decoded phonemes, speech sounds, articulatory movements, letters and words with impressive rates (for example, 90 characters/min with ~94% accuracy; ~15 words/min with ~74% accuracy; spelling via code words at ~29 characters/min). Non-invasive approaches using MEG/EEG have typically focused on hand-crafted features and limited classification tasks (small sets of words, parts of speech, sublexical classes), trained per participant. Self-supervised speech models such as wav2vec 2.0 have been shown to capture linguistic features and linearly relate to brain activity. Despite these advances, robust decoding of natural continuous speech from non-invasive recordings at single-trial resolution and large vocabularies has remained challenging. The present work builds on these insights, leveraging contrastive learning to align brain signals with deep speech representations and pooling data across many participants to improve generalization.
Methodology
Problem formalization: The task is to decode speech from time series of high-dimensional MEG/EEG signals recorded while participants listen to spoken sentences in their native language. Rather than regressing Mel spectrograms directly (which suffers from broadband dominance and scale issues), the model is trained with a contrastive objective to align brain-derived latent representations Z with speech-derived latent representations Y.
Model overview: A two-module architecture is used. The speech module extracts deep contextual speech representations from 3 s audio segments; the brain module maps 3 s MEG/EEG segments to latent representations that are aligned with the speech module's output through a CLIP-style contrastive loss.
Contrastive loss (CLIP-style): For each brain segment X, the model computes Z = f_clip(X). Given the positive speech representation Y and N−1 negatives Ŷ_j sampled from other segments, the probability that each candidate corresponds to X is estimated via inner products with Z followed by a softmax over the candidates. Training minimizes the cross-entropy, encouraging Z to align with Y while remaining discriminative against the negatives (a minimal sketch of this objective is given at the end of this section).
Brain module architecture: The input is the raw MEG/EEG time series X together with a one-hot participant index s. The pipeline comprises (1) a spatial attention layer that remaps the inputs to D1 = 270 channels, using sensor locations projected to 2D and a learned Fourier-parameterized attention over [0,1]^2, with spatial dropout; (2) a 1×1 convolution with no activation; (3) a participant-specific 1×1 convolution (the "participant layer") to leverage inter-individual variability; and (4) a stack of five residual dilated convolutional blocks. In each block, the first two 1D convolutions (kernel 3, stride 1) use residual skips, batch normalization and GELU activations, with dilation increasing cyclically to enlarge the receptive field; the third convolution uses a GLU activation, doubling then halving the channel count. Two final 1×1 convolutions produce F output channels matching the dimensionality of the speech representation. To account for neural response latency, brain inputs are shifted by 150 ms (a simplified architectural sketch appears after this section).
Speech module: Two options were evaluated. Deep Mel is an end-to-end learned latent representation of Mel spectrograms using a similar CNN architecture (without the participant block), trained jointly with the brain module. Wav2vec 2.0 is a pretrained self-supervised speech representation (wav2vec2-large-xlsr-53), using the average of the activations of its last four transformer layers. The pretrained wav2vec 2.0 variant was superior and is used for the main results.
Datasets: Four public datasets were curated and integrated, totalling 175 participants. MEG: Schoffelen 2019 (Dutch; 273 sensors; 96 participants; 80.9 h), with sentences and word lists, and Gwilliams 2022 (English; 208 sensors; 27 participants; 56.2 h), with four fictional stories. EEG: Broderick 2019 (English; 128 sensors; 19 participants; 19.2 h), with extracts from The Old Man and the Sea, and Brennan and Hale 2019 (English; 60 sensors; 33 participants; 6.7 h), with Alice in Wonderland. Splits were defined so that identical 3 s segments are assigned consistently across participants, with no identical sentences shared across splits.
Preprocessing: Brain signals were resampled to 120 Hz, baseline-corrected (subtracting the mean of the first 0.5 s), robust-scaled, normalized, and clamped at 20 standard deviations (a step shown to be essential for training). Only minimal preprocessing was performed; further cleaning (e.g., autoreject) was not necessary for this task. Mel spectrograms used 120 Mel bands with an STFT frame size of 512 and a hop of 128 at 16 kHz, followed by log compression log(ε + Mel) with ε = 1e−5.
The wav2vec 2.0 speech features were likewise normalized.
Training: Adam optimizer with a learning rate of 3e−4, a batch size of 256, and 1,200 updates per epoch. Early stopping was applied after 10 epochs without validation improvement, keeping the best model. The direct Mel-regression baseline used an MSE loss. Training ran on two NVIDIA V100 GPUs (16 GB).
Evaluation: Segment-level top-10 accuracy is reported (preferred over top-1 given the large candidate sets); a sketch of this retrieval metric is given below. Word-level evaluation aggregates probabilities over the segments containing the same word. Decoding is zero-shot in the sense that the predicted audio snippets need not appear in the training set.
Prediction analysis: Ridge regression assessed how well the decoder's probabilities for the true words can be predicted from linguistic features: phonemes, word frequency (Zipf scale), part-of-speech, word embeddings (spaCy, 300-d), bag-of-words (sum of word embeddings across the segment), and phrase embeddings (LASER, 1024-d). Five-fold cross-validation quantified Pearson's R (see the cross-validation sketch at the end of this section).
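To make the contrastive objective concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes the brain latents Z and speech latents Y have already been reduced to one fixed-size vector per 3 s segment (the paper's latents are time-resolved), and the function name clip_contrastive_loss is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """z: brain latents, shape (N, F); y: speech latents, shape (N, F).
    Row i of y is the positive for row i of z; the other N-1 rows serve as negatives."""
    logits = z @ y.t()                                   # inner products between every brain/speech pair
    targets = torch.arange(z.size(0), device=z.device)   # index of the matching speech segment
    return F.cross_entropy(logits, targets)              # softmax over candidates + cross-entropy
```

For example, `clip_contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024))` evaluates the loss on a batch of eight paired segments with 1,024-dimensional latents.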
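The brain module's overall structure can likewise be sketched in PyTorch. This is a simplified illustration under stated assumptions, not the authors' code: the Fourier-parameterized spatial attention is replaced by a plain learned 1×1 remapping, the hidden width, dilation schedule and head activation are illustrative choices, and the names BrainModule and ResidualDilatedBlock are invented for this sketch.

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """Two dilated 1D convolutions with residual skips, batch norm and GELU,
    followed by a 1x1 convolution with a GLU (channels doubled, then halved)."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # keeps the sequence length with kernel size 3
        self.conv1 = nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation)
        self.bn1, self.bn2 = nn.BatchNorm1d(channels), nn.BatchNorm1d(channels)
        self.conv3 = nn.Conv1d(channels, 2 * channels, 1)
        self.act, self.glu = nn.GELU(), nn.GLU(dim=1)

    def forward(self, x):
        x = x + self.act(self.bn1(self.conv1(x)))  # residual skip 1
        x = x + self.act(self.bn2(self.conv2(x)))  # residual skip 2
        return self.glu(self.conv3(x))

class BrainModule(nn.Module):
    def __init__(self, n_sensors: int, n_participants: int,
                 d1: int = 270, d2: int = 320, f_out: int = 1024, n_blocks: int = 5):
        super().__init__()
        self.spatial = nn.Conv1d(n_sensors, d1, 1)    # stand-in for the spatial attention layer
        self.initial = nn.Conv1d(d1, d2, 1)           # 1x1 convolution, no activation
        # Participant layer: one learned 1x1 weight matrix per participant
        self.participant = nn.Parameter(torch.eye(d2).repeat(n_participants, 1, 1))
        dilations = [2 ** (i % 5) for i in range(n_blocks)]  # cyclically increasing dilation
        self.blocks = nn.Sequential(*[ResidualDilatedBlock(d2, d) for d in dilations])
        self.head = nn.Sequential(nn.Conv1d(d2, 2 * d2, 1), nn.GELU(),
                                  nn.Conv1d(2 * d2, f_out, 1))  # two final 1x1 convolutions

    def forward(self, x: torch.Tensor, participant_idx: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_sensors, time), assumed already shifted by 150 ms; participant_idx: (batch,)
        h = self.initial(self.spatial(x))
        w = self.participant[participant_idx]           # (batch, d2, d2)
        h = torch.einsum('bij,bjt->bit', w, h)          # participant-specific 1x1 convolution
        return self.head(self.blocks(h))                # (batch, f_out, time)
```

A forward pass such as `BrainModule(n_sensors=208, n_participants=27)(torch.randn(2, 208, 360), torch.tensor([0, 3]))` returns latents of shape (2, 1024, 360), i.e. 3 s at 120 Hz.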
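On the speech side, the wav2vec 2.0 targets can be obtained with the Hugging Face transformers library; the sketch below assumes the public facebook/wav2vec2-large-xlsr-53 checkpoint corresponds to the model named above, and the simple per-segment normalization is an assumption.

```python
import torch
from transformers import Wav2Vec2Model

# Pretrained self-supervised speech model (no fine-tuning, used frozen here)
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

def speech_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: mono 3 s audio segment sampled at 16 kHz, shape (n_samples,)."""
    # wav2vec 2.0 expects roughly zero-mean, unit-variance input (assumed normalization)
    x = (waveform_16k - waveform_16k.mean()) / (waveform_16k.std() + 1e-7)
    with torch.no_grad():
        out = model(x.unsqueeze(0), output_hidden_states=True)
    # Average the activations of the last four transformer layers, as described above
    return torch.stack(out.hidden_states[-4:]).mean(dim=0).squeeze(0)  # (n_frames, 1024)
```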
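The brain-signal preprocessing is simple enough to sketch directly. The snippet below assumes a NumPy array of shape (channels, time) that has already been resampled to 120 Hz; the baseline window and the clamping threshold follow the description above, while the exact robust-scaling formula (per-channel median/IQR) is an assumption.

```python
import numpy as np

def preprocess(raw: np.ndarray, sfreq: float = 120.0, clamp: float = 20.0) -> np.ndarray:
    """raw: (channels, time) MEG/EEG segment, already resampled to 120 Hz."""
    # Baseline correction: subtract each channel's mean over the first 0.5 s
    n_baseline = int(0.5 * sfreq)
    x = raw - raw[:, :n_baseline].mean(axis=1, keepdims=True)
    # Robust scaling (assumed variant): centre on the median, divide by the interquartile range
    med = np.median(x, axis=1, keepdims=True)
    q25, q75 = np.percentile(x, [25, 75], axis=1, keepdims=True)
    x = (x - med) / (q75 - q25 + 1e-8)
    # Clamp outliers, standing in for the ">20 s.d." rule reported as essential for training
    return np.clip(x, -clamp, clamp)
```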
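For evaluation, a possible implementation of segment-level top-10 retrieval accuracy, under the same pooled-latent simplification as above, is:

```python
import torch

def top_k_accuracy(z: torch.Tensor, y_candidates: torch.Tensor, k: int = 10) -> float:
    """z: brain latents for N test segments (N, F); y_candidates: speech latents of the
    N candidate segments (N, F), with the true candidate for row i stored at row i."""
    scores = z @ y_candidates.t()                                  # similarity of each brain segment to every candidate
    topk = scores.topk(k, dim=1).indices                           # indices of the k best-scoring candidates
    targets = torch.arange(z.size(0), device=z.device).unsqueeze(1)  # (N, 1) true candidate indices
    return (topk == targets).any(dim=1).float().mean().item()
```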
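Finally, the prediction analysis can be sketched with scikit-learn. The feature matrix (e.g., concatenated part-of-speech indicators, word embeddings and phrase embeddings per word) is assumed to be precomputed, and the function name and ridge penalty grid are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def feature_prediction_score(features: np.ndarray, probs: np.ndarray) -> float:
    """features: (n_words, n_features); probs: decoder probability assigned to each true word."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-3, 3, 7))      # ridge regression with a small penalty grid
        model.fit(features[train_idx], probs[train_idx])
        pred = model.predict(features[test_idx])
        scores.append(pearsonr(pred, probs[test_idx])[0])  # Pearson's R on the held-out fold
    return float(np.mean(scores))                          # average R across the five folds
```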
Key Findings
- Segment-level decoding from MEG: top-10 accuracy averaged up to 70.7% (Gwilliams MEG) and 67.5% (Schoffelen MEG); top-1 accuracy reached up to 41.3% (Gwilliams) and 36.8% (Schoffelen). In the best MEG participants, top-1 accuracy exceeded 80%.
- EEG performance: lower but above chance, with top-10 accuracy of 25.7% (Brennan EEG) and 17.7% (Broderick EEG) and top-1 accuracy of roughly 5% on both.
- Contrastive learning and deep speech representations are crucial: compared with a baseline regressing Mel spectrograms, adding the contrastive loss and targeting deep representations markedly increased top-10 accuracy (e.g., Gwilliams MEG: Base 12.4% → +Contrastive 55.1% → +Deep Mel 64.4% → +wav2vec 2.0 70.7%). Similar gains were observed across all datasets.
- MEG vs EEG: even after subsampling to homogenize recording time, channel counts, and number of participants, MEG decoding remained substantially superior to EEG (Mann–Whitney across participants: all P < 1e−6).
- Multi-participant training benefits: performance steadily increased with the number of participants included in training (MEG datasets), indicating that the model learns shared neural representations alongside participant-specific adaptation.
- Ablations of the brain module: removing the skip connections, the spatial attention layer, the initial/final 1×1 convolutions, or the participant-specific layer degraded performance. Clamping was essential; removing it collapsed accuracy (e.g., Schoffelen MEG top-10 accuracy of roughly 1.5%).
- The decoder relies on high-level semantics: linear prediction of the decoder's probabilities showed significant contributions from part-of-speech (P < 0.004), word embeddings (P < 1e−8), bag-of-words (P < 1e−23), and phrase embeddings (P < 1e−23). Higher-level features explained more variance, suggesting reliance on lexical and contextual semantic representations.
- Zero-shot capability: the model accurately distinguished several hundred words, including words and phrases absent from the training set, owing to the alignment with self-supervised speech representations.
Discussion
The findings demonstrate that contrastively aligning non-invasive MEG/EEG with deep, self-supervised speech representations enables effective decoding of natural speech perception at single-trial resolution. MEG affords substantially higher accuracy than EEG, reflecting device sensitivity rather than dataset-specific factors. The architecture’s participant-specific layer, spatial attention, residual dilated convolutions, and minimal preprocessing collectively contribute to robust decoding across heterogeneous cohorts and stimuli. Analyses indicate that the decoder’s predictions depend primarily on high-level lexical and contextual semantics, consistent with the representational content captured by modern speech models. Compared to prior non-invasive work that focused on small vocabularies or hand-crafted audio features, this approach scales to large candidate sets (over 1,000 segments) and supports zero-shot decoding of words/phrases unseen during training. While performance is modest relative to clinical invasive BCIs, results for restricted candidate sets approach those reported with intracranial recordings, underscoring the promise of non-invasive methods for future communication decoding.
Conclusion
This work introduces a non-invasive brain-to-speech decoding framework that leverages contrastive learning to align MEG/EEG with pretrained speech representations (wav2vec 2.0), trained across a large cohort with participant-specific adaptation. The model achieves strong segment-level decoding from MEG and meaningful performance from EEG, scales across datasets, and relies on high-level semantic features, enabling zero-shot identification of unseen words and phrases. Future directions include: (1) adapting the framework to speech production by integrating pretrained models of motor or production tasks; (2) augmenting contrastive decoding with generative modules to estimate phonemes/words/sentences without predetermined candidate sets; (3) improving clinical applicability via advances in MEG hardware (e.g., room-temperature sensors) and robust AI-based noise mitigation; and (4) expanding datasets and further investigating the structure of high-level language representations in brain activity.
Limitations
The approach focuses on speech perception rather than production, limiting direct clinical applicability for communication restoration. EEG decoding performance is substantially lower than MEG, and the device superiority was inferred without presenting identical stimuli across MEG and EEG participants. The contrastive objective identifies the most likely candidate from a predefined set, necessitating future generative modeling for open-vocabulary decoding. Non-invasive signals remain noisy and variable across individuals and sessions; although mitigated by architecture and minimal preprocessing, generalization to clinical environments may be challenging. Evaluation used 3 s windows and specific alignment assumptions (150 ms offset), and performance—though promising—is still modest for unrestricted vocabularies.