Linguistics and Languages
Decoding speech perception from non-invasive brain recordings
A. Défossez, C. Caucheteux, et al.
Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King present a contrastive learning model that decodes speech perception from non-invasive MEG and EEG recordings. With average top-1 segment accuracy of up to 41% on MEG, and above 80% for the best participants, the study offers a scalable, non-invasive route to studying language processing in the brain.
~3 min • Beginner • English
Introduction
The study addresses the challenge of decoding natural speech from non-invasive brain recordings, a longstanding goal for communication restoration in patients with speech impairments due to traumatic brain injury, stroke, or neurodegenerative disease. While invasive brain–computer interfaces (BCIs) have achieved high accuracies in decoding characters, words, and speech features from intracranial recordings, they require neurosurgery and are difficult to maintain chronically. Non-invasive modalities like MEG and EEG are safer and potentially wearable but produce noisy, variable signals and have typically relied on hand-crafted features and single-participant decoders targeting limited sets of interpretable outputs (e.g., Mel spectrograms, letters, small vocabularies). This work proposes a scalable alternative: decode perceived speech from MEG/EEG by training a single architecture across many participants to align brain activity with deep, self-supervised speech representations. The central research question is whether contrastive learning with pretrained speech representations can enable accurate, zero-shot identification of heard speech segments from brief windows of non-invasive recordings and what linguistic representations underlie this decoding.
Literature Review
Prior work with invasive recordings has demonstrated decoding of phonemes, speech sounds, articulatory movements, and even continuous text, with high accuracies and speeds (e.g., 90 characters/min at 94% accuracy; 15.2 words/min at 74.4% accuracy). However, these approaches require surgical implantation. Non-invasive MEG/EEG studies have typically targeted limited feature spaces (phonemes, syllables, tones) or small vocabularies, using hand-engineered pipelines and single-subject training; they achieve good performance on constrained tasks but do not scale to natural language. Regression onto Mel spectrograms is common but may overemphasize broadband components and treats all frequencies equally. Recent self-supervised speech models (e.g., wav2vec 2.0) learn representations that map linearly to cortical responses and encode diverse linguistic features, suggesting a route to better alignment with brain activity. The field also lacks standardized open datasets and reproducible code, hindering direct comparisons; in any case, vocabulary sizes in prior non-invasive decoding have generally been small compared with the large, partially non-overlapping vocabularies used here.
Methodology
Problem formulation: Decode perceived speech from 3 s segments of non-invasive MEG/EEG (C channels × T time steps) by learning a function that maps brain signals X to latent speech representations Y (F features × T time steps), matching the sample rate. Rather than direct regression, the model uses a contrastive objective to align brain-derived representations Z with target speech representations.
Model overview: A two-module architecture aligns a brain module with a speech module using a CLIP-style contrastive loss. Given brain segment X and its corresponding speech representation Y (positive), and N−1 negatives Y_j sampled from the dataset/batch, the model produces Z = f_brain(X). Probabilities over candidates are computed using inner products and a softmax; training minimizes cross-entropy so that the positive sample scores highest. At test time, all test segments serve as negatives.
Contrastive loss: The CLIP loss encourages discriminative alignment between Z and Y across a batch, naturally focusing on informative speech features and appropriate scaling, in contrast to L2 regression which equally weights all Mel dimensions and can be dominated by broadband components.
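A minimal sketch of such a CLIP-style objective in PyTorch (the tensor shapes, time pooling, and normalization here are illustrative assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Contrastive alignment of brain-derived (z) and speech (y) representations.

    z, y: (batch, features, time) latents at the same sample rate. Each segment
    in the batch is the positive for its own row and a negative for every other row.
    """
    # Pool over time and normalize so inner products are comparable in scale.
    z = F.normalize(z.mean(dim=-1), dim=-1)   # (batch, features)
    y = F.normalize(y.mean(dim=-1), dim=-1)   # (batch, features)

    logits = z @ y.t()                        # (batch, batch) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)

    # Cross-entropy pushes the matching (diagonal) pair above all in-batch negatives.
    return F.cross_entropy(logits, targets)
```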
Brain module: Inputs are raw MEG/EEG X (resampled to 120 Hz) and a one-hot participant index s. Architecture: (1) a spatial attention layer maps sensor space to D1=270 channels via learned Fourier-parameterized attention fields over 2D-projected sensor locations (softmax attention with spatial dropout); (2) a 1×1 convolution (no activation); (3) a participant-specific layer: a learned 1×1 convolution (matrix M_s) per participant along channels to capture inter-individual variability; (4) five residual dilated convolutional blocks: block k contains two residual 1D convolutions (kernel size 3, stride 1, dilations of 2^(2k mod 5) and 2^((2k+1) mod 5)) with batch normalization and GELU activations, outputting D2=320 channels, followed by a third, non-residual convolution outputting 2·D2 channels that a GLU activation halves back to D2; (5) two final 1×1 convolutions with a GELU in between produce F output channels matching the speech representation's dimensionality. To account for neural latencies, brain inputs are shifted by 150 ms into the future relative to the audio.
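A simplified sketch of two of these components, the participant-specific layer and one residual dilated block, assuming a PyTorch implementation (the spatial attention layer is omitted and channel counts are kept constant for brevity; this is not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectLayer(nn.Module):
    """One learned 1x1 convolution (a channel-mixing matrix M_s) per participant."""
    def __init__(self, n_subjects: int, channels: int):
        super().__init__()
        # Initialized near identity; one (channels x channels) matrix per participant.
        self.weights = nn.Parameter(torch.eye(channels).repeat(n_subjects, 1, 1))

    def forward(self, x: torch.Tensor, subject: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); subject: (batch,) integer participant indices.
        return torch.einsum("bct,bdc->bdt", x, self.weights[subject])

class ResidualDilatedBlock(nn.Module):
    """Two residual dilated 1D convolutions followed by a GLU-activated convolution."""
    def __init__(self, channels: int, k: int):
        super().__init__()
        d1, d2 = 2 ** ((2 * k) % 5), 2 ** ((2 * k + 1) % 5)
        self.conv1 = nn.Conv1d(channels, channels, 3, dilation=d1, padding=d1)
        self.conv2 = nn.Conv1d(channels, channels, 3, dilation=d2, padding=d2)
        self.glu_conv = nn.Conv1d(channels, 2 * channels, 3, padding=1)
        self.norm1 = nn.BatchNorm1d(channels)
        self.norm2 = nn.BatchNorm1d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.act(self.norm1(self.conv1(x)))   # residual dilated conv 1
        x = x + self.act(self.norm2(self.conv2(x)))   # residual dilated conv 2
        return F.glu(self.glu_conv(x), dim=1)          # GLU halves 2*channels back to channels
```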
Speech module: Two alternatives for target representations: (a) Deep Mel: a speech-side convolutional network (same as brain module without participant block) trained end-to-end jointly with the brain module to learn latent Mel-derived targets; (b) Pretrained wav2vec 2.0: use wav2vec2-large-xlsr-53, averaging the last four transformer layers as Y. The pretrained option proved superior and is the main focus.
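Extracting wav2vec 2.0 targets with the Hugging Face transformers library might look like the following sketch (the authors' exact extraction and normalization pipeline is not reproduced here):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

def speech_targets(waveform: torch.Tensor, sample_rate: int = 16_000) -> torch.Tensor:
    """Average the last four transformer layers of wav2vec 2.0 as decoding targets."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs.input_values, output_hidden_states=True)
    # hidden_states: tuple of (batch, frames, features) tensors, one per layer.
    last_four = torch.stack(outputs.hidden_states[-4:])
    return last_four.mean(dim=0)                       # (batch, frames, features)
```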
Datasets: Four public datasets comprising 175 participants: two MEG (Schoffelen 2019: Dutch; 96 participants; 273 sensors; 80.9 h; Gwilliams 2022: English; 27 participants; 208 sensors; 56.2 h) and two EEG (Broderick 2019: English; 19 participants; 128 channels; 19.2 h; Brennan & Hale 2019: English; 33 participants; 60 channels; 6.7 h). Stimuli are natural stories or sentences; splits (≈70/20/10% train/val/test) ensure identical sentences/segments are not shared across splits and segments are consistently assigned across participants.
Preprocessing: MEG/EEG are downsampled to 120 Hz, baseline-corrected (subtracting the mean of the first 0.5 s per channel), robust-scaled, and clamped beyond ±20 standard deviations. Audio: Mel spectrograms with 120 Mel bands (STFT frame 512, hop 128, 16 kHz audio) and log compression log(ε + Mel) with ε = 10⁻⁵. For wav2vec 2.0, the last four transformer layers are averaged and standard normalization is applied.
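A rough sketch of this preprocessing using torch and torchaudio (the robust-scaling and clamping details are assumptions consistent with the description above, not the original pipeline):

```python
import torch
import torchaudio

def preprocess_meg(x: torch.Tensor) -> torch.Tensor:
    """x: (channels, time) MEG/EEG already resampled to 120 Hz."""
    x = x - x[:, : int(0.5 * 120)].mean(dim=1, keepdim=True)   # baseline: subtract first 0.5 s mean
    median = x.median(dim=1, keepdim=True).values
    iqr = x.quantile(0.75, dim=1, keepdim=True) - x.quantile(0.25, dim=1, keepdim=True)
    x = (x - median) / (iqr + 1e-8)                            # robust scaling per channel
    return x.clamp(-20, 20)                                    # clamp extreme values after scaling

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=512, hop_length=128, n_mels=120)

def log_mel(waveform: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    return torch.log(eps + mel(waveform))                      # log-compressed Mel spectrogram
```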
Training: Adam optimizer, learning rate 3×10^−4, batch size 256; one epoch = 1,200 updates; early stopping after 10 epochs without validation improvement; best model by validation loss. CLIP loss for contrastive models; MSE for direct Mel regression. Training used two V100 GPUs (16 GB). Hyperparameter analyses provided in Supplementary.
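A minimal training loop reflecting these settings might look as follows (the `brain_module`, `batches` iterator, `validate` callback, and `clip_style_loss` from the earlier sketch are placeholders, not the released training code):

```python
import itertools
import torch

def train(brain_module, batches, validate, max_bad_epochs: int = 10):
    """batches: an endless iterator yielding (meg, subject, speech_target) tuples."""
    opt = torch.optim.Adam(brain_module.parameters(), lr=3e-4)
    best_val, bad_epochs = float("inf"), 0
    while bad_epochs < max_bad_epochs:
        for meg, subject, speech_target in itertools.islice(batches, 1200):  # one epoch = 1,200 updates
            loss = clip_style_loss(brain_module(meg, subject), speech_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        val_loss = validate(brain_module)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0    # keep the best checkpoint by validation loss
        else:
            bad_epochs += 1                       # early stopping after 10 stale epochs
```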
Evaluation: Segment-level: top-10 and top-1 accuracy over all test segments (zero-shot identification among >1,000–1,800 segments depending on dataset). Word-level: group segment probabilities by word occurrence within segments to compute top-1/top-10 word accuracy. Prediction analysis: linear ridge regression predicts decoder’s probability of the true word from features (phonetic features, Zipf frequency, POS, word embeddings, bag-of-words, phrase embeddings) with 5-fold CV, assessed by Pearson’s R. Statistics: Wilcoxon across participants for model comparisons within datasets; Mann–Whitney across participants for cross-dataset comparisons.
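Segment-level evaluation amounts to retrieval over the full test set; a simplified sketch follows (word-level grouping and the ridge-based prediction analysis involve additional bookkeeping not shown here):

```python
import torch
import torch.nn.functional as F

def topk_segment_accuracy(z: torch.Tensor, y: torch.Tensor, k: int = 10) -> float:
    """z: (n_segments, features) brain-derived latents; y: the matching speech latents.

    Every test segment serves as a negative candidate for every other segment,
    so row i is correct if y[i] ranks among the top-k most similar candidates.
    """
    z = F.normalize(z, dim=-1)
    y = F.normalize(y, dim=-1)
    sims = z @ y.t()                                  # (n_segments, n_segments)
    topk = sims.topk(k, dim=-1).indices               # indices of the k best candidates per row
    correct = (topk == torch.arange(len(z)).unsqueeze(1)).any(dim=-1)
    return correct.float().mean().item()
```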
Key Findings
Decoding performance:
- MEG: With wav2vec 2.0 targets and contrastive loss, average top-10 segment accuracy reached 70.7±0.1% (Gwilliams) and 67.5±0.4% (Schoffelen); corresponding top-1 accuracies were 41.3±0.1% and 36.8±0.4% (Extended Data Fig. 1). Best individual participants exceeded 80% top-1 accuracy (Gwilliams dataset).
- EEG: Average top-10 segment accuracy reached 25.7±2.9% (Brennan & Hale) and 17.7±0.6% (Broderick), far above random baselines.
- Random baseline top-10 accuracy was <1% for MEG and ≈0.5–5.3% for EEG depending on dataset.
Model component contributions (Table 2 and ablations):
- Switching from regression on Mel (“Base model”) to contrastive loss (“+ Contrastive”) greatly improved decoding (e.g., Gwilliams MEG: 12.4% → 55.1% top-10; Schoffelen MEG: 20.6% → 55.1%).
- Learning deep Mel targets (“+ Deep Mel”) further improved (Gwilliams: 64.4%; Schoffelen: 61.2%).
- Using pretrained wav2vec 2.0 as speech targets yielded the best results (Gwilliams: 70.7%; Schoffelen: 67.5%; Broderick EEG: 17.7%; Brennan EEG: 25.7%).
- Brain module ablations showed significant drops when removing spatial attention, the participant-specific layer, skip connections, or clamping; the participant layer and clamping were especially critical (e.g., removing the participant layer reduced top-10 accuracy to 42.4%/47.0% on Schoffelen/Gwilliams; removing clamping reduced it to 1.5%/23.6%). Many effects were highly significant (e.g., p < 10^−22 to < 10^−29, dataset-dependent).
MEG vs EEG: After subsampling to homogenize number of channels, participants, and duration, MEG decoding remained substantially better than EEG (Mann–Whitney across participants: all P < 10^−9), indicating device type primarily drives performance differences.
Scaling with participants: Decoding steadily improved with more training participants in MEG datasets, indicating effective learning of shared neural representations plus subject-specific adjustments.
Representational analysis: Decoder predictions correlated most with higher-level linguistic features. Linear models significantly predicted trial-wise probabilities of the true word from POS (P < 0.004), word embeddings (P < 10^−12), bag-of-words embeddings (P < 10^−23), and phrase embeddings (P < 10^−23); higher-level features explained more variance, suggesting reliance on lexical and contextual semantics.
Comparisons to prior scales: Despite being non-invasive and focused on perception, the model achieved, on constrained 50-word tasks, top-1 accuracies of up to 42.9% for isolated words and above 72.5% on average for words within sentences, with the best participants reaching approximately 92–96%, approaching some invasive benchmarks reported for production tasks.
Discussion
The findings demonstrate that short windows (3 s) of non-invasive MEG/EEG contain sufficient information to identify the corresponding speech segment among thousands of candidates, enabling zero-shot decoding of words and phrases not seen during training. This addresses the dual challenge of noisy, variable non-invasive signals and uncertainty about neural language representations by: (1) employing a robust contrastive objective that focuses on informative dimensions; (2) leveraging deep, self-supervised speech representations that align with cortical processing; and (3) training a unified model across many participants while incorporating a participant-specific layer to handle individual variability.
The superiority of wav2vec 2.0 targets over Mel-based targets indicates that high-level speech representations are better suited for brain decoding than low-level acoustics. Ablation results confirm that architectural choices—spatial attention, residual dilated convolutions, GLU/GELU non-linearities, subject-specific layers, and clamping—are integral to performance. MEG markedly outperforms EEG in this task, even when controlling for data size and channels, implicating device characteristics as a primary factor.
Interpretability analyses indicate that the decoder’s success is best explained by high-level lexical and semantic features (phrase embeddings, bag-of-words, word embeddings), rather than phonetic or purely acoustic features, consistent with the hypothesis that cortical representations leveraged by the model include contextual semantics. Together, these results underscore the feasibility of scalable, safe, non-invasive language decoding and highlight the value of contrastive alignment with pretrained speech models for capturing relevant brain–language correspondences.
Conclusion
This work introduces a contrastively trained model that decodes perceived speech from non-invasive MEG/EEG by aligning brain activity with self-supervised speech representations. Across four public datasets (175 participants), the approach achieves strong zero-shot segment identification, with MEG top-10 accuracies up to ~71% (top-1 ~41%) on average and up to ~80% top-1 in the best individuals, and clear gains over baselines. Key contributions include: (1) demonstrating the advantage of targeting pretrained deep speech representations (wav2vec 2.0) over Mel-based targets; (2) a cross-participant architecture with a subject-specific layer, spatial attention, and robust training that requires minimal preprocessing; and (3) evidence that decoding is driven primarily by high-level lexical/semantic features.
Future work should: (a) adapt the framework to speech production (e.g., pairing brain signals with models pretrained on production tasks such as speech/handwriting); (b) augment contrastive decoding with generative components to move beyond candidate selection toward free-form reconstruction; (c) address modality constraints by improving EEG performance and translating MEG advances to clinically viable sensor technologies; and (d) expand datasets to larger, more diverse cohorts to build foundational brain models supporting robust, generalizable decoding in clinical settings.
Limitations
Current decoding performance, while strong for MEG, remains substantially lower for EEG, limiting immediate clinical applicability where EEG is more common. The approach focuses on speech perception, not production; transfer to production is non-trivial due to different neural substrates. The contrastive objective selects from candidate sets and does not generate unconstrained outputs, necessitating generative extensions for open-vocabulary use. Performance varies across participants, and although improved by subject-specific layers, generalization to unseen individuals and sessions may require further adaptation. Differences in devices, sensor layouts, and recording environments can impact decoding; MEG’s superior performance may not translate directly to clinical settings without advances in sensor technology. The method also depends on large datasets and pretrained speech models, which may limit deployment where such resources are scarce.