
Decoding speech perception from non-invasive brain recordings
A. Défossez, C. Caucheteux, et al.
This research by Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King presents a contrastive learning model that decodes speech perception from non-invasive MEG and EEG recordings with remarkable accuracy: up to 41% on average across participants and more than 80% for the best participants. The study points to a promising new approach to studying how the brain processes language.
Introduction
Decoding speech directly from brain activity is a major goal in both healthcare and neuroscience. Invasive techniques, such as intracranial recordings, have shown promising results, with deep learning models decoding basic linguistic features like letters and words. However, these methods require brain surgery, limiting their widespread application. Non-invasive methods based on magnetoencephalography (MEG) and electroencephalography (EEG) offer a safer alternative, but these signals are notoriously noisy and variable across individuals and sessions. Existing approaches using non-invasive recordings often rely on handcrafted features and are trained on individual participants, resulting in limited performance and generalization. This research tackles these challenges with a model trained on a large cohort of participants that leverages speech representations learned through self-supervision. The goal is a deep learning architecture capable of handling noisy, variable non-invasive brain recordings without assuming detailed knowledge of how language is represented in the brain. The study focuses on speech perception in healthy individuals, aiming to establish a robust foundation before translating the findings to speech production in patients with communication impairments. The model uses contrastive learning to align representations from the brain (brain module) with those of a pretrained speech model (speech module, wav2vec 2.0). This design addresses the variability and noise inherent in MEG and EEG recordings while circumventing the need for a detailed account of the brain's language representations.
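The alignment objective just described can be illustrated with a short, self-contained sketch. The snippet below is a minimal, hypothetical CLIP-style contrastive loss between pooled brain and speech embeddings in PyTorch; the function name, the pooling to one vector per segment, and the absence of a learnable temperature are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(brain_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
    """CLIP-style contrastive loss: each brain segment should be most similar
    to its own speech segment and dissimilar to the other segments in the batch.

    brain_emb, speech_emb: (batch, dim) embeddings, one per speech segment.
    """
    brain = F.normalize(brain_emb, dim=-1)
    speech = F.normalize(speech_emb, dim=-1)
    logits = brain @ speech.t()                       # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)           # correct pairings lie on the diagonal

# Random embeddings standing in for the brain module and wav2vec 2.0 outputs.
loss = clip_style_loss(torch.randn(8, 768), torch.randn(8, 768))
```

In practice the speech embeddings would come from the pretrained wav2vec 2.0 model applied to the audio the participant heard, and the brain embeddings from the brain module applied to the simultaneous MEG or EEG recording.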
Literature Review
Extensive research has been conducted on decoding language from brain activity, with varying degrees of success depending on the invasiveness of the recording method and the complexity of the linguistic task. Invasive methods, such as those using electrodes implanted in the cortex, have achieved remarkable results in decoding phonemes, speech sounds, hand gestures, and articulatory movements, with some studies reporting high character-per-minute rates and accuracy exceeding 90%. However, the invasiveness of these techniques presents a significant barrier to wider application. Non-invasive techniques such as MEG and EEG are a safer alternative, but the inherent noise and variability of these signals pose significant challenges. Existing work using MEG and EEG often relies on handcrafted features and single-participant training, resulting in limited performance and generalization. This research directly addresses these limitations by leveraging a large, multi-participant dataset and pretrained, self-supervised speech representations.
Methodology
The study uses a novel model with two main components: a 'brain module' and a 'speech module'. The brain module processes raw MEG/EEG data with a deep convolutional neural network comprising a spatial attention layer that focuses on relevant sensor locations, a participant-specific layer that accounts for inter-individual variability, and a stack of convolutional blocks with residual connections and dilated convolutions that enlarge the receptive field. The speech module is the pretrained wav2vec 2.0 model, which produces deep, contextual representations of speech directly from raw waveforms. The two modules are trained with a contrastive learning objective (CLIP loss) that maximizes the alignment between the brain module's outputs and the pretrained speech representations. Four public MEG and EEG datasets were integrated, comprising data from 175 participants listening to short stories and isolated sentences; the datasets differed in number of channels, number of participants, and recording duration. Preprocessing included resampling to 120 Hz, baseline correction, robust scaling, and clamping of outliers. Model evaluation used segment-level and word-level top-1 and top-10 accuracy. To probe the type of representations captured by the decoder, a linear regression model was trained to predict the decoder's softmax probabilities from different features (phonemes, word frequency, part-of-speech, word embeddings, and phrase embeddings). The contribution of each model component was examined via ablation studies.
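To make the architecture more concrete, here is a minimal, hypothetical PyTorch sketch of a brain module in the spirit described above. The channel counts, number of blocks, dilation schedule, and the use of a plain 1x1 convolution in place of the paper's learned spatial attention are all assumptions for illustration, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class SubjectLayer(nn.Module):
    """One linear remapping per participant, initialized to the identity.
    A simplified stand-in for the paper's participant-specific layer."""
    def __init__(self, n_subjects: int, channels: int):
        super().__init__()
        self.weights = nn.Parameter(torch.eye(channels).repeat(n_subjects, 1, 1))

    def forward(self, x: torch.Tensor, subject: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); subject: (batch,) integer participant ids
        return torch.einsum("bct,bdc->bdt", x, self.weights[subject])

class DilatedBlock(nn.Module):
    """Residual 1-D convolution with dilation, used to grow the temporal
    receptive field without pooling."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.conv(x))              # residual (skip) connection

class BrainModule(nn.Module):
    """Illustrative brain module: sensor mixing (standing in for the learned
    spatial attention), a subject layer, a stack of dilated residual blocks,
    and a final projection to the speech-embedding dimension."""
    def __init__(self, n_sensors: int = 208, n_subjects: int = 175,
                 hidden: int = 320, out_dim: int = 768, n_blocks: int = 5):
        super().__init__()
        self.spatial = nn.Conv1d(n_sensors, hidden, kernel_size=1)
        self.subject = SubjectLayer(n_subjects, hidden)
        self.blocks = nn.Sequential(
            *[DilatedBlock(hidden, dilation=2 ** k) for k in range(n_blocks)])
        self.head = nn.Conv1d(hidden, out_dim, kernel_size=1)

    def forward(self, meg: torch.Tensor, subject: torch.Tensor) -> torch.Tensor:
        x = self.spatial(meg)            # (batch, hidden, time)
        x = self.subject(x, subject)     # participant-specific remapping
        x = self.blocks(x)               # dilated residual stack
        return self.head(x)              # (batch, out_dim, time)

module = BrainModule()
meg = torch.randn(4, 208, 360)           # 3 s of MEG at 120 Hz, 208 sensors
emb = module(meg, subject=torch.tensor([0, 1, 2, 3]))
```

The output keeps one embedding per time step, so it can be compared against wav2vec 2.0 features of the corresponding audio segment under the contrastive objective.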
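The preprocessing steps listed above can likewise be sketched in a few lines of NumPy/SciPy. The baseline window, the clamp threshold, and the per-channel robust scaling shown here are assumed values for illustration; the paper's exact parameters may differ.

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(raw: np.ndarray, sfreq: float, clamp: float = 20.0) -> np.ndarray:
    """Resample to 120 Hz, baseline-correct, robust-scale per channel, and clamp
    outliers. `raw` is a (channels, time) MEG/EEG array sampled at `sfreq` Hz."""
    # Resample to 120 Hz.
    x = resample_poly(raw, up=120, down=int(sfreq), axis=-1)
    # Baseline correction: subtract each channel's mean over an initial window.
    x = x - x[:, :60].mean(axis=-1, keepdims=True)        # first 0.5 s at 120 Hz
    # Robust scaling: centre on the median, scale by the interquartile range.
    med = np.median(x, axis=-1, keepdims=True)
    q75, q25 = np.percentile(x, [75, 25], axis=-1, keepdims=True)
    x = (x - med) / np.maximum(q75 - q25, 1e-8)
    # Clamp extreme values to limit the influence of artefacts.
    return np.clip(x, -clamp, clamp)

clean = preprocess(np.random.randn(208, 10_000), sfreq=1000.0)
```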
Key Findings
The model demonstrates remarkable accuracy in decoding speech perception from non-invasive brain recordings. On MEG data, top-10 accuracy reached up to 70.7% on average across participants, while top-1 accuracy reached 41.3% on average and exceeded 80% for the best-performing participants. This performance was significantly above chance and consistently better than alternative baselines that used different loss functions or speech representations. MEG consistently outperformed EEG; after controlling for recording duration and number of sensors, the gap appears attributable primarily to the recording modality itself. Ablation studies highlighted the importance of several model components, including contrastive learning, the spatial attention module, skip connections, and both the initial and final convolutional layers of the brain module. Furthermore, decoding performance improved with the number of participants in the training set, indicating the effectiveness of the multi-participant training approach. Analysis of the decoder's predictions revealed a strong correlation between decoding accuracy and higher-level semantic features (phrase embeddings), suggesting that the model primarily captures lexical and contextual information. Compared with existing research, this model significantly outperforms previous attempts at decoding non-invasive brain activity in terms of vocabulary size and zero-shot performance.
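The segment-level retrieval metric reported above can be computed directly from the two sets of embeddings. The sketch below is a hypothetical PyTorch implementation of top-k accuracy, assuming each brain segment is scored against every candidate speech segment by cosine similarity; it is not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

def topk_accuracy(brain_emb: torch.Tensor, speech_emb: torch.Tensor, k: int = 10) -> float:
    """Segment-level top-k retrieval accuracy: for each brain segment, rank all
    candidate speech segments by cosine similarity and check whether the true
    segment appears among the k best matches."""
    brain = F.normalize(brain_emb, dim=-1)
    speech = F.normalize(speech_emb, dim=-1)
    sims = brain @ speech.t()                         # (n_segments, n_segments)
    topk = sims.topk(k, dim=-1).indices               # indices of the k best candidates
    truth = torch.arange(sims.size(0)).unsqueeze(-1)  # correct index for each row
    return (topk == truth).any(dim=-1).float().mean().item()

# With random embeddings, accuracy should hover around chance (k / n_segments).
acc10 = topk_accuracy(torch.randn(1000, 768), torch.randn(1000, 768), k=10)
```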
Discussion
The study successfully demonstrates that accurate decoding of speech perception is possible from non-invasive MEG and EEG recordings using a novel model that leverages contrastive learning and pretrained speech representations. The findings address the limitations of previous approaches by effectively handling noisy signals and leveraging high-level semantic information. The superior performance of the model compared to baselines underscores the significance of the chosen methodology. The multi-participant architecture, combined with the contrastive learning objective and the utilization of pretrained speech representations, enables robust generalization across individuals and datasets. The reliance on high-level semantic representations suggests that the model effectively captures the meaningful aspects of speech as represented in the brain. These results provide strong evidence supporting the feasibility of non-invasive language decoding and offer a crucial stepping stone towards developing brain-computer interfaces for patients with communication disorders.
Conclusion
This study makes several significant contributions. It demonstrates the efficacy of leveraging pretrained speech models for non-invasive brain decoding without requiring participants to tediously repeat a fixed set of words. It highlights the benefits of specific design choices, including contrastive learning and a multi-participant architecture, for processing continuous EEG and MEG recordings. Finally, it suggests that high-level semantic representations are crucial for effective decoding. Future work should focus on adapting the model to speech production and incorporating a generative module to enhance its capabilities. Further investigation into the interaction between EEG and MEG for decoding is also warranted.
Limitations
The study primarily focuses on speech perception in healthy individuals, limiting the direct applicability to speech production in clinical settings. While the model shows robustness to noise and artifacts, further testing is needed to evaluate its performance in real-world clinical scenarios with more challenging recording conditions. The reliance on publicly available datasets, while beneficial for reproducibility, may limit the generalizability of findings to other datasets with different recording parameters and experimental paradigms. Although a large dataset was used, the inclusion of more diverse datasets and participants would further strengthen the model's generalizability.