Neural signals, machine learning, and the future of inner speech recognition

Computer Science


A. T. Chowdhury, A. Hassanein, et al.

Inner speech recognition (ISR) aims to decode covert thought from neural signals using machine learning—ranging from SVMs and random forests to CNNs—combined with signal preprocessing and cognitive modeling. This review synthesizes ISR methodologies, evaluates challenges and limitations, and outlines future applications in BCIs and assistive communication.

Introduction
The paper introduces inner speech (covert speech) as a critical cognitive process underpinning planning, decision-making, and self-regulation. It reviews mechanistic models such as corollary discharge (predictive sensory consequences of planned articulations) and perceptual simulation (reactivation of stored speech-related neural patterns in regions like Broca’s and Wernicke’s). Inner speech disruptions are observed in schizophrenia, frontotemporal dementia, autism, aphasia, and depression, underscoring the value of accurate inner speech recognition (ISR) for BCIs and assistive technologies. The authors highlight technological obstacles—low decoding accuracy, poor generalizability across users, and insufficient real-time performance—creating a need for more robust ISR systems. They position ML as pivotal for decoding neural signals, noting trade-offs among acquisition modalities: EEG (high temporal resolution, low SNR), MEG (better spatial localization, limited practicality), fMRI (excellent spatial but poor temporal resolution), and invasive ECoG (high fidelity but limited clinical contexts). CNNs are emphasized for their effectiveness in capturing complex, non-linear brain activity for ISR.
Literature Review
The review synthesizes cross-disciplinary work on inner speech. Huang et al. (2002) compared cortical activations for silent vs. overt speech using fMRI. Geva et al. (2011) showed that post-stroke patients with impaired overt speech also exhibit inner speech deficits, with significant differences (p < 0.05) across inner speech tasks. Stark et al. (2017) found strong correlations (r = 0.95, p < 0.01) between inner speech and overt naming, and between inner speech and mean length of written utterances, in chronic aphasia. Simistira Liwicki et al. (2022) achieved 35.20% (vowels) and 29.21% (words) classification accuracy with iSpeech-CNN on public EEG datasets; Berg et al. (2021) reported an average accuracy of 29.7% for word recognition with an EEGNet-based CNN across eight subjects (near chance). Kiroy et al. (2022) reported MLP accuracies of 49–61% (three classes) and 33–40% (seven classes), outperforming random baselines. Nalborczyk et al. (2020) reported inner speech nonword classification accuracy of 0.472 [95% CI (0.426, 0.518)] vs. 0.847 [95% CI (0.814, 0.876)] for overt speech, highlighting the difficulty of decoding inner speech. Shepelev et al. (2021) noted low accuracies for intonation classification (confident ~32% ±6%; uncertain ~48% ±5%; neutral ~58% ±8%). Arjestan et al. (2016) advanced BCIs that decode overt, semi-overt, and covert speech from EEG for locked-in syndrome (LIS) contexts. Collectively, these works demonstrate both progress and persistent challenges in ISR, particularly the gap between inner and overt speech decoding performance and the need for improved methodologies for diverse BCI applications.
Methodology
As a review, the paper structures the ISR pipeline and methodological landscape rather than conducting a single new experiment. Key components include:

- Data acquisition modalities: EEG (high temporal resolution, low spatial resolution, typical SNR ~0 to 5 dB), MEG (high temporal, medium-to-high spatial resolution, SNR ~ -3 to 10 dB), fMRI (very high spatial, low temporal resolution, SNR ~ -30 to 40 dB), and ECoG (high temporal and spatial resolution, SNR ~10–20 dB). Practicality varies: EEG is portable and low-cost; MEG requires shielded environments; fMRI is bulky and not real-time; ECoG is invasive and limited to neurosurgical patients.
- Preprocessing: band-pass filtering (e.g., 0.5–100 Hz), notch filtering (e.g., 50 Hz), independent component analysis (ICA) to remove ocular/muscular artifacts, adaptive filtering (LMS/RLS), wavelet denoising combined with ICA, and adversarial denoising via GANs and WGAN-GP (reported SNR gains of up to 14.5 dB), as well as a pix2pix autoencoder GAN for EMG noise removal. Careful electrode placement and artifact-reduction strategies are stressed for EEG. (A hedged preprocessing sketch follows this list.)
- Datasets: open and proprietary datasets include Pressel Coretto et al. (2017) EEG (15 subjects; imagined vs. pronounced speech; 50 trials each: 40 imagined, 10 pronounced), Nieto et al. (2022) EEG (10 participants; 136-channel system), Jones & Voets (2021) single-subject 7T fMRI with hours of covert reading/repetition, Shepelev et al. (2021) psychophysical voice database (12 female students; mean age 19.6 ± 0.8 years), Arjestan et al. (2016) EEG (5 males; overt, semi-overt, and covert Persian syllables and vowels), Lee et al. (2021a/b) EEG (speaker classification; 9 subjects; ~300 trials per condition), and Martin et al. (2018) ECoG (7 epilepsy patients; platinum-iridium grids).
- Mathematical formulations: decoding is framed as regression and classification. Linear regression links stimulus features to neural activity, Y(t) = Σ_p w(p)·X(t, p), written in vector form as s = Xw + ε, where ε is Gaussian noise. Decoders infer the most probable stimulus values from brain activity; models include linear methods, HMMs, SVMs, and neural networks. (A least-squares fit of this linear model is sketched below.)
- Architectural frameworks: traditional ML (SVM, random forests, RLDA, extreme learning machines) versus deep learning (CNNs including EEGNet, LSTM/GRU recurrent networks, and CRNNs). Additional modeling includes GLMs for fNIRS and Bayesian multilevel linear models. Interpretability tools (SHAP, LIME, saliency maps) are recommended for clinical transparency.
- Feature extraction: power spectral density (Welch's method) focusing on alpha/beta/gamma bands; common spatial patterns (CSP) for spatial filtering; multivariate empirical mode decomposition (EMD); band-wise CSP to form compact feature vectors scaled to 0–1; and high-gamma and spectrotemporal features for speech representations.
- Classification strategies: CNNs/EEGNet often outperform classical models on small word sets but require larger datasets and careful hyperparameter tuning (e.g., LeakyReLU for shallow/deep CNNs, ELU for EEGNet). Linear mapping models are used for ECoG speech features with gradient descent and jackknife validation. Subject-dependent vs. subject-independent training is discussed, along with cross-subject transfer and domain adaptation. (A CSP + SVM classification sketch follows this list.)
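To make the EEG preprocessing chain concrete, here is a minimal sketch using MNE-Python; the file path, number of ICA components, and excluded component indices are illustrative placeholders, not settings taken from the reviewed studies.

```python
# Minimal EEG preprocessing sketch (band-pass, notch, ICA) with MNE-Python.
# File name, component count, and excluded indices are placeholders only.
import mne

# Load a raw EEG recording (hypothetical file).
raw = mne.io.read_raw_fif("inner_speech_subject01_raw.fif", preload=True)

# Band-pass 0.5-100 Hz and notch at 50 Hz (power-line interference).
raw.filter(l_freq=0.5, h_freq=100.0)
raw.notch_filter(freqs=50.0)

# ICA to isolate and remove ocular/muscular artifact components.
ica = mne.preprocessing.ICA(n_components=20, random_state=42)
ica.fit(raw)
ica.exclude = [0, 1]  # artifact components, normally chosen after inspection
raw_clean = ica.apply(raw.copy())
```

In practice the excluded components are selected by inspecting their topographies and time courses (or with automated EOG/EMG correlation checks) rather than hard-coded as above.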
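The linear model s = Xw + ε described above can be fit by (ridge-regularized) least squares. The sketch below uses synthetic NumPy arrays purely to show the shapes and the closed-form solution; no real neural data or published parameters are implied.

```python
# Ridge-regularized least-squares fit of the linear model s = Xw + eps.
# X: neural features (n_samples x n_features); s: stimulus/label values. Toy data.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 64
X = rng.standard_normal((n_samples, n_features))        # e.g., band-power features
w_true = rng.standard_normal(n_features)
s = X @ w_true + 0.1 * rng.standard_normal(n_samples)   # noisy observations

lam = 1.0  # ridge penalty guards against the low-SNR, small-sample regime
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ s)
s_pred = X @ w_hat  # decoded estimate of the stimulus values
```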
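For the classical classification route, a common pipeline is CSP spatial filtering followed by an SVM. The sketch below assumes epoched EEG of shape (n_trials, n_channels, n_times) and toy binary labels; it is an illustrative pipeline, not the exact configuration of any cited study.

```python
# CSP + SVM classification sketch for epoched inner-speech EEG (toy data).
import numpy as np
from mne.decoding import CSP
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 32, 256))   # 120 trials, 32 channels, 256 samples
y = rng.integers(0, 2, size=120)          # two imagined-word classes (placeholder)

clf = make_pipeline(
    CSP(n_components=4, log=True),        # spatial filters -> log-variance features
    SVC(kernel="rbf", C=1.0),
)
scores = cross_val_score(clf, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```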
Key Findings
- ML, especially CNN-based approaches (e.g., EEGNet), can capture non-linear spatiotemporal patterns of inner speech-related EEG better than traditional feature-based classifiers, though accuracy frequently remains near chance for multi-class word sets.
- Reported performance metrics vary widely: Simistira Liwicki et al. (2022) iSpeech-CNN reached 29.04% accuracy over 11 classes with a weighted F-score of 36.18%; Berg et al. (2021) achieved 29.67% accuracy for 4 words; Kiroy et al. (2022) showed MLP accuracy up to 47.3% for 6 directional words and SVM 43.7%; Jones & Voets (2021) single-subject phoneme-level decoders achieved median test accuracies of 47% (transfer) and 50.82% (replication); Nalborczyk et al. (2020) reported inner speech nonword classification accuracy of 0.472 vs. 0.847 for overt speech; Shepelev et al. (2021) intonation classification accuracies were low (confident ~32% ±6%, uncertain ~48% ±5%, neutral ~58% ±8%).
- Signal acquisition trade-offs: EEG is most practical for ISR but suffers from low SNR; MEG provides better localization but is lab-bound; fMRI offers mapping utility but poor real-time viability; ECoG yields the highest decoding fidelity in clinical contexts.
- Preprocessing advances improve EEG quality: ICA, adaptive filters, wavelet denoising, and adversarial GAN approaches (with reported SNR improvements of up to 14.5 dB) facilitate more reliable feature extraction.
- Framework synthesis: the paper organizes ISR into a mathematical and architectural framework comprising acquisition, preprocessing, feature extraction (PSD, CSP, EMD), and classification (SVM, RF, RLDA, ELM, CNN/EEGNet, LSTM/GRU), and highlights interpretability needs (SHAP/LIME).
- Generalization challenges: subject variability, limited datasets, and heterogeneous class designs hinder cross-subject performance and benchmarking; domain adaptation and standardized pipelines are needed (a subject-independent evaluation sketch follows this list).
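To make the cross-subject generalization point concrete, subject-independent performance is usually estimated with leave-one-subject-out splits. The sketch below, with placeholder feature vectors and subject IDs, shows one way to set this up with scikit-learn; it is not drawn from any specific study in the review.

```python
# Leave-one-subject-out evaluation sketch for subject-independent ISR (toy data).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 32))     # e.g., precomputed band-power/CSP features
y = rng.integers(0, 4, size=400)       # four imagined-word classes (placeholder)
groups = np.repeat(np.arange(8), 50)   # 8 subjects, 50 trials each

# Each fold trains on 7 subjects and tests on the held-out subject.
logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=logo)
print("Per-subject accuracies:", np.round(scores, 3))
```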
Discussion
The review addresses the central challenge of decoding inner speech by synthesizing how ML/DL models, combined with careful signal acquisition and preprocessing, can improve ISR performance. CNNs/EEGNet are well-suited to learn spatiotemporal EEG features, but their advantages depend on dataset size, quality, and robust artifact handling. ECoG studies show upper bounds on decoding performance but are not broadly applicable. Reported accuracies underline the gap between inner and overt speech decoding, emphasizing the need for improved features, multimodal integration, and transfer learning to enhance generalization. The work’s structured framework (acquisition → preprocessing → features → models → evaluation) clarifies the ISR landscape and guides the design of more reliable pipelines. Clinical relevance is highlighted through BCI applications for individuals with speech impairments, while interpretability (XAI methods) and ethical safeguards (privacy, autonomy) are crucial for adoption. Overall, the synthesis shows meaningful progress yet underscores the necessity of benchmark datasets, standardized evaluation, and real-time, user-centered systems.
Conclusion
The paper concludes that machine learning—particularly CNN-based architectures like EEGNet—plays a pivotal role in advancing inner speech recognition by effectively extracting and classifying complex neural patterns. It consolidates ISR methodologies into a structured framework and compares model performance across modalities and datasets. While ISR faces significant challenges (low SNR, limited datasets, weak generalizability, interpretability concerns, ethical issues), ongoing improvements in signal acquisition, preprocessing (including adversarial denoising), feature learning, and multimodal fusion promise better accuracy and robustness. Future directions include: developing large, diverse, and open benchmark datasets; integrating XAI for clinical trust; leveraging domain adaptation and transfer learning for cross-subject generalization; exploring hybrid modalities (EEG+fMRI/fNIRS); and building low-latency, real-time pipelines with portable hardware. These steps can enable more effective, inclusive, and responsible ISR systems and deepen scientific understanding of inner speech.
Limitations
Key limitations include: low SNR and artifact contamination in non-invasive recordings (especially EEG), impeding precise decoding; substantial inter-subject variability in neural representations of inner speech, reducing generalizability; scarcity of large, diverse, and publicly available ISR datasets, hindering reproducibility and fair benchmarking; heterogeneity in class sets, preprocessing, and evaluation metrics across studies, limiting direct comparability; interpretability challenges in deep learning models that can affect clinical trust; ethical, privacy, and neurosecurity concerns inherent to decoding highly sensitive inner speech; and practical constraints for real-time deployment, particularly for modalities like fMRI and MEG.