
Engineering and Technology
Efficient Pause Extraction and Encode Strategy for Alzheimer's Disease Detection Using Only Acoustic Features from Spontaneous Speech
J. Liu, F. Fu, et al.
This research by Jiamin Liu, Fan Fu, Liang Li, Junxiao Yu, Dacheng Zhong, Songsheng Zhu, Yuxuan Zhou, Bin Liu, and Jianqing Li presents a method for detecting Alzheimer's disease from speech alone: silent pauses are extracted with voice activity detection, encoded as binary sequences, and classified with an ensemble of machine-learning models. The findings show that acoustic features, without any transcription, can meaningfully improve diagnostic accuracy and support practical health screening.
~3 min • Beginner • English
Introduction
Alzheimer's disease (AD) is an irreversible neurodegenerative disorder with growing global prevalence, making early detection critical for intervention. Speech and language changes occur early in AD, and clinical screening often relies on subjective neuropsychological assessments. Automated detection using spontaneous speech offers a scalable alternative. While text-based methods can achieve high performance, they depend on accurate transcription and may not generalize well across languages and accents. Acoustic-only approaches are more robust to language differences and extract information directly from the signal. Prior work has used standardized acoustic feature sets (ComParE, eGeMAPS) and other signal-processing features, as well as pause characteristics, to distinguish AD from controls. Silent pauses, which are easier to derive directly from audio than filled pauses, have shown strong associations with cognitive decline and are practical for daily conversation analysis. This study investigates silent pauses as a primary cue for AD detection using only acoustic signals. The authors propose a voice activity detection (VAD)-based pause feature (VAD Pause) encoded as binary sequences and an ensemble machine-learning framework. The approach is evaluated on two English public datasets (ADReSS, ADReSSo) and a small local Chinese dataset. The main contributions are: (1) introduction of the VAD Pause feature on English datasets and testing on a Chinese dataset; (2) a machine-learning ensemble using only acoustic features; (3) statistical evidence that VAD Pause and the ensemble outperform standard acoustic feature sets.
Literature Review
Prior studies have explored both text-based and acoustic-only methods for AD detection from spontaneous speech. Text-plus-acoustic systems have reported high accuracies (e.g., Yuan et al. achieving up to 89.6%), and end-to-end deep learning approaches have shown strong AUCs (e.g., 0.846). However, these approaches can be affected by transcription errors and may not generalize to non-native speakers or dialects. Acoustic-only approaches avoid these issues. Luz used spontaneous picture description speech from the Pitt database and achieved around 68% accuracy using statistical and nominal acoustic features with a Bayesian classifier. Standardized feature sets—ComParE (6373 features) and eGeMAPS (88 features)—enable comparable research across tasks and have been widely used. Other signal-processing features (higher-order spectra, fractal, wavelet-packet) have also been investigated for AD. Silent pauses have particular relevance: AD patients typically show more and longer pauses due to lexical retrieval difficulties and reduced fluency. Several studies have examined pause distribution, frequency, duration, and position, often finding silent pauses to be more discriminative than filled pauses and easier to obtain without transcription. Yuan et al. encoded pauses into discrete categories based on duration for text-based models, achieving high performance but still relying on transcripts and discrete representations. This work focuses on directly encoding silent pauses from audio via VAD to capture temporal pause patterns without transcription and to improve generalization across languages.
Methodology
System framework: Participants performed the "Cookie Theft" picture description task while being recorded. Audio was preprocessed, features were extracted, and classifiers were trained at the segment level; final subject labels were obtained via majority voting across segments.
Datasets: Two public English datasets were used: ADReSS (156 participants) and ADReSSo (237 participants), both balanced for age and gender, grouping MCI and dementia as the AD class versus matched controls. Each dataset was split into 70% training and 30% test sets. A local Chinese dataset of 10 recordings (5 AD, 5 controls) was collected at Nanjing Brain Hospital under IRB approval with informed consent; inclusion used MMSE cutoffs that varied with education level, alongside standard exclusion criteria. Recordings (25–183 s) were captured with a SONY ICD-TX660 recorder.
Preprocessing: All audio was converted to WAV, resampled to 44,100 Hz, and converted to mono.
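The preprocessing step above (mono conversion plus resampling) can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the function name and the use of SciPy's polyphase resampler are assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(audio: np.ndarray, sr: int, target_sr: int = 44_100) -> np.ndarray:
    """Convert a (possibly stereo) signal to mono and resample to target_sr."""
    if audio.ndim == 2:                      # (samples, channels) -> average to mono
        audio = audio.mean(axis=1)
    if sr != target_sr:
        g = np.gcd(sr, target_sr)            # rational resampling factors
        audio = resample_poly(audio, target_sr // g, sr // g)
    return audio.astype(np.float32)
```

For example, a 1-second stereo clip at 16 kHz comes out as a 44,100-sample mono array.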
Feature extraction:
- VAD Pause feature: Audio was framed at 0.03 s and each frame classified as voiced (1) or non-voiced/silent (0) using Google WebRTC VAD. The method computes sub-band (six bands: 80–250 Hz, 250–500 Hz, 500 Hz–1 kHz, 1–2 kHz, 2–3 kHz, 3–4 kHz) energies and compares total energy to a threshold. For each sub-band, likelihoods under silent and voiced Gaussian mixture models are computed; sub-band and total log-likelihood ratios are formed and compared to thresholds to decide voiced vs silent per frame. The resulting binary sequence serves as the pause feature.
- Common acoustic feature sets: Using openSMILE, ComParE 2013 (6373 features, including energy, spectral, MFCC, HNR, voice quality, etc., with statistical functionals) and eGeMAPS (88 features, covering F0 semitone, jitter, shimmer, loudness, spectral flux, MFCC, formants, alpha ratio, Hammarberg index, slope V0, with functionals) were extracted for comparison and ensemble fusion.
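The framing-and-binarization step behind the VAD Pause feature can be illustrated with a simplified stand-in. The paper uses the WebRTC VAD, which scores six sub-band energies under silent and voiced Gaussian mixture models and thresholds log-likelihood ratios (and which, in common implementations, expects 8/16/32/48 kHz input); the sketch below replaces that decision rule with a plain energy threshold purely to show how a 0.03 s binary pause sequence is formed. The threshold value and function name are assumptions.

```python
import numpy as np

def vad_pause_sequence(audio: np.ndarray, sr: int,
                       frame_s: float = 0.03,
                       energy_thresh: float = 0.01) -> np.ndarray:
    """Frame the signal at frame_s seconds and mark each frame voiced (1)
    or silent (0) by mean energy -- a simplified stand-in for the WebRTC
    GMM-based sub-band decision."""
    frame_len = int(sr * frame_s)
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return (energy > energy_thresh).astype(np.int8)
```

In the resulting sequence, pauses appear as runs of zeros, so pause durations are simply run lengths multiplied by 0.03 s.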
Ensemble classification and voting: Recordings were segmented into 4-second segments. Five classic machine-learning classifiers were trained at segment level: linear discriminant analysis (LDA), decision trees (DT), k-nearest neighbors (KNN), linear SVM (SMO solver), and tree bagger (TB). Subject-level labels were obtained by majority voting across segment-level predictions. An ensemble fused outputs from different feature sets (VAD Pause with ComParE or eGeMAPS) using majority voting, with TB identified as the most accurate and stable classifier.
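The two-level voting described above (segment predictions to a subject label, then fusion across feature sets) can be sketched with plain NumPy. Segment-level classifier outputs are taken as given arrays here rather than retraining the five models, and the tie-breaking rule toward the positive class is an assumption not stated in the source.

```python
import numpy as np

def subject_vote(segment_preds: np.ndarray, subject_ids: np.ndarray) -> dict:
    """Majority-vote segment-level 0/1 predictions into one label per subject."""
    labels = {}
    for sid in np.unique(subject_ids):
        votes = segment_preds[subject_ids == sid]
        labels[int(sid)] = int(votes.mean() >= 0.5)  # ties go to the positive class
    return labels

def ensemble_vote(per_feature_labels: list) -> dict:
    """Fuse subject labels from several feature sets by majority vote."""
    fused = {}
    for sid in per_feature_labels[0]:
        votes = [labels[sid] for labels in per_feature_labels]
        fused[sid] = int(np.mean(votes) >= 0.5)
    return fused
```

For instance, a subject whose segments are predicted 1, 1, 0 is labeled AD; fusing VAD Pause with ComParE or eGeMAPS outputs then applies the same rule one level up.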
Evaluation and statistics: Binary classification metrics included accuracy, precision, recall, and F1. For fair comparison with prior work, classifier parameters followed the literature; each classification was repeated five times with data shuffling, reporting averages. Two-way ANOVA tested effects of feature type (ComParE, eGeMAPS, VAD Pause, ensemble) and classifier (LDA, DT, KNN, SVM, TB) on accuracy; significant interactions were further examined with multiple comparisons to identify the best feature–classifier combinations.
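The two-way ANOVA over feature type and classifier can be reproduced in outline for a balanced design (equal repeats per feature-classifier cell) using textbook sum-of-squares formulas. The data below are synthetic; this sketch does not reproduce the paper's actual accuracy tables or its multiple-comparison follow-up.

```python
import numpy as np
from scipy.stats import f as f_dist

def two_way_anova(y: np.ndarray) -> dict:
    """Balanced two-way ANOVA with interaction.
    y has shape (a, b, n): a feature types, b classifiers, n repeats."""
    a, b, n = y.shape
    grand = y.mean()
    mA = y.mean(axis=(1, 2))          # feature-type means
    mB = y.mean(axis=(0, 2))          # classifier means
    mAB = y.mean(axis=2)              # cell means
    ssA = b * n * ((mA - grand) ** 2).sum()
    ssB = a * n * ((mB - grand) ** 2).sum()
    ssAB = n * ((mAB - mA[:, None] - mB[None, :] + grand) ** 2).sum()
    ssE = ((y - mAB[:, :, None]) ** 2).sum()
    dfA, dfB, dfAB, dfE = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)
    results = {}
    for name, ss, df in [("feature", ssA, dfA), ("classifier", ssB, dfB),
                         ("interaction", ssAB, dfAB)]:
        F = (ss / df) / (ssE / dfE)
        results[name] = (F, f_dist.sf(F, df, dfE))
    return results
```

With accuracies arranged as a (4 feature types, 5 classifiers, 5 repeats) array, a small p-value on the "feature" row corresponds to the reported significant feature effect.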
Key Findings
- Visual analysis showed AD speakers produced more and longer pauses with poorer speech coherence than controls; VAD Pause sequences highlighted these differences clearly.
- On ADReSS (test set), VAD Pause outperformed ComParE on several classifiers; with TB, results were: accuracy 65.4%, F1 67.9%, precision 63.4%, recall 73.3%, exceeding the acoustic baseline accuracy of 62.5%.
- On ADReSSo (test set), VAD Pause outperformed eGeMAPS for most classifiers; with TB, results were: accuracy 65.6%, F1 62.3%, precision 68.3%, recall 58.3%, exceeding the acoustic baseline accuracy of 64.8%.
- Ensemble fusion further improved performance and stability. Reported gains over several baseline methods exceeded 5% (approximately +8% accuracy on ADReSS and +5.9% on ADReSSo). With TB, ensemble accuracies reported in literature comparison were 70.0% (ADReSS, VAD Pause + ComParE) and 70.7% (ADReSSo, VAD Pause + eGeMAPS), competitive with non-DL methods and approaching some DL approaches.
- Local Chinese dataset: The pause-sequence-based method achieved 80% accuracy, correctly identifying most AD patients; some controls with higher pause frequencies were misclassified, likely reflecting language-specific pausing characteristics.
- Two-way ANOVA indicated both features and classifiers significantly affected accuracy (p < 0.05), with significant interactions. Best-performing combinations included: ensemble with DT, ensemble with SVM, VAD Pause with TB, and ensemble with TB.
- Overall, VAD Pause features carried higher classification value than ComParE/eGeMAPS alone and generalized across English and Chinese data.
Discussion
The findings support silent pauses as an informative biomarker of cognitive impairment in AD, reflecting increased disfluency and lexical retrieval difficulties. By encoding pauses directly from audio with VAD, the method avoids transcription errors and language dependencies inherent in text-based approaches, enabling more practical and generalizable screening. Across two English datasets, VAD Pause features consistently outperformed standard acoustic feature sets on several classifiers, and ensemble fusion further improved accuracy and reduced variance. Results on a small Chinese dataset, trained on English data, suggest cross-lingual robustness of pause-based detection, though language-specific pausing patterns can affect control classification. Statistical analyses confirmed that both feature choice and classifier selection contribute to performance; tree-based ensembles (TB) and the proposed ensemble fusion yielded robust, high accuracies among non-deep-learning methods. Compared to deep learning approaches, the proposed method trades a small amount of accuracy for simplicity, interpretability, and lower resource requirements. The work highlights opportunities to combine VAD Pause with acoustic embeddings or other features to further enhance accuracy while maintaining practicality. The approach may also extend to other acquired language disorders where silent pauses are informative.
Conclusion
This study introduces a VAD-based pause feature (VAD Pause) that encodes voiced vs silent frames into binary sequences for AD detection using only acoustic signals. Evaluations on ADReSS, ADReSSo, and a local Chinese dataset show that VAD Pause outperforms standard acoustic feature sets, and that an ensemble fusion strategy further boosts accuracy and stability. The approach generalizes across languages and avoids reliance on transcription, offering a practical screening tool. Future work will collect larger local datasets, explore integration with deep learning (e.g., acoustic embeddings and neural classifiers), and adapt the method to other acquired speech and language disorders.
Limitations
- The local dataset is small (n = 10) and not publicly available, limiting generalizability and independent replication.
- Deep learning classifiers were not explored; while non-DL methods are practical, DL may yield higher accuracy with sufficient data.
- Language-specific pausing characteristics (e.g., in Chinese with dialectal influence) led to some misclassification of controls, indicating potential sensitivity to linguistic and cultural factors.
- Evaluations were limited to the picture description task; performance on other task types or free conversation was not assessed.
- Reported improvements were benchmarked against specific baselines and classic ML setups; broader comparisons across varied preprocessing and classifier settings were not exhaustively explored.