Efficient Pause Extraction and Encode Strategy for Alzheimer's Disease Detection Using Only Acoustic Features from Spontaneous Speech

J. Liu, F. Fu, et al.

Discover an innovative method for detecting Alzheimer's Disease through speech analysis! This research, conducted by Jiamin Liu, Fan Fu, Liang Li, Junxiao Yu, Dacheng Zhong, Songsheng Zhu, Yuxuan Zhou, Bin Liu, and Jianqing Li, shows how extracting speech pauses with voice activity detection and classifying them with an ensemble of machine learning models can improve detection accuracy. The findings highlight the potential of purely acoustic features in health technology.

Introduction
Alzheimer's Disease (AD) is a debilitating and irreversible neurodegenerative disease affecting millions globally. Early detection is crucial for intervention and delaying cognitive decline. Deterioration in speech and language is an early indicator of AD, but current clinical evaluations are subjective and lack quantitative measures suitable for home testing. While research has explored the use of spontaneous speech for AD detection, methods relying on text-based features are susceptible to transcription errors, especially with non-native speakers or those with accents. Acoustic feature-based approaches offer robustness to language variations. Previous studies have shown that speech pauses correlate with cognitive function, with AD patients exhibiting increased pauses and disfluency. However, existing methods for pause analysis often rely on manual transcription or discrete representations of pause duration, potentially missing crucial information. This research addresses these limitations by proposing a novel method for pause extraction and encoding using only acoustic signals, aiming to provide a simple, accessible, and generalizable screening tool for AD detection.
Literature Review
Several studies have used spontaneous speech for AD detection. Yuan et al. achieved high accuracy with combined text-based and acoustic features, but their approach depended on transcription accuracy. Agbavor et al. developed an end-to-end AD detection method, though it too had limitations. Acoustic-only methods have shown promise, with Luz achieving 68% accuracy using statistical and nominal acoustic features. Standardized feature sets such as ComParE and eGeMAPS facilitate comparison across studies, and researchers have also explored higher-order spectral features, fractal features, and wavelet-packet-based features. Studies of speech pauses have demonstrated their diagnostic value, with Vincze et al. showing differences in pause length, number, and rate between AD patients and controls. Yuan et al. categorized pauses by duration and punctuation and achieved high accuracy, but their method still relied on speech transcription. The present study builds on this work with a more efficient, language-independent approach to pause feature extraction and encoding.
Methodology
The proposed method consists of several stages. First, spontaneous speech is collected from participants performing a picture description task. The recordings undergo preprocessing: format conversion, sample rate normalization, and conversion to a mono channel. Two types of acoustic features are then extracted: (1) a novel voice activity detection (VAD)-based pause feature (VAD Pause) and (2) the established acoustic feature sets ComParE and eGeMAPS. The VAD Pause feature is created by segmenting the speech into 0.03 s audio frames and using the WebRTC VAD method to classify each frame as voiced or non-voiced, yielding a binary pause sequence. WebRTC VAD models the energy distribution in different frequency sub-bands with a Gaussian mixture model (GMM) and decides whether a frame is voiced by comparing log-likelihood ratios against predefined thresholds. ComParE and eGeMAPS are extracted using the openSMILE toolkit. Five classic machine learning models (linear discriminant analysis (LDA), decision tree (DT), k-nearest neighbors (KNN), support vector machine (SVM), and tree bagger (TB)) classify AD using the individual feature sets, and an ensemble method based on majority voting combines them: the five classifiers' predictions are first merged for each 4-second segment, and the final classification of a recording is made by majority voting across all of its segments. Performance is evaluated with accuracy, precision, recall, and F1-score, and statistical analysis (two-way ANOVA with multiple comparison tests) assesses the significance of feature and classifier effects. The method is evaluated on two public English datasets (ADReSS and ADReSSo) and a local Chinese dataset.
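As an illustration of the pipeline, below is a minimal sketch of the VAD Pause extraction step, assuming preprocessing (mono 16-bit PCM at a rate WebRTC VAD accepts) has already been done; the function name and the VAD aggressiveness setting are illustrative choices, not specified by the paper.

```python
import wave

import webrtcvad  # pip install webrtcvad

FRAME_MS = 30  # 0.03 s frames, as described in the paper


def extract_vad_pause(wav_path, aggressiveness=2):
    """Return the binary pause sequence: 1 = voiced frame, 0 = pause frame."""
    with wave.open(wav_path, "rb") as wf:
        # Preprocessing (mono, 16-bit PCM, supported rate) is assumed upstream.
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        rate = wf.getframerate()
        assert rate in (8000, 16000, 32000, 48000)  # rates WebRTC VAD supports
        pcm = wf.readframes(wf.getnframes())

    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) to 3 (most aggressive)
    frame_bytes = rate * FRAME_MS // 1000 * 2  # 16-bit samples, 2 bytes each
    pause_seq = []
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[start:start + frame_bytes]
        pause_seq.append(1 if vad.is_speech(frame, rate) else 0)
    return pause_seq
```

The reference feature sets can be obtained with the openSMILE toolkit's Python bindings; the exact configuration used in the paper is not stated, so the snippet below simply shows a standard ComParE 2016 functionals extraction (swap in FeatureSet.eGeMAPSv02 for eGeMAPS).

```python
import opensmile  # pip install opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,     # or FeatureSet.eGeMAPSv02
    feature_level=opensmile.FeatureLevel.Functionals,  # one vector per file/segment
)
features = smile.process_file("speech.wav")  # pandas DataFrame of functionals
```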
Key Findings
Visual analysis of the raw waveforms and VAD Pause features showed noticeable differences between AD and non-AD subjects, with AD subjects producing more pauses and less coherent speech. Quantitative results on the ADReSS and ADReSSo datasets showed that the VAD Pause feature consistently outperformed both ComParE and eGeMAPS across several machine learning classifiers, particularly the tree bagger (TB), with improved accuracy, F1-score, precision, and recall. The ensemble method further improved classification over methods based on a single feature set, increasing accuracy by 8% on ADReSS and 5.9% on ADReSSo, and the ensemble with TB achieved the highest accuracy and stability among classifiers. The VAD Pause feature reached approximately 80% accuracy on the local Chinese dataset, demonstrating potential for cross-lingual generalization. Two-way ANOVA indicated significant effects of both features and classifiers on classification accuracy, with a highly significant interaction between them, and multiple comparison analysis showed that the ensemble method with TB, along with other combinations, achieved the best classification accuracy, significantly higher than the public feature-set-based methods.
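For concreteness, here is a minimal sketch of the two-level majority voting described in the Methodology, assuming binary per-segment predictions from the five classifiers; the tie-breaking rule (toward non-AD) is an assumption, as the paper's handling of ties is not stated.

```python
import numpy as np


def recording_decision(votes):
    """Classify one recording from its segment-level votes.

    votes: (n_classifiers, n_segments) array of 0/1 predictions (1 = AD)
    over the recording's 4-second segments.
    """
    votes = np.asarray(votes)
    # Level 1: majority across the five classifiers for each segment.
    per_segment = votes.sum(axis=0) * 2 > votes.shape[0]
    # Level 2: majority across all segments of the recording.
    # Ties break toward non-AD here (an assumption, not from the paper).
    return int(per_segment.sum() * 2 > per_segment.size)


# Example: 5 classifiers voting on 3 segments of one recording.
print(recording_decision([[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1]]))  # -> 1
```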
Discussion
The findings demonstrate the effectiveness of the proposed VAD Pause feature and ensemble method for AD detection using only acoustic features from spontaneous speech. The superior performance of VAD Pause over ComParE and eGeMAPS highlights the discriminative power of pause information in distinguishing AD from non-AD individuals, suggesting that pauses in speech reflect the underlying cognitive impairments associated with AD. The gains from the ensemble method underscore the complementary nature of the different features and classifiers. The successful application to a local Chinese dataset points to cross-lingual applicability and robustness to language variation, an advantage over methods that rely on textual transcription: removing the transcription step reduces time and cost, avoids errors introduced by automated transcription, and benefits non-native speakers and individuals with diverse accents. Together, these properties suggest the method could support efficient, accessible AD screening in clinical settings or home testing scenarios, strengthening its clinical utility.
Conclusion
This study introduced a novel VAD Pause feature for AD detection using only acoustic features from spontaneous speech. The proposed method, incorporating an ensemble of classifiers, demonstrated superior performance compared to methods using established feature sets (ComParE and eGeMAPS) on both public and local datasets. The method's accuracy and robustness to language variations suggest its potential as a valuable tool for efficient and accessible AD screening. Future research will focus on further enhancing the method's accuracy by combining the VAD Pause feature with other acoustic embeddings and exploring the use of deep learning models. The method's potential application to detecting other speech disorders should also be investigated.
Limitations
The study's limitations include the relatively small size of the local Chinese dataset, which could affect the generalizability of the findings for Chinese populations. Although the results suggest cross-lingual generalizability, a larger and more diverse dataset across various languages is needed to rigorously validate this claim. The study mainly focused on five classic machine learning models and did not explore deep learning-based classifiers, which might yield further improvements in accuracy. Future work could include incorporating deep learning models and applying the approach to a broader range of cognitive impairments.