Introduction
Parkinson's disease (PD), the second most common neurodegenerative disorder, presents significant challenges due to its insidious onset and frequently delayed diagnosis. Early detection is crucial for mitigating disease progression, and several "soft signs," including changes in facial expression (hypomimia) and voice, often precede the classical motor symptoms. Existing digital screening approaches often rely on expensive and inconvenient wearable sensors. This study proposes a more accessible and affordable alternative: capturing voice and facial expressions with readily available smartphones and analyzing the extracted features with machine learning algorithms to identify early-stage PD. The primary objective is to develop and validate a machine learning model capable of distinguishing early-stage PD patients from age- and sex-matched healthy controls using integrated biometric features derived from voice and facial expression analysis. The convenience and affordability of a smartphone camera and microphone make this approach attractive for population-level screening, particularly in areas with limited access to specialized neurological care.
Literature Review
Existing research highlights the potential of both voice and facial expression analysis for detecting early-stage PD. Studies have documented changes in speech and facial bradykinesia (reduced facial movement) years before clinical diagnosis. While wearable sensors provide reliable data, their cost and the need for active user participation limit widespread adoption. Smartphone-based applications exist but focus primarily on motor features such as arm swing, which are not specific to PD. Previous work has explored digital biomarkers, including sensor-derived motor features and facial expression analysis, for PD diagnosis, yet an approach integrating both voice and facial features remains relatively unexplored. A multimodal approach combining the two modalities, processed through machine learning, could offer a more sensitive and specific screening tool; this gap motivates the development and validation of the proposed integrated model.
Methodology
This study enrolled 371 participants recruited at National Taiwan University Hospital: 186 patients with PD and 185 age- and sex-matched healthy controls. PD was diagnosed according to the UK Parkinson's Disease Society Brain Bank criteria. Each participant was recorded with an iPhone 8 Plus, capturing voice and facial expression simultaneously while reading a 500-word article. Facial landmarks were extracted with Google MediaPipe Face Mesh and summarized into features including eye blinking, mouth movement variance, and mouth angle variance; voice features included reading time, phonetic score, pause percentage, volume variance, and pitch variance. Participants were split into a training cohort (112 PD patients recorded in the "on" phase; 111 controls) and a validation cohort (74 PD patients recorded in the "off" phase or drug-naïve; 74 controls). Nine machine learning classifiers (C4.5 decision tree, k-nearest neighbors, support vector machine, naïve Bayes, random forest, logistic regression, gradient boosting machine, AdaBoost, and Light Gradient Boosting Machine) were trained with sequential forward feature selection, and performance was evaluated with AUROC, accuracy, precision, recall, and F1-score.
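The study's extraction code is not reproduced here; purely as an illustration, the following sketch shows how the named voice features might be computed with librosa. The sampling rate, silence threshold, and pitch range are assumptions, and "phonetic score" is omitted because its definition is not specified in this summary.

```python
# Hypothetical reconstruction of the voice features described above.
# All thresholds and parameters are illustrative assumptions, not values
# taken from the study.
import librosa
import numpy as np

def voice_features(wav_path, silence_db=-40.0):
    y, sr = librosa.load(wav_path, sr=16000)   # assumed sampling rate
    reading_time = len(y) / sr                  # total reading duration (s)

    # Frame-level loudness; frames quieter than the threshold count as pauses.
    rms = librosa.feature.rms(y=y)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    pause_pct = float(np.mean(db < silence_db)) * 100

    # Fundamental frequency via pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    return {
        "reading_time": reading_time,
        "pause_pct": pause_pct,
        "volume_var": float(np.var(rms)),
        "pitch_var": float(np.nanvar(f0)),
    }
```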
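Likewise, the nine-classifier comparison with sequential forward feature selection can be sketched with scikit-learn and LightGBM. The feature names and data-loading step are placeholders mirroring the paper's description, and scikit-learn's entropy-based decision tree stands in for C4.5, which it approximates rather than reproduces.

```python
# Minimal sketch of the classifier comparison, assuming a tabular dataset
# with one row per participant and a binary "is_pd" label.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical column names echoing the features listed above.
FACIAL = ["eye_blink_rate", "mouth_movement_var", "mouth_angle_var"]
VOICE = ["reading_time", "phonetic_score", "pause_pct",
         "volume_var", "pitch_var"]
FEATURES = FACIAL + VOICE

CLASSIFIERS = {
    # Entropy-based CART as an approximation of C4.5.
    "C4.5 (approx.)": DecisionTreeClassifier(criterion="entropy"),
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),   # probability=True enables AUROC
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "GBM": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "LightGBM": LGBMClassifier(),
}

def evaluate(train_df, valid_df, n_features=5):
    """Sequential forward feature selection, then held-out evaluation."""
    X_tr, y_tr = train_df[FEATURES], train_df["is_pd"]
    X_va, y_va = valid_df[FEATURES], valid_df["is_pd"]
    for name, clf in CLASSIFIERS.items():
        model = make_pipeline(StandardScaler(), clf)
        sfs = SequentialFeatureSelector(
            model, n_features_to_select=n_features, direction="forward")
        sfs.fit(X_tr, y_tr)
        Xt, Xv = sfs.transform(X_tr), sfs.transform(X_va)
        model.fit(Xt, y_tr)
        prob = model.predict_proba(Xv)[:, 1]
        pred = model.predict(Xv)
        print(f"{name:20s} AUROC={roc_auc_score(y_va, prob):.2f} "
              f"acc={accuracy_score(y_va, pred):.2f} "
              f"P={precision_score(y_va, pred):.2f} "
              f"R={recall_score(y_va, pred):.2f} "
              f"F1={f1_score(y_va, pred):.2f}")
```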
Key Findings
The integrated analysis of voice and facial expression features discriminated significantly between PD patients and controls. In the training cohort, the combined model reached AUROC values of 0.85 with logistic regression and 0.84 with random forest for distinguishing all PD patients from controls; for early-stage PD patients specifically, the AdaBoost classifier achieved an AUROC of 0.84. The validation cohort, comprising patients in the "off" phase or drug-naïve, showed even stronger performance: a random forest classifier reached an AUROC of 0.90 for discriminating all PD patients from controls, and the AdaBoost classifier again performed well for early-stage patients. Either modality alone was less accurate than the integrated model; facial features alone yielded an AUROC of only 0.69, underscoring the synergistic effect of combining voice and facial data. Notably, eye blinking emerged as the key facial feature for differentiating PD patients from controls.
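To make the blink feature concrete: one common way to derive blink counts from MediaPipe Face Mesh landmarks is the eye aspect ratio (EAR). The sketch below illustrates that approach and is not the authors' published pipeline; the landmark indices and the 0.2 threshold are conventional choices assumed here for illustration.

```python
# Illustrative blink counting from MediaPipe Face Mesh landmarks via the
# eye aspect ratio (EAR); indices and threshold are common conventions,
# not values taken from the study.
import cv2
import mediapipe as mp
import numpy as np

# Face Mesh indices often used for the right eye's EAR:
# [outer corner, upper lid x2, inner corner, lower lid x2].
RIGHT_EYE = [33, 160, 158, 133, 153, 144]

def eye_aspect_ratio(pts):
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|); small values = closed eye."""
    p1, p2, p3, p4, p5, p6 = pts
    return (np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)) / (
        2.0 * np.linalg.norm(p1 - p4))

def count_blinks(video_path, threshold=0.2):
    mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1)
    cap = cv2.VideoCapture(video_path)
    blinks, closed = 0, False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        res = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not res.multi_face_landmarks:
            continue
        lm = res.multi_face_landmarks[0].landmark
        # Normalized image coordinates; adequate for a ratio-based feature.
        pts = [np.array([lm[i].x, lm[i].y]) for i in RIGHT_EYE]
        if eye_aspect_ratio(pts) < threshold:
            if not closed:               # falling edge = one blink
                blinks, closed = blinks + 1, True
        else:
            closed = False
    cap.release()
    return blinks
```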
Discussion
The high AUROC values in both the training and validation cohorts demonstrate the potential of this integrated approach for early PD detection. The model remained robust when applied to drug-naïve patients and those in the "off" phase, a setting that mimics real-world screening, where individuals are unlikely to be on dopaminergic medication. The findings support previous research on the value of voice and facial features in PD diagnosis: although each modality alone showed some diagnostic power, integrating them significantly enhanced sensitivity and specificity. This complementarity suggests that voice and facial expression analysis capture distinct, non-redundant information, and that combining multiple data sources improves the accuracy of early PD detection.
Conclusion
This study demonstrates the potential of integrating voice and facial expression analysis, processed via machine learning, for early PD detection. The high AUROC values across both training and validation cohorts, notably in drug-naïve and "off"-phase patients, support the robustness and clinical relevance of the approach. Future work should pursue longitudinal studies with larger, more diverse cohorts to refine the model and to assess its generalizability across populations and languages. Expanding the feature set and exploring deep learning architectures may further improve diagnostic accuracy.
Limitations
Several limitations warrant consideration. The study did not account for jaw or voice tremor, which may affect the voice analysis. No correlation between speech/facial features and limb motor impairment was established. Patients with depression, a condition that can itself blunt facial expression and alter voice, were excluded, so depression's potential confounding effect remains unexamined. Finally, the absence of longitudinal data prevents evaluation of the model's ability to track disease progression.