Medicine and Health
An integrated biometric voice and facial features for early detection of Parkinson’s disease
W. S. Lim, S. Chiu, et al.
Parkinson’s disease (PD) is a prevalent neurodegenerative disorder causing motor and non-motor disability, with cases expected to rise substantially. Although dopaminergic therapies alleviate some motor symptoms, disease progression continues, underscoring the need to identify individuals in premonitory or early stages to enable timely interventions. Several soft signs can precede classical motor dysfunction, including reduced facial expressivity (hypomimia), voice changes (reduced volume and tempo, frequent pauses, shortened speech), and altered gait. These subtle features are often misattributed to normal aging, delaying diagnosis. Linguistic changes can occur up to five years before PD diagnosis, and hypomimia is considered a sensitive PD characteristic. Existing digital biomarkers often rely on wearable sensors focused on motor features, which may be costly, require active participation, and lack specificity. In contrast, voice and facial expression analysis via smartphones or webcams is accessible and suitable for remote assessment. Because single-domain modalities may lack PD specificity, integrating multimodal biometric features with machine learning could improve early-stage detection. This study aimed to develop and evaluate a machine learning model that integrates voice and facial expression features to distinguish early-stage PD patients from age- and sex-matched healthy controls.
Prior work on digital biomarkers for PD has focused largely on motor assessments using wearable sensors, which, although reliable, can be expensive and burdensome for large-scale use. Studies have explored facial expression analysis, sometimes using task-based expressions (e.g., smiling, disgust, surprise), and achieved high accuracies in controlled settings. Speech-based biomarkers have shown promise, with AUROC values of around 0.86–0.93 for discriminating PD from controls, and machine learning approaches (e.g., SVM) have shown moderate accuracy in prodromal cohorts (e.g., REM sleep behavior disorder, RBD). However, many prior studies used constrained tasks or laboratory environments. The present work addresses these gaps by analyzing natural facial micro-expressions during a reading task and voice features recorded via smartphone in realistic acoustic environments, and by integrating the two modalities to potentially enhance diagnostic performance, particularly for early-stage PD.
Study design and participants: A total of 371 participants (186 PD patients and 185 controls) were recruited at National Taiwan University Hospital. PD was diagnosed according to the UK PD Society Brain Bank Clinical Diagnostic Criteria; controls were neurocognitively unaffected spouses or acquaintances. Participants underwent neurological and psychiatric evaluations; those who were illiterate or had inadequate visual acuity, significant depressive symptoms, or other exclusion criteria were excluded. Motor severity was evaluated with MDS-UPDRS Part III and Hoehn–Yahr (HY) staging; early-stage PD was defined as HY < 3 and advanced-stage as HY ≥ 3. All participants provided informed consent, and institutional ethics approval was obtained.
Cohorts: Data were split into a training set (112 PD patients recorded during the “on” phase and 111 controls) and an independent validation set (74 PD patients recorded during the “off” phase or drug-naïve and 74 controls). The “on” phase was defined as within 3 h after the last dopaminergic dose and the “off” phase as >12 h after the last dose. The rationale was that models trained on the more subtle “on”-phase differences should generalize to “off”/drug-naïve states and real-world screening.
Data acquisition: Participants read an approximately 500-word article while being recorded on a smartphone (iPhone 8 Plus) positioned ~35 cm from the face, capturing video and audio simultaneously; video was also recorded with an SJCAM SJ4000 camera. Voice was recorded as linear PCM (.wav) at 44.1 kHz/24-bit and then converted to 44.1 kHz/16-bit.
Facial feature extraction: Facial landmarks were extracted with Google MediaPipe Face Mesh. Automatically computed features (both sides where applicable) included:
- eye blinking via the eye aspect ratio (EAR), with thresholds at 30/50/70/90% of the mean EAR (total blink time over frames)
- mouth-to-eye distance variance (right/left)
- mouth height variance
- mouth width movement variance
- mouth angle variance (the angle between the horizontal mouth-width and vertical mouth-height lines)
- peri-oral area movement variance (right/left), based on distances from the nose center to six peri-oral/jaw landmarks
Voice feature extraction: Six features capturing timing, volume, and pitch were computed:
- reading time (start to finish)
- phonetic score (articulatory model; smiralab.org)
- pause percentage (proportion of 25 ms frames below a participant-specific volume threshold)
- volume variance (difference between the average volume of the first and second halves of the recording, normalized by the overall average volume)
- average pitch, estimated via the zero-crossing rate computed over 10 ms frames
- pitch variance, defined as the average absolute frame-to-frame change in zero-crossing counts
Feature selection and modeling: Sequential forward selection (SFS) identified salient features for each classifier by optimizing its performance. Nine classifiers were evaluated: C4.5 decision tree, k-nearest neighbors, support vector machine, naïve Bayes, random forest, logistic regression, gradient boosting machine, AdaBoost, and LightGBM. Performance metrics included accuracy, precision, recall, F1-score, and AUROC. Ten-fold cross-validation was used during training for objective performance estimation. ROC analyses summarized diagnostic performance; the features selected for each classifier are reported in the supplementary figures/tables.
Statistical analysis: Continuous variables are presented as mean ± SD and categorical variables as counts/percentages. Homogeneity of variances was tested with Levene’s test. Group comparisons used two-tailed t tests or ANOVA for normally distributed data and the Kruskal–Wallis test for non-normal data. AUROC with 95% CIs quantified diagnostic performance. Analyses used SAS 9.4 and GraphPad Prism 9, with significance set at p < 0.05.
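To make the facial pipeline concrete, the sketch below (Python; a minimal illustration, not the authors' code) shows how per-frame eye aspect ratio (EAR) values could be computed from MediaPipe Face Mesh landmarks and thresholded against a fraction of the participant's mean EAR. The landmark indices, the single threshold_frac parameter (standing in for the 30/50/70/90% thresholds), and the definition of blink time as a fraction of frames are assumptions, not details taken from the paper.

```python
# Minimal sketch: EAR-based blink timing from MediaPipe Face Mesh (not the authors' code).
# Landmark indices are those commonly used for EAR with Face Mesh (assumption).
import cv2
import numpy as np
import mediapipe as mp

LEFT_EYE = [362, 385, 387, 263, 373, 380]   # corner, upper lid x2, corner, lower lid x2 (assumed)
RIGHT_EYE = [33, 160, 158, 133, 153, 144]

def ear(pts):
    # EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|); pts[0] and pts[3] are the eye corners
    v1 = np.linalg.norm(pts[1] - pts[5])
    v2 = np.linalg.norm(pts[2] - pts[4])
    h = np.linalg.norm(pts[0] - pts[3])
    return (v1 + v2) / (2.0 * h)

def blink_time_fraction(video_path, threshold_frac=0.7):
    """Fraction of frames whose mean EAR falls below threshold_frac * the participant's mean EAR."""
    mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
    cap = cv2.VideoCapture(video_path)
    ears = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        res = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if not res.multi_face_landmarks:
            continue  # skip frames where no face is detected
        lm = res.multi_face_landmarks[0].landmark
        h, w = frame.shape[:2]
        get = lambda idx: np.array([[lm[i].x * w, lm[i].y * h] for i in idx])
        ears.append((ear(get(LEFT_EYE)) + ear(get(RIGHT_EYE))) / 2.0)
    cap.release()
    ears = np.asarray(ears)
    return float(np.mean(ears < threshold_frac * ears.mean()))
```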
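Likewise, a minimal sketch of the described acoustic measures, assuming access to the converted 16-bit .wav recordings: pause percentage over 25 ms frames, the halves-based volume variance, and a zero-crossing-rate pitch proxy over 10 ms frames. The frame RMS energy measure, the pause_frac threshold, and the reading-time definition are illustrative assumptions, and the phonetic score (articulatory model) is omitted.

```python
# Minimal sketch of the described voice features (illustrative, not the authors' implementation).
import numpy as np
import soundfile as sf  # assumes the 44.1 kHz/16-bit PCM .wav described in the methods

def frame_view(x, frame_len):
    n = len(x) // frame_len
    return x[: n * frame_len].reshape(n, frame_len)

def voice_features(wav_path, pause_frac=0.1):
    y, sr = sf.read(wav_path)
    if y.ndim > 1:
        y = y.mean(axis=1)  # mix down to mono if needed

    reading_time = len(y) / sr  # start-to-finish duration of the recording, in seconds

    # Pause percentage: 25 ms frames whose volume falls below a participant-specific
    # threshold; here the threshold is a fraction of the speaker's mean frame RMS (assumption).
    frames_25 = frame_view(y, int(0.025 * sr))
    rms = np.sqrt((frames_25 ** 2).mean(axis=1))
    pause_pct = float(np.mean(rms < pause_frac * rms.mean()))

    # Volume variance: difference between the average volume of the first and second
    # halves of the recording, normalized by the overall average volume.
    half = len(rms) // 2
    vol_var = float(abs(rms[:half].mean() - rms[half:].mean()) / rms.mean())

    # Pitch proxy: zero-crossing counts per 10 ms frame; pitch variance is the mean
    # absolute frame-to-frame change in those counts.
    frames_10 = frame_view(y, int(0.010 * sr))
    signs = np.signbit(frames_10).astype(np.int8)
    zc = np.abs(np.diff(signs, axis=1)).sum(axis=1)
    avg_pitch = float(zc.mean())
    pitch_var = float(np.mean(np.abs(np.diff(zc))))

    return dict(reading_time=reading_time, pause_pct=pause_pct,
                vol_var=vol_var, avg_pitch=avg_pitch, pitch_var=pitch_var)
```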
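For the feature selection and modeling step, a scikit-learn sketch of sequential forward selection with ten-fold cross-validated AUROC might look like the following. The two estimators shown, the number of selected features, and the standardization step are illustrative choices (the paper evaluated nine classifiers), and the feature matrix X (facial + voice features, optionally age/sex) and labels y are assumed to have been assembled as described above.

```python
# Minimal sketch: sequential forward selection + 10-fold cross-validated AUROC
# (illustrative; only two of the nine classifiers are shown).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(estimator, X, y, n_features=5):
    """Forward-select features for this estimator, then report mean and SD of 10-fold AUROC."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    sfs = SequentialFeatureSelector(estimator, n_features_to_select=n_features,
                                    direction="forward", scoring="roc_auc", cv=cv)
    model = make_pipeline(StandardScaler(), sfs, estimator)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()

# X: rows = participants, columns = facial + voice (± age/sex) features; y: 1 = PD, 0 = control
# mean_auc, sd_auc = evaluate(RandomForestClassifier(n_estimators=300, random_state=0), X, y)
# mean_auc, sd_auc = evaluate(LogisticRegression(max_iter=1000), X, y)
```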
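Finally, the AUROC-with-95%-CI summary on the independent validation cohort could be approximated with a bootstrap, as sketched below. The paper reports its analyses in SAS 9.4 and GraphPad Prism 9, so this Python version is only an assumed stand-in, and the bootstrap approach itself is an assumption rather than the reported CI method.

```python
# Minimal sketch: AUROC on the independent validation cohort with a bootstrap 95% CI
# (assumed stand-in for the SAS/Prism analyses reported in the paper).
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_with_ci(y_true, y_score, n_boot=2000, seed=0):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y_true, y_score)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample participants with replacement
        if len(np.unique(y_true[idx])) < 2:               # need both classes in the resample
            continue
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return point, (lo, hi)

# Example: y_val = 1 for "off"-phase/drug-naive PD, 0 for controls;
# p_val = model.predict_proba(X_val)[:, 1] from the model trained on the "on"-phase cohort.
# auc, (lo, hi) = auroc_with_ci(y_val, p_val)
```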
- Integrated facial and voice features discriminated early-stage PD from controls with AUROC 0.85 in the training dataset.
- In the independent validation cohort (patients in “off” phase/drug-naïve vs. controls), the optimal diagnostic value was AUROC 0.90 using a random forest-based model trained on combined features.
- Voice-only models: AdaBoost achieved AUROC 0.86 (training), indicating strong diagnostic value from acoustic features alone.
- Combined features (voice + facial) without demographics: logistic regression achieved AUROC 0.84 (training).
- With integrated selected features (voice + facial ± age/sex), logistic regression and random forest achieved AUROC 0.85 and 0.84, respectively, for distinguishing all PD patients from controls (training).
- Early-stage PD (“on” phase) vs. matched controls (training): AdaBoost achieved AUROC 0.84.
- Facial-only analysis: a random forest classifier achieved AUROC 0.69 for all PD vs. controls, indicating lower discriminative power of natural (non-posed) facial movements alone during the reading task.
- Acoustic differences observed in PD vs. controls included longer reading times, higher pause percentage, and reduced pitch and volume variance.
- Sample sizes: total N=371 (PD=186; controls=185). Training: 112 PD (on) + 111 controls. Validation: 74 PD (off/drug-naïve) + 74 controls.
The study addressed whether multimodal biometric analysis of naturalistic facial expressions and voice could aid early PD identification. The results demonstrate that integrating facial and acoustic features with machine-learning classifiers can effectively distinguish PD, including early-stage cases, from controls. While facial-only features from natural reading yielded modest discrimination (AUROC 0.69), integration with voice features substantially improved performance, highlighting the complementarity of the two modalities. Eye blinking emerged as a key facial marker, whereas mouth-related movement variances showed limited group differences during reading. Voice features, particularly temporal and prosodic characteristics such as pauses, pitch variance, and volume variance, contributed strongly to detection, consistent with known PD-related dysprosody. Notably, models trained on the subtler “on”-phase differences generalized well to an independent “off”/drug-naïve cohort (AUROC up to 0.90), supporting potential utility for community screening. Compared with prior studies using controlled facial tasks or laboratory-based speech protocols, this work shows that smartphone-based recordings under realistic conditions, combined with feature selection and diverse classifiers, can yield robust discrimination, suggesting feasibility for remote, scalable assessment.
Integrated biometric analysis of voice and facial expressions, coupled with machine-learning classifiers, can assist in distinguishing early-stage Parkinson’s disease from age- and sex-matched controls. The approach achieved strong diagnostic performance in both the training and independent validation cohorts, supporting feasibility for real-world, smartphone-based screening. Future work should include longitudinal, prospective studies across larger and more diverse populations and languages, incorporate patients with comorbid conditions (e.g., depression), and explore integration with complementary wearable-sensor biomarkers (e.g., gait, REM sleep behavior) into a multimodal platform for early detection and disease monitoring.
- Patients with jaw or voice tremor were not subgrouped; their inclusion may have affected speech analyses.
- No correlation analysis was performed between speech/facial features and limb motor difficulties; relationships with motor symptom severity and progression therefore remain unclear.
- PD patients with depression were excluded to avoid confounding acoustic effects; however, depression is common in PD, limiting generalizability. Future studies should include cohorts with and without depression.
- No serial recordings were performed; day-to-day variability in vocal and facial features may have influenced results and cannot be assessed.
- Facial analysis relied on natural expressions during reading rather than task-based expressions, potentially reducing facial-only discriminability.