Automatic speech-based assessment to discriminate Parkinson's disease from essential tremor with a cross-language approach
C. D. Rios-Urrego, J. Rusz, et al.
Essential tremor (ET) and Parkinson’s disease (PD) are common movement disorders with overlapping clinical features, leading to frequent misdiagnosis, particularly in early stages. PD is associated with hypokinetic dysarthria, while ET typically presents with hyperkinetic dysarthria driven by tremor. Despite these distinct speech manifestations, speech-based differentiation between PD and ET had not previously been established, and cross-language variability complicates the development of universal speech assessment tools. The authors propose that speech signals contain discriminative information enabling automatic differentiation between PD and ET, and they hypothesize that language-robust models can be built via domain adaptation using Gaussian mixture model–universal background model (GMM-UBM) techniques trained in one language and adapted to another. The study aims to: (1) determine how effectively speech-based models can distinguish PD from ET; (2) identify which speech dimensions (articulation, phonation, prosody) most strongly differ between these pathologies; and (3) assess which training language(s) yield greater compatibility for cross-language adaptation.
Prior differential diagnosis studies between PD and ET have relied on modalities such as neurological video examinations, hand tremor signals, gait, EMG, handwriting, and imaging, demonstrating feasibility but not utilizing speech. Speech-based differentiation between hypokinetic (PD) and hyperkinetic (e.g., Huntington’s disease) dysarthrias has been shown, yet PD vs. ET via speech had not been investigated. Cross-language research indicates challenges due to phonetic variability and recording standards, but some findings suggest language differences may not critically impact assessment of disease phenotypes. Transfer learning and multi-language strategies have been explored in PD speech analysis, motivating cross-language domain adaptation to build robust, generalizable models for pathology classification and monitoring.
Study design: Two classification settings using Czech speech as the target domain: (A) binary classification of PD vs. ET; (B) tri-class classification of PD vs. ET vs. healthy controls (HC). Each speaker performed two tasks: a spontaneous monologue and diadochokinetic /pa-ta-ka/ repetitions.
Participants (Czech target set): 50 PD (20 females), 50 ET (20 females), and 50 HC (20 females); all native Czech speakers. PD patients were recorded ON medication. Clinical scores: PD UPDRS-III mean 20.1; ET TETRAS mean 34.6. Groups were balanced in age and gender. Average task durations: /pa-ta-ka/ ≈7–8 s; monologue ≈117–150 s.
Recordings: made in a quiet room using a head-mounted condenser microphone (Beyerdynamic Opus 55) placed ~5 cm from the corner of the mouth; downsampled to 16 kHz; dysarthria recording guidelines were followed.
Training corpora for UBM (source domains):
- Parkinson’s corpora: PC-GITA (Spanish; 50 PD, 50 HC) and a German PD corpus (88 PD, 88 HC).
- Additional large speech corpora: CIEMPIESS (Mexican Spanish; ~17 h; 141 speakers) and Verbmobil (German; ~29 h; 586 speakers).
UBMs were trained using HC speech only (patient-inclusive UBMs yielded worse, unstable performance).
Speech features (78 total) extracted with DisVoice after DC offset removal and amplitude normalization:
- Articulation (58 features): derived from unvoiced-to-voiced onset transitions segmented via F0 presence (Praat). For each onset, 80 ms windows (±40 ms). Features include 22 Bark-band energies, 12 MFCCs and their first and second derivatives; computed on 40 ms frames with 20 ms shift within onset segments.
- Phonation (7 features): computed over voiced segments: first and second derivatives of F0, jitter, shimmer, APQ, PPQ, and log energy per frame.
- Prosody (13 features): duration of each voiced segment; coefficients of 5th-order polynomials modeling the F0 contour and the energy contour (Lagrange formulation for energy).
Modeling via GMM-UBM and MAP adaptation:
- For each speech dimension (articulation D=58, phonation D=7, prosody D=13), train UBMs on source-language HC data (Spanish, German, or combined; also CIEMPIESS and/or Verbmobil). Number of Gaussians M ∈ {2,4,8,16,32,64,128} tuned per experiment.
- Adapt speaker-specific GMMs for Czech recordings using MAP adaptation from the chosen UBM. Responsibilities Pr(i|x) computed; sufficient statistics n_i, E_i(x), E_i(x^2) estimated; parameters updated with adaptive coefficient α = r/(r+n_i), r=16, ensuring mixture weights sum to 1.
Supervectors and fusion:
- For each adapted GMM, construct a supervector by stacking the means and diagonal covariances from all mixture components. Create one supervector per speech dimension; perform early fusion by concatenating articulation, phonation, and prosody supervectors. Optionally apply PCA to the fused supervector, retaining 90% cumulative variance.
Classification and evaluation:
- SVM classifier with RBF kernel; grid search over C ∈ {0.001, 0.005, 0.01, …, 100, 500, 1000} and γ ∈ {0.0001, 0.001, …, 1000}. Hyperparameters selected as the mode over repetitions.
- Stratified 10-fold cross-validation repeated 10 times. Primary metric: accuracy; sensitivity and specificity reported for best cases.
- Binary setting: PD vs. ET. Tri-class setting: one-vs-rest SVM for PD vs. ET vs. HC.
- Additional analyses: confusion matrices for best tri-class models; LDA visualization of feature spaces; Spearman correlations between LDA components and age or clinical severity (UPDRS-III, TETRAS).
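The MAP adaptation and supervector steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`log_gauss_diag`, `map_adapt`, `supervector`) are hypothetical, the UBM parameters and frames are random placeholders rather than real speech features, and α = r/(r+n_i) is read as the weight kept on the UBM prior (the reading under which components that see little data stay close to the UBM), so the data statistics receive 1 − α.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of every frame under every diagonal Gaussian; shape (T, M)."""
    diff = x[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_norm = np.sum(np.log(2 * np.pi * variances), axis=1)  # (M,)
    maha = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, M)
    return -0.5 * (log_norm[None, :] + maha)

def map_adapt(x, weights, means, variances, r=16.0):
    """MAP-adapt UBM weights, means and diagonal variances to the frames x (T, D)."""
    T = x.shape[0]
    # Responsibilities Pr(i | x_t), computed stably in the log domain.
    log_p = np.log(weights)[None, :] + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)
    # Sufficient statistics n_i, E_i(x), E_i(x^2).
    n = resp.sum(axis=0)
    safe_n = np.maximum(n, 1e-10)[:, None]
    ex = resp.T @ x / safe_n
    ex2 = resp.T @ (x ** 2) / safe_n
    # Assumption: alpha weights the UBM prior, (1 - alpha) weights the data.
    alpha = r / (r + n)
    a = alpha[:, None]
    new_w = (1 - alpha) * n / T + alpha * weights
    new_w /= new_w.sum()                                      # weights sum to 1
    new_mu = (1 - a) * ex + a * means
    new_var = (1 - a) * ex2 + a * (variances + means ** 2) - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, 1e-6)

def supervector(means, variances):
    """Stack adapted means and diagonal variances into one vector."""
    return np.concatenate([means.ravel(), variances.ravel()])

# Placeholder UBM with M=4 components over D=7 (phonation-sized) features.
rng = np.random.default_rng(0)
M, D = 4, 7
ubm_w = np.full(M, 1.0 / M)
ubm_mu = rng.normal(size=(M, D))
ubm_var = np.ones((M, D))
frames = rng.normal(loc=0.5, size=(200, D))   # synthetic stand-in for one speaker

w, mu, var = map_adapt(frames, ubm_w, ubm_mu, ubm_var)
sv = supervector(mu, var)                     # length 2 * M * D
```

Early fusion would then amount to concatenating the articulation, phonation, and prosody supervectors of the same speaker (for example with `np.concatenate`) before passing the result to the classifier.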
- Binary PD vs. ET (Czech target):
- Best overall: /pa-ta-ka/ task, fusion of articulation+phonation+prosody, adapted from UBM trained on German controls: accuracy 86.2%, sensitivity 87.6%, specificity 84.8%.
- Monologue task, fusion with Verbmobil-derived UBM: accuracy 81.4%, sensitivity 83.2%, specificity 79.6%.
- Dimension-wise: for the monologue, articulation performed best on average (≈72.3%); for /pa-ta-ka/, prosody performed best on average (≈78.3%). Errors were balanced between classes.
- Tri-class PD vs. ET vs. HC (Czech target):
- /pa-ta-ka/ task: best with prosody-only model adapted from combined German+Spanish UBM: accuracy 71.6%.
- Monologue task: best with fusion of three dimensions adapted from Verbmobil UBM: accuracy 63.3%.
- Confusion patterns: the /pa-ta-ka/ prosody model correctly classified 96% of ET (48/50), 64% of HC (misclassified HCs tended to be labeled PD), and 50% of PD. The monologue fusion model misclassified 34% of PD as HC.
- UBM composition effects:
- UBMs trained on German (Verbmobil or German controls) generally yielded the highest performance, suggesting linguistic proximity between Czech and German improves adaptation; combining German and Spanish did not consistently improve results.
- UBMs trained including patient data decreased performance, likely due to high variability and limited sample size.
- Task effects:
- The controlled diadochokinetic /pa-ta-ka/ task achieved ≈5 percentage points higher accuracy than the spontaneous monologue for PD vs. ET (86.2% vs. 81.4%), yet the monologue still performed strongly, supporting the feasibility of unobtrusive assessment.
- Feature dimension insights:
- Monologue: articulation most discriminative, aligning with PD hypokinetic rigidity affecting voiced/unvoiced transitions.
- /pa-ta-ka/: prosody most discriminative, capturing ET tremor-related pitch, loudness, and timing variability.
- Additional analyses:
- No significant differences in UPDRS-III between correctly vs. incorrectly classified PD in either task (Mann–Whitney p=0.374 monologue; p=0.203 /pa-ta-ka/).
- Spearman correlations showed no or very weak associations between LDA components and age or severity; age well-balanced, reducing confounding.
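As a quick consistency check on the binary results reported above: with equally sized groups (50 PD, 50 ET), overall accuracy is the mean of sensitivity and specificity. The short sketch below recomputes accuracy from the reported rates; the helper name `accuracy_from_rates` is illustrative, and which group is treated as the positive class is an assumption that does not affect the result when the groups are the same size.

```python
def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy (%) implied by per-class rates (%) and class sizes."""
    correct = sensitivity / 100 * n_pos + specificity / 100 * n_neg
    return 100 * correct / (n_pos + n_neg)

# Reported best binary results (PD vs. ET, 50 speakers per group).
pataka = accuracy_from_rates(87.6, 84.8, n_pos=50, n_neg=50)
monologue = accuracy_from_rates(83.2, 79.6, n_pos=50, n_neg=50)
gap = pataka - monologue

print(f"/pa-ta-ka/: {pataka:.1f}%  monologue: {monologue:.1f}%  gap: {gap:.1f} points")
```

Both recomputed values match the reported accuracies (86.2% and 81.4%), and the difference between tasks is about 4.8 percentage points.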
The study demonstrates that automated speech analysis with GMM-UBM domain adaptation can effectively differentiate PD from ET across languages, addressing the clinical challenge posed by overlapping motor symptoms. High accuracies (86% /pa-ta-ka/, 81% monologue) indicate that articulation, phonation, and prosody convey complementary information about hypokinetic and hyperkinetic dysarthria. Articulation features in spontaneous speech reflect PD-related rigidity and impaired initiation of voicing, while prosody in the controlled /pa-ta-ka/ task captures ET’s tremor-driven fluctuations in pitch, loudness, and timing, enabling strong ET discrimination. Inclusion of healthy controls revealed substantial overlap between PD and aging-related voice changes, consistent with PD-to-HC misclassification, whereas ET’s prosodic abnormalities remained highly specific. Cross-language adaptation favored German UBMs, plausibly due to closer linguistic distance to Czech, suggesting that language similarity may facilitate model transfer for spontaneous speech; for language-independent tasks (/pa-ta-ka/), combining languages can yield more generalizable UBMs. The framework appears robust to differences in recording setups between training and target data, although evaluation data were recorded under high-quality conditions. Overall, findings support speech-based, cross-language tools as viable adjuncts for differential diagnosis and monitoring of movement disorders.
The authors developed a cross-language speech assessment framework using GMM-UBM adaptation and SVM classification across three speech dimensions (articulation, phonation, prosody) to distinguish PD, ET, and HC. Best binary PD vs. ET performance reached 86.2% accuracy (/pa-ta-ka/, fused dimensions) and 81.4% (monologue, fused). In tri-class classification, the best results were 71.6% accuracy with prosody-only on /pa-ta-ka/ and 63.3% with fused dimensions on monologue. Articulation and prosody emerged as key biomarkers: articulation modeling rigidity and voicing transitions, and prosody modeling tremor-induced pitch, energy, and timing variability. German-derived UBMs performed best for Czech adaptation, highlighting language similarity effects. Future work should extend to earlier disease stages, explore deep learning and transfer learning across languages and dysarthria types, consider normalization of F0 in semitone scale, and systematically evaluate robustness under varied target recording conditions.
- UBMs trained including patient data performed poorly, likely due to high variability and limited sample size; lack of non-Czech ET corpora precluded training a pathology-mixed base model without compromising balance.
- Target evaluation recordings were collected with high-quality equipment in low-noise settings; generalizability to noisier, lower-quality, or variable microphone conditions remains untested.
- Cognitive status and education were not specifically assessed; while the focus was motor speech, cognitive impairment could influence certain timing features (notably in PD), though ET timing abnormalities are disease-driven.
- ET patients’ symptom duration was substantially longer than PD on average; however, no significant correlation between ET speech scores and symptom duration was found, and group perceptual speech impairment was comparable.
- Most features were normalized (DC removal, amplitude normalization, Z-score), but F0 was not converted to a semitone scale; future normalization could further mitigate inter-speaker physical differences.
- Optimal performance depended on UBM language; combining languages did not always improve results, and data volume alone was less influential than language suitability.
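The semitone normalization of F0 mentioned among the limitations uses the standard conversion st = 12 · log2(f / f_ref). The sketch below is one possible way to apply it, not the authors' procedure: the function names are hypothetical, the choice of the speaker's own median F0 as the reference is an assumption, and so is the convention that unvoiced frames are marked with F0 = 0.

```python
import math
import statistics

def hz_to_semitones(f0_hz, ref_hz):
    """Convert one F0 value to semitones relative to ref_hz; None if not voiced."""
    if f0_hz <= 0 or ref_hz <= 0:
        return None
    return 12.0 * math.log2(f0_hz / ref_hz)

def normalize_contour(f0_contour_hz):
    """Express an F0 contour in semitones relative to the speaker's median F0.

    Assumption: frames with F0 <= 0 are unvoiced and are returned as None.
    """
    voiced = [f for f in f0_contour_hz if f > 0]
    if not voiced:
        return [None] * len(f0_contour_hz)
    reference = statistics.median(voiced)
    return [hz_to_semitones(f, reference) for f in f0_contour_hz]

# Toy contour in Hz with one unvoiced frame (not real data).
contour_st = normalize_contour([100.0, 0.0, 200.0, 400.0])
```

Because the result depends only on frequency ratios, an octave is 12 semitones for every speaker, which is what makes the scale less sensitive to inter-speaker differences in baseline pitch.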