Machine Learning in the Parkinson’s disease smartwatch (PADS) dataset


J. Varghese, A. Brenner, et al.

The PADS dataset derives from a three-year study involving 504 participants. It integrates multimodal data from a smartphone app and smartwatches, and machine learning models trained on it distinguish Parkinson's disease from healthy controls and from differential diagnoses. The authors are Julian Varghese, Alexander Brenner, Michael Fujarski, Catharina Marie van Alen, Lucas Plagwitz, and Tobias Warnecke.
Introduction

Parkinson’s disease (PD) is a common neurodegenerative disorder with major motor (rigidity, tremor, bradykinesia, gait issues) and non-motor symptoms that degrade quality of life. Diagnosis is primarily clinical, sometimes supported by imaging, but heterogeneous presentations and progression complicate early and accurate diagnosis. Technology-based systems and digital biomarkers offer objective measures for prediction, prognosis, and severity assessment; here the focus is diagnostic discrimination. Prior work using various sensors (hand/gait/balance/eye/voice) has shown promising PD-vs-healthy control (HC) accuracy, including large smartphone datasets, but often with self-reported diagnoses, potential selection and demographic imbalances, and limited inclusion of PD differential diagnoses (DD). Importantly, models trained to separate PD from HC may not generalize to differentiating PD from similar movement disorders (e.g., Essential Tremor, atypical parkinsonism, Multiple Sclerosis). Interactive assessments may better provoke subtle motor abnormalities compared to passive monitoring. To address gaps in breadth, annotation, and comparability, the authors developed a Smart Device System (SDS) combining a smartphone app, two wrist-worn smartwatches, and electronic questionnaires to conduct an 11-step interactive neurological movement assessment (~15 minutes) with strict protocol and supervision. The system captures bilateral wrist acceleration/rotation signals and integrates demographics, medical history, and non-motor symptom data. Over three years, more than 5,000 measurement steps were recorded from a large cohort including PD, DD, and HC. The study introduces and publicly releases the Parkinson’s Disease Smartwatch (PADS) dataset: after exclusions and age-matching, 469 individual cases remain, with detailed annotations. 
The authors also implemented and open-sourced a comprehensive ML pipeline integrating classical features, symbolic Fourier methods, and deep learning, along with questionnaire-based classifiers and classifier stacking. They evaluate two diagnostic tasks: PD vs. HC and PD vs. DD, hypothesizing the former is easier as it detects pathology versus health, whereas the latter requires PD-specific feature learning to distinguish clinically similar disorders. Overall, the work aims to provide a high-quality multimodal dataset and a reproducible ML framework to improve diagnostic accuracy and enable future monitoring applications.

Literature Review

The paper reviews prior technology-based PD diagnostic efforts across multiple modalities including hand movement, gait/balance, eye movements, and voice, many achieving high PD-vs-HC accuracy. A large smartphone-based study included >8,000 subjects but relied on self-reported diagnoses. Several studies using video or smartphone tasks (e.g., finger tapping, spiral drawing) reported high accuracy; however, many did not control for confounders (age, gender) or lacked sufficient DD control groups. Models often learn general movement abnormality rather than PD-specific features, risking misclassification of other movement disorders. Inclusion of comparable DD cohorts is limited in size and scope in prior literature, and passive monitoring predominates over interactive assessments that can elicit re-emergent tremor and other subtle signs. Dataset heterogeneity further impedes comparability. The authors position PADS as addressing these gaps with standardized interactive tasks, rich clinical annotations, and inclusion of PD alongside multiple DDs for robust, clinically relevant evaluation.

Methodology

Design and participants: This cross-sectional study (2018–2021) was registered at ClinicalTrials.gov (NCT03638479) and approved by the University of Münster ethics board and the physician’s chamber of Westphalia-Lippe (2018-328-F-S). Written informed consent was obtained. The study was conducted at the University Hospital Münster outpatient movement disorders clinic. Three groups were enrolled: Parkinson’s disease (PD), differential diagnoses (DD; including Essential Tremor, atypical parkinsonism such as PSP/MSA, Multiple Sclerosis, and other disorders), and healthy controls (HC). Controls were age-matched to PD. Diagnoses were made by board-certified neurologists and reviewed by a senior movement disorder expert; DaT scans were performed when indicated. A perfect gold standard for PD does not exist, so label accuracy reflects state-of-the-art clinical practice.
Cohort characteristics (excerpt from the paper's table):
  • PD: n=276 (195 M / 81 F), age 65.4±9.6 years
  • HC: n=79 (29 M / 50 F), age 62.9±12.5 years
  • DD: n=114 (57 M / 57 F), age 62.4±11.5 years
The full study recorded 5,544 assessment steps; after exclusions and age matching, 469 cases are included. The dataset release reports 449 assessments with signals plus rich metadata.
Data acquisition system: A smartphone app orchestrated electronic questionnaires (including non-motor symptom domains) and an interactive 11-step wrist-movement assessment using two synchronized smartwatches, one on each wrist. The assessment was designed by movement disorder specialists to provoke subtle motor abnormalities (e.g., through resting, postural, and kinetic tasks).
Assessment steps (duration in seconds):
  • 1a Resting, eyes closed (20)
  • 1b Resting with serial sevens (20)
  • 2 Lift/extend arms (10)
  • 3 Maintain arms lifted (10)
  • 4 Hold 1-kg weights (5 s each side; total 10)
  • 5 Finger-to-examiner hand pointing (10)
  • 6 Simulated drinking (10)
  • 7 Cross and extend arms (10)
  • 8 Bring index fingers together (10)
  • 9 Finger-to-nose alternating (10)
  • 10 Entrainment via paced stomping/postural (20)
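
The assessment steps listed above can be written down as a small data structure, which makes it easy to check the step count and the total recording time. This is only an illustrative sketch: the names `Step`, `PROTOCOL`, and `total_recording_seconds` are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    code: str      # step label as used in the protocol description
    task: str      # short task description
    seconds: int   # recording duration of the step


# Steps and durations as listed in the Methodology text.
PROTOCOL = [
    Step("1a", "Resting, eyes closed", 20),
    Step("1b", "Resting with serial sevens", 20),
    Step("2", "Lift/extend arms", 10),
    Step("3", "Maintain arms lifted", 10),
    Step("4", "Hold 1-kg weights (5 s each side)", 10),
    Step("5", "Finger-to-examiner hand pointing", 10),
    Step("6", "Simulated drinking", 10),
    Step("7", "Cross and extend arms", 10),
    Step("8", "Bring index fingers together", 10),
    Step("9", "Finger-to-nose alternating", 10),
    Step("10", "Entrainment via paced stomping/postural", 20),
]


def total_recording_seconds(protocol=PROTOCOL):
    """Sum of the listed per-step recording durations."""
    return sum(step.seconds for step in protocol)
```

Summing the listed durations gives 140 seconds of recorded signal across the 11 steps.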
Bilateral acceleration and rotation signals were captured; extensive annotations include demographics, medical history, and movement and non-motor symptoms.
ML pipeline and evaluation: Two classification tasks were evaluated: (1) PD vs. HC and (2) PD vs. DD. A 5-fold nested cross-validation (CV) with a fixed random seed was used: the outer 5 folds served for testing and the inner 5 folds for hyperparameter optimization (grid search), which covered input modality selection, model choice, and training parameters. Performance metrics were balanced accuracy (primary), precision, recall, and F1; results were averaged across outer folds, with per-fold reporting. To avoid identity confounding, all time series from a subject were contained within a single sample and never split across train and test sets.
Class imbalance/confounding control: Although age was matched, gender and class distributions remained imbalanced. Loss functions therefore used a weighting akin to scikit-learn’s class-balanced weighting, computed over joint gender-class groups to reduce spurious correlations. An additional evaluation on gender-matched subsets (random undersampling) assessed potential bias.
Modeling options: Three options were used for the smartwatch signals.
Option A: manually engineered features from acceleration, combining spectral features (Welch PSD from 1–19 Hz, log-scaled; 19 bins per channel) with time-domain statistics over four segments (standard deviation, maximum absolute amplitude, mean absolute energy; 12 features per channel), concatenated across channels. Classifiers: SVM, CatBoost, and a fully connected feed-forward NN.
Option B: automatic feature extraction via classical signal processing, using Symbolic Fourier Approximation (SFA) and Bag-of-Symbols at multiple scales (window sizes 20/40/80), with per-channel features concatenated; classifiers as in Option A.
Option C: deep learning with XceptionTime to model temporal and spatial structure across channels. Channels were separated by axis, sensor, and arm with some aggregation; the model was trained with balanced cross-entropy and a 1-cycle learning policy, with data augmentation via randomized time warping and rotations.
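The Option A features described above can be sketched with NumPy and SciPy. This is a minimal illustration, not the authors' implementation. Three points are assumptions not stated in the text: the sampling rate `FS = 100` Hz, the use of 1-second Welch segments to obtain 1 Hz bins, and the reading of "mean absolute energy" as the mean squared amplitude. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import welch

FS = 100  # Hz; ASSUMED sampling rate, not given in the text above


def spectral_features(x, fs=FS):
    """Log-scaled Welch PSD in the 1-19 Hz band (19 bins at 1 Hz resolution)."""
    # ASSUMPTION: 1-second segments, which yield 1 Hz frequency resolution.
    freqs, psd = welch(x, fs=fs, nperseg=fs)
    band = (freqs > 0.5) & (freqs < 19.5)  # keeps the bins at 1, 2, ..., 19 Hz
    return np.log(psd[band] + 1e-12)       # small offset avoids log(0)


def time_features(x, n_segments=4):
    """Three statistics per segment over four segments: 12 features."""
    feats = []
    for seg in np.array_split(x, n_segments):
        # ASSUMPTION: "mean absolute energy" taken as mean squared amplitude.
        feats += [seg.std(), np.abs(seg).max(), np.mean(seg ** 2)]
    return np.asarray(feats)


def option_a_features(recording):
    """recording: array of shape (n_channels, n_samples).

    Returns the per-channel features (19 spectral + 12 time-domain)
    concatenated across channels.
    """
    per_channel = [
        np.concatenate([spectral_features(ch), time_features(ch)])
        for ch in recording
    ]
    return np.concatenate(per_channel)
```

Under these assumptions each channel contributes 31 values, so a 10-second recording with six acceleration channels (three axes, two wrists) yields a 186-dimensional vector that could be passed to any of the listed classifiers.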
Questionnaire data (Option D): CatBoost gradient-boosted decision trees, which are designed to handle categorical variables, were used; the learning rate followed CatBoost’s default adjustment to dataset size and number of iterations. The implementation used the scikit-learn, skorch, tsai, and catboost packages.
Classifier stacking: For multimodal fusion, stacking combined the best smartwatch-based and questionnaire-based classifiers (as selected by inner CV) in each outer fold; the meta-classifier was a logistic regression fit on internal CV predictions.
Feature importance: Grouped permutation importance assessed information gain by shuffling grouped inputs, namely questionnaire domains (e.g., Sleep/Activity) and smartwatch feature groups by assessment category (Resting/Postural/Kinetic), and recording the resulting decreases in balanced accuracy across folds and repetitions.
Reproducibility and availability: Randomization seeds and hyperparameters are provided, and the code repository is publicly available. The de-identified PADS dataset (449 assessments with signals from both wrists across 11 tasks, plus demographics, medical history, and non-motor symptoms) is publicly accessible upon publication, with documentation.
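The stacking scheme described above can be sketched with scikit-learn: within each outer fold, each modality's classifier produces out-of-fold probabilities on the training portion, and a logistic regression is fit on those. This is a simplified illustration under stated assumptions, not the authors' pipeline. It omits the inner grid search and model selection, uses an SVM for the smartwatch features, and substitutes a random forest for CatBoost on the questionnaire features. The function name `stacked_cv` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC


def stacked_cv(X_watch, X_quest, y, seed=0):
    """Balanced accuracy per outer fold for a two-modality stacked classifier.

    X_watch, X_quest: NumPy feature matrices with one row per subject.
    y: NumPy array of binary labels (0/1).
    """
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in outer.split(X_watch, y):
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        bases = [
            # Smartwatch modality (SVM is one of the classifiers named above).
            (SVC(probability=True, class_weight="balanced", random_state=seed),
             X_watch),
            # Questionnaire modality; random forest is a STAND-IN for CatBoost.
            (RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                    random_state=seed),
             X_quest),
        ]
        meta_train, meta_test = [], []
        for model, X in bases:
            # Out-of-fold probabilities: the meta-classifier never sees
            # predictions made on data the base model was trained on.
            oof = cross_val_predict(model, X[train], y[train], cv=inner,
                                    method="predict_proba")[:, 1]
            meta_train.append(oof)
            # Refit on the full training portion to score the outer test fold.
            fitted = model.fit(X[train], y[train])
            meta_test.append(fitted.predict_proba(X[test])[:, 1])
        meta = LogisticRegression(class_weight="balanced")
        meta.fit(np.column_stack(meta_train), y[train])
        pred = meta.predict(np.column_stack(meta_test))
        scores.append(balanced_accuracy_score(y[test], pred))
    return np.array(scores)
```

Because every subject is a single row in both matrices, the outer split keeps each subject entirely in either the training or the test portion, in line with the identity-confounding safeguard described earlier.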

Key Findings
  • Overall performance: Classifier stacking (smartwatch + questionnaire) achieved average balanced accuracy of 91.16% for PD vs. HC and 72.42% for PD vs. DD (complete test folds). On gender-matched subsets, balanced accuracy was 89.25% (PD vs. HC) and 69.56% (PD vs. DD).
  • Single-modality results (mean across folds, complete test sets): Smartwatch-only models: PD vs. HC 78.99% balanced accuracy; PD vs. DD 69.18%. Questionnaire-only (CatBoost on non-motor symptoms): PD vs. HC 89.79%; PD vs. DD 67.77%.
  • Matched subset results: Smartwatch-only PD vs. HC 76.39%; PD vs. DD 67.94%. Questionnaire-only PD vs. HC 87.67%; PD vs. DD 67.12%.
  • Precision/Recall/F1 trends (PD vs. HC, complete): Stacking yielded high precision (96.98%), recall (92.40%), F1 (94.20%). For PD vs. DD (complete), stacking improved balanced accuracy to 72.42% with F1 60.45%, precision 54.92%, recall 67.71%.
  • Feature importance: Both modalities contributed. For PD vs. HC, the “Sleep/Activity” questionnaire domain showed highest information gain (indicative of general pathology vs. health). For PD vs. DD, smartwatch-derived “Kinetic tasks” contributed most, suggesting relevance for PD-specific differentiation among movement disorders.
  • Data scale: After exclusions and age matching, 469 cases were analyzed. The dataset offers rich annotations and synchronized bilateral smartwatch recordings across standardized tasks.
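
The feature-importance finding above rests on grouped permutation importance. A minimal sketch of that idea follows, assuming an already fitted scikit-learn-style classifier and a single evaluation set; the paper aggregates over folds and repetitions, which this sketch does not reproduce. The function name and the `groups` mapping are hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score


def grouped_permutation_importance(model, X, y, groups, n_repeats=10, seed=0):
    """Mean drop in balanced accuracy when a group of columns is shuffled.

    groups: dict mapping a group name (e.g. "Kinetic") to a list of
    column indices of X that belong to that group.
    """
    rng = np.random.default_rng(seed)
    baseline = balanced_accuracy_score(y, model.predict(X))
    drops = {}
    for name, cols in groups.items():
        losses = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            perm = rng.permutation(len(X))
            # Shuffle all columns of the group with the same row permutation,
            # so relationships inside the group are preserved while the
            # link to the labels is broken.
            X_perm[:, cols] = X[perm][:, cols]
            score = balanced_accuracy_score(y, model.predict(X_perm))
            losses.append(baseline - score)
        drops[name] = float(np.mean(losses))
    return drops
```

A larger drop for a group means the model relied more on it; a group whose shuffling leaves balanced accuracy unchanged contributed little to that model's predictions.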
Discussion

The study demonstrates that multimodal integration of smartwatch signals and patient-reported non-motor symptoms can improve diagnostic discrimination of PD, particularly against healthy controls and, to a lesser degree, against clinically similar DDs. The stacked classifier outperformed the single-modality approaches in balanced accuracy, underscoring that the two sources carry complementary information: questionnaires capture a broader symptomatic burden beyond the brief sensor snapshot, while interactive movement tasks elicit PD-relevant kinematic patterns (notably in kinetic tasks) that are useful for DD discrimination. The smartwatches' measurement precision was previously validated against a gold-standard device, which supports the reliability of the extracted movement features; the devices showed higher sensitivity to tremor parameters than typical clinical observation. While deep learning architectures hold promise, their performance is constrained by dataset size; classical symbolic Fourier-based features (SFA with Bag-of-Symbols) performed strongly across folds. The findings address the core question of whether consumer-grade devices and an interactive app-based protocol, paired with advanced ML, can yield clinically meaningful diagnostic signals: high balanced accuracy for PD vs. HC and reasonable, though more modest, accuracy for PD vs. DD reflect both the promise and the inherent difficulty of distinguishing closely related phenotypes. Feature importance analyses corroborate the value of both non-motor domains and specific movement task categories, and can guide future protocol optimization. The publicly released dataset and code provide a benchmark for reproducible research and algorithmic advances in movement disorder diagnostics.

Conclusion

This work introduces and publicly releases the PADS dataset, the largest collection to date of smartwatch recordings from a standardized interactive assessment including PD and multiple DDs, enriched with detailed demographics, medical history, and non-motor symptom data. A reproducible ML pipeline integrating classical features, symbolic Fourier approaches, deep learning, and questionnaire-based modeling achieved high diagnostic performance for PD vs. HC and improved PD vs. DD discrimination via multimodal stacking. These contributions establish a strong foundation for future research on digital biomarkers for diagnosis, monitoring, and treatment optimization. Given the consumer-grade hardware and app-guided protocol, the system could enable scalable home-based assessments. Future directions include expanding to multi-site cohorts, increasing sample sizes to benefit deep models, refining task protocols based on feature importance, integrating additional sensors, implementing automated data quality control for unsupervised settings, and extending analyses to disease progression and severity prediction.

Limitations
  • Diagnostic labels reflect best clinical practice (including DaT scans when indicated) but are not a perfect gold standard; some mislabeling risk remains.
  • Single in-clinic, one-time assessment may not capture temporal variability of PD symptoms; generalization to daily life requires further study.
  • Demographic imbalances (notably gender) and class imbalance persist despite age matching; although sample weighting and matched-subset analyses were used, residual confounding cannot be fully excluded.
  • Relatively high variance across folds (e.g., up to 12.33% SD in PD vs. DD balanced accuracy) suggests sensitivity to limited test sizes and potential overfitting in high-dimensional feature spaces.
  • Home-based deployments may suffer reduced data quality without supervision; automated quality control is recommended.
  • Clinical scales such as Hoehn and Yahr were not included in the dataset and would require extraction from clinical documents for further analyses.
  • Deep learning models may be underpowered given current sample sizes; larger datasets are needed to realize their full potential.
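
The sample weighting mentioned in the limitations is described in the Methodology as akin to scikit-learn's class-balanced weighting, applied to joint gender-class groups. One plausible reading is sketched below, using the balanced-weight formula n_samples / (n_groups * group_count); the exact scheme in the authors' code may differ, and the function name is hypothetical.

```python
from collections import Counter


def joint_group_weights(labels, genders):
    """Per-sample weights that balance joint (label, gender) groups.

    Uses the same formula as scikit-learn's "balanced" class weights,
    n_samples / (n_groups * count_of_group), but with groups defined
    by the pair (diagnosis label, gender) instead of the label alone.
    """
    groups = list(zip(labels, genders))
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]
```

With the PD vs. HC cohort counts reported in the Methodology (PD: 195 M / 81 F; HC: 29 M / 50 F), each HC man would receive a weight of about 3.06 and each PD man about 0.46, so every gender-class group carries the same total weight in the loss.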