Using AI to measure Parkinson's disease severity at home
M. S. Islam, W. Rahman, et al.
Parkinson's disease (PD) is the fastest-growing neurological disease and has no cure; regular assessments and medication adjustments help manage symptoms, yet access to specialty neurological care is limited globally. Bradykinesia is commonly evaluated in clinic with a finger-tapping task scored on the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Prior video-based analyses typically used small cohorts, binary classifications, non-interpretable models, and clinic-recorded, low-noise data, which raises concerns about generalizability to at-home recordings. This study asks whether an AI system using webcam-recorded finger-tapping can reliably and interpretably estimate MDS-UPDRS finger-tapping severity (0–4) from home videos, improving accessibility and enabling remote monitoring. The authors collected a large at-home dataset, designed clinically aligned, interpretable features, and compared AI predictions against expert and non-expert clinician ratings.
Prior work applied video or smartphone-based analysis to bradykinesia and related movement disorders, but often with fewer than 20 participants or limited to binary severity or case-control classification without producing graded MDS-UPDRS severity. Many models lacked interpretability needed for clinical adoption and were trained on clean, clinic-recorded data, risking performance degradation under home-environment noise and data shift. Related efforts (e.g., supervised bradykinesia classification from smartphone videos; automatic severity estimation for ataxia from finger-tapping) demonstrate feasibility but do not provide robust, interpretable, continuous severity estimation aligned to MDS-UPDRS from large, unsupervised at-home recordings. This work addresses these gaps by: collecting a larger, mostly at-home dataset; deriving clinically meaningful, interpretable kinematic features (speed, amplitude, rhythm, hesitations); and rigorously comparing against expert and non-expert raters with reliability analyses.
Study population and data collection: 250 participants (172 with self-reported PD, 78 controls) recorded finger-tapping with both hands via a public web tool (parktest.net), primarily at home (≈80%); 48 recorded in clinic. Each hand was treated as a separate video (initially 500), with 11 excluded after quality checks, yielding 489 videos (244 left, 245 right). Demographics (age, sex, ethnicity, PD diagnosis) were self-reported.
Ground truth ratings: Three US expert neurologists (≥5 years PD experience) independently scored each video using MDS-UPDRS finger tapping (0–4). Ground truth per video was set by majority agreement (451/489) or rounded average if no majority (38/489). Two additional MDS-UPDRS-certified non-experts also rated videos for comparative purposes. Inter-rater reliability among experts: Krippendorff’s alpha 0.69; ICC 0.88 (95% CI [0.86, 0.90]).
Video processing and keypoint tracking: Videos were split into left/right segments. MediaPipe Hands detected 21 hand keypoints per frame; the target hand was selected using handedness labels and heuristics for multiple detections (largest hand via wrist–thumb-tip distance). Keypoints used for primary kinematics: WRIST (0), THUMB_TIP (4), INDEX_FINGER_TIP (8). To normalize for camera distance, the main measure was the angle at the wrist subtended by the wrist–thumb and wrist–index vectors, computed per frame (~30 fps), producing a continuous time series. Frames with missing/low-confidence hand detections (hand presence <0.90) were marked as missing (angle = −1) and interpolated when neighbors were valid; leading/trailing non-task segments were removed by keeping the longest contiguous valid segment.
Noise reduction and tap segmentation: A custom peak-detection algorithm identified tapping peaks using task-specific constraints (peak-bottom alternation, minimum inter-peak interval, peak prominence above 25th percentile).
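The angle measure and missing-frame handling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the per-frame dictionary layout, and the use of 2D (x, y) coordinates on a common scale are assumptions, and only single-frame gaps are interpolated here because the summary does not say how longer gaps were treated.

```python
import math

MISSING = -1.0  # sentinel for frames without a confident hand detection


def wrist_angle(wrist, thumb_tip, index_tip):
    """Angle in degrees at the wrist between wrist->thumb and wrist->index."""
    ax, ay = thumb_tip[0] - wrist[0], thumb_tip[1] - wrist[1]
    bx, by = index_tip[0] - wrist[0], index_tip[1] - wrist[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        return MISSING  # degenerate keypoints, treat as missing
    cos_theta = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
    return math.degrees(math.acos(cos_theta))


def angle_series(frames, min_presence=0.90):
    """frames: list of dicts with 'presence' and 'keypoints' (hypothetical layout).

    'keypoints' maps landmark index -> (x, y); 0 = wrist, 4 = thumb tip,
    8 = index finger tip, as listed in the text.
    """
    series = []
    for frame in frames:
        if frame is None or frame["presence"] < min_presence:
            series.append(MISSING)
        else:
            kp = frame["keypoints"]
            series.append(wrist_angle(kp[0], kp[4], kp[8]))
    return series


def fill_single_gaps(series):
    """Linearly interpolate a missing frame when both neighbours are valid."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        if (series[i] == MISSING and series[i - 1] != MISSING
                and series[i + 1] != MISSING):
            filled[i] = 0.5 * (series[i - 1] + series[i + 1])
    return filled


def longest_valid_segment(series):
    """Keep the longest contiguous run of non-missing angles."""
    best_start, best_len, start = 0, 0, None
    for i, value in enumerate(list(series) + [MISSING]):  # sentinel closes last run
        if value != MISSING and start is None:
            start = i
        elif value == MISSING and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return list(series[best_start:best_start + best_len])
```

Because the measure is an angle rather than a fingertip distance, it does not change when the hand moves closer to or farther from the camera, which is the normalization the text refers to.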
To minimize boundary artifacts, the first and last tap were removed, keeping the central clean segment for analysis.
Feature engineering: From per-frame signals, continuous speed (first derivative of angle over frame time) and acceleration (second derivative) were computed. From peak sequences, per-tap period, frequency, and amplitude (peak angle) were derived. Wrist movement features were computed from normalized wrist coordinates (absolute X, Y, and Cartesian displacement per frame). Aggregate rhythm and regularity features included aperiodicity (FFT-based power spectrum entropy), number of interruptions (<50°/s for ≥10 ms), number and longest duration of freezing events (<50°/s for >20 ms), tapping period linearity (R², slope), polynomial degree to fit periods (up to degree 10 with R² ≥ 0.9), and amplitude decrement metrics (end vs. start; linear slope). In total, 65 features were computed; highly correlated pairs (|r|>0.85) were pruned, yielding 53 features. Pearson correlations versus ground truth identified 18 significant features (α=0.01), including speed (IQR, median, max, min), acceleration (min), amplitude (median, max), frequency (IQR, SD), period (entropy, IQR, min), interruptions, freezing metrics, aperiodicity, period fitting complexity, and wrist movement (minimum Cartesian distance).
Modeling and evaluation: Multiple regressors (SVR, Random Forest, AdaBoost, XGBoost, LightGBM, shallow NNs) were tuned via extensive hyperparameter search (Weights & Biases). Feature selection used Boosted Recursive Feature Elimination (BoostRFE) with a LightGBM base learner; the best pipeline selected the top 22 of 53 features and StandardScaler. Class imbalance (few class 3, 4 samples) was explored with SMOTE but did not improve performance and was not used. Performance was assessed with leave-one-patient-out cross-validation (both hands of a patient held out together).
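A few of the kinematic features described above can be sketched from an angle series and a list of detected peak frames. The helper names are hypothetical, peak detection itself is assumed to have been done already, and the `min_frames` parameter is an assumption: the summary states the interruption and freezing durations in milliseconds but not how they map onto frames.

```python
def speed_series(angles, fps=30.0):
    """Finite-difference angular speed in degrees per second."""
    return [(b - a) * fps for a, b in zip(angles, angles[1:])]


def tap_periods(peak_frames, fps=30.0):
    """Seconds between consecutive tapping peaks."""
    return [(b - a) / fps for a, b in zip(peak_frames, peak_frames[1:])]


def tap_amplitudes(angles, peak_frames):
    """Peak angle of each tap, used as its amplitude."""
    return [angles[i] for i in peak_frames]


def linear_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    return sxy / sxx


def count_slow_runs(speeds, threshold=50.0, min_frames=1):
    """Count maximal runs with |speed| < threshold lasting >= min_frames frames."""
    runs, current = 0, 0
    for s in list(speeds) + [threshold]:  # sentinel closes a trailing run
        if abs(s) < threshold:
            current += 1
        else:
            if current > 0 and current >= min_frames:
                runs += 1
            current = 0
    return runs
```

A negative `linear_slope` over the tap amplitudes corresponds to the amplitude decrement that the rating scale looks for, and `count_slow_runs` is one way to approximate the interruption and freezing counts by varying `min_frames`.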
Metrics: MAE, MSE, PCC, Spearman’s ρ, Kendall’s τ, MAPE, and classification accuracy after rounding predictions to 0–4. Model interpretability used SHAP to attribute feature importance and assess alignment with clinically meaningful features.
Bias and robustness analyses: Model errors were compared between groups defined by sex, PD status, and ethnicity (white vs. non-white) using t-tests, and their association with age was assessed by correlation. Video quality impact was assessed by expert-flagged difficulty (high vs. low quality), MediaPipe hand presence scores, and controlled noise injections (blur and Gaussian noise) to high-quality videos; effects on pose tracking and model errors were evaluated statistically.
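The evaluation protocol above (leave-one-patient-out folds, then error and accuracy metrics) can be sketched as follows. The study used a tuned LightGBM regressor; the mean-of-training-targets predictor below is only a stand-in so the example stays self-contained. The function names and the half-up rounding rule are assumptions.

```python
import math


def lopo_folds(patient_ids):
    """Yield (train_idx, test_idx), holding out every video of one patient."""
    for pid in sorted(set(patient_ids)):
        test = [i for i, p in enumerate(patient_ids) if p == pid]
        train = [i for i, p in enumerate(patient_ids) if p != pid]
        yield train, test


def mean_baseline_predictions(patient_ids, y):
    """Stand-in model: predict the mean training score for each held-out video."""
    preds = [0.0] * len(y)
    for train, test in lopo_folds(patient_ids):
        mean_train = sum(y[i] for i in train) / len(train)
        for i in test:
            preds[i] = mean_train
    return preds


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def rounded_accuracy(y_true, y_pred, lo=0, hi=4):
    """Accuracy after rounding half-up and clipping predictions to [lo, hi]."""
    hits = 0
    for t, p in zip(y_true, y_pred):
        label = min(hi, max(lo, math.floor(p + 0.5)))
        hits += int(label == t)
    return hits / len(y_true)
```

Grouping folds by patient rather than by video keeps a participant's left and right hands on the same side of the split, so a model cannot benefit from having seen the other hand of a test patient.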
- Dataset and ratings: 250 participants (172 PD, 78 controls) produced 489 analyzable hand videos; experts showed high agreement (ICC 0.88, 95% CI [0.86, 0.90]); all three experts agreed in 30.7% of videos; at least two agreed in 93%.
- Feature correlations: 22 features significantly correlated with ground truth (α=0.01). Top correlations (Table 2): speed IQR (r = −0.56, p = 1e−33), speed median (r = −0.52, p = 1e−27), amplitude median (r = −0.50, p = 1e−25), amplitude max (r = −0.41, p = 1e−17), frequency SD (r = 0.29, p = 1e−8), speed max (r = −0.32, p = 1e−10), frequency IQR (r = 0.32, p = 1e−10), period entropy (r = 0.32, p = 1e−10), period variance normalized (r = 0.28, p = 1e−8), period IQR (r = 0.27, p = 1e−7).
- Model performance (LightGBM, LOPO-CV): MAE = 0.58 points; MSE = 0.536; PCC = 0.66; Spearman’s ρ ≈ 0.64; Kendall’s τ ≈ 0.515; MAPE ≈ 32.0%; classification accuracy after rounding = 50.92%; ICC = 0.76 (95% CI [0.71, 0.80]). Compared with human raters, the model outperformed the non-experts (average non-expert MAE = 0.83; PCC = 0.61; ICC ≈ 0.75 against ground truth) but fell short of expert-level agreement (pairwise expert MAE = 0.53; PCC = 0.72).
- Confusion trends: 63% of ground-truth class 0 videos were predicted as class 1, and accuracy was poorer on classes 3–4, consistent with class imbalance (only 54 videos in class 3 and 5 in class 4). These errors account for the 50.92% overall accuracy of the rounded regression output.
- Interpretability: SHAP-identified top features aligned with statistically significant clinical features (e.g., speed IQR, freezing counts/durations, aperiodicity, period variance/min, wrist movement metrics, frequency IQR). Median period appeared in top SHAP features despite not being significantly correlated, consistent with MDS-UPDRS emphasis.
- Fairness/bias: No significant error differences by sex (MAE males 0.60±0.48, females 0.55±0.39; p=0.21), PD status (PD 0.57 vs. non-PD 0.59; p=0.61), or age (error vs. age r = −0.06; p=0.20). White vs. non-white: MAE 0.57±0.45 vs. 0.65±0.41; p=0.29 (slightly worse for non-white, not significant).
- Video quality impact: Expert ICC for high-quality vs. low-quality videos: 0.879 (95% CI [0.86, 0.90]) vs. 0.806 (95% CI [0.73, 0.86]); majority agreement not associated with quality (χ²=0.9965, p=0.318). MediaPipe hand presence mean ≈0.967 (high-quality) vs. 0.962 (low-quality); t=0.91, p=0.36. Injected noise reduced hand presence significantly for substantial blur and higher Gaussian noise. Model errors similar across quality groups: MAE 0.581 (high) vs. 0.578 (low); t=0.064, p=0.95.
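The rater-agreement figures in the first bullet depend on how three expert scores are combined per video. A minimal sketch of the rule described in the methodology (majority score, otherwise the rounded average) is shown below; the function names are hypothetical and half-up rounding is an assumption.

```python
import math
from collections import Counter


def consensus_score(ratings):
    """Majority score among raters, else the rounded mean of all ratings."""
    score, votes = Counter(ratings).most_common(1)[0]
    if votes > len(ratings) / 2:
        return score
    return math.floor(sum(ratings) / len(ratings) + 0.5)


def agreement_summary(all_ratings):
    """Return (fraction unanimous, fraction with a strict majority)."""
    n = len(all_ratings)
    unanimous = sum(1 for r in all_ratings if len(set(r)) == 1)
    majority = sum(
        1 for r in all_ratings
        if Counter(r).most_common(1)[0][1] > len(r) / 2
    )
    return unanimous / n, majority / n
```

With three raters and integer scores, a video has no majority only when all three scores differ, so the mean is never exactly halfway between two integers and the choice of rounding rule does not affect the result in that setting.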
The study demonstrates that expert neurologists can reliably rate at-home finger-tapping videos (ICC 0.88), and that an interpretable AI model trained on clinically aligned kinematic features can approach expert-level agreement (MAE 0.58 vs. 0.53 between experts) and surpass non-expert raters (MAE 0.83) for MDS-UPDRS finger-tapping severity. The model’s significant feature-behavior relationships (e.g., speed variability, amplitude, rhythm/aperiodicity, freezing) match clinical expectations, supporting interpretability and clinical relevance. Robustness analyses found no statistically significant differences in error across sex, age, or PD status, and only a nonsignificant trend toward higher error in non-white participants. Despite variability in home video quality, pose tracking generally remained confident and model performance was unaffected at group level, though synthetic noise stress tests reveal sensitivity to severe artifacts. These findings support the feasibility of remote, objective, scalable severity assessment of bradykinesia-related performance using consumer webcams, potentially extending access to care and enabling continuous monitoring.
This work introduces an at-home, AI-driven, interpretable assessment of MDS-UPDRS finger-tapping severity using webcam videos. Contributions include: (1) demonstrating reliable expert ratings of at-home recordings; (2) developing clinically meaningful digital biomarkers that correlate with expert severity scores; and (3) building a LightGBM model that outperforms MDS-UPDRS-certified non-experts and approaches expert agreement. The system can facilitate remote evaluations, augment clinicians between visits, and may generalize to other motor and neurological tasks (e.g., tremor, speech, facial expression, gait). Future directions include longitudinal monitoring (ON/OFF medication states), expanding to a larger and more diverse dataset with better balance for severe classes, implementing real-time quality control “gatekeepers,” enhancing fairness for underrepresented groups, and extending the platform to a broader neurological test battery.
- Potential confounding by tremor: Tremors may disrupt peak detection and pose tracking, affecting derived features (period, frequency, amplitude) and bradykinesia assessment; tremor diagnoses were unavailable to analyze this interaction.
- Dataset size and class imbalance: Although the cohort of 250 participants is large for this domain, the 489 videos include few severe cases (class 3: 54; class 4: 5), limiting performance on higher severities and overall classification accuracy.
- Demographic imbalance: 92% of participants self-reported as white; slightly higher errors observed for non-white subjects (not statistically significant), potentially due to underrepresentation.
- Video quality variability: While group-wise performance was similar across expert-labeled quality categories, severe synthetic noise degraded hand tracking confidence, indicating sensitivity to extreme artifacts.
- Generalization and overfitting risks: The sample is small relative to the heterogeneity of the disease, and SMOTE oversampling did not improve performance; more data and improved collection (including per-hand recording and automated hand selection) are needed.
- Clinical scope: The tool assesses a single MDS-UPDRS task and is intended for symptom monitoring rather than standalone diagnosis; real-world deployment will require attention to privacy, security, and fairness.

