
Medicine and Health
A machine learning contest enhances automated freezing of gait detection and reveals time-of-day effects
A. Salomon, E. Gazit, et al.
This study organized a machine-learning contest to tackle automated detection of freezing of gait (FOG) in Parkinson's disease, attracting 1,379 teams and 24,862 submitted solutions. The winning algorithms not only achieved strong accuracy but also revealed new insights into when FOG occurs during daily life. Led by a team including Amit Salomon and Leslie C. Kirsch, the work demonstrates how machine-learning competitions can be harnessed to address pressing medical problems.
Introduction
Freezing of gait (FOG) in Parkinson's disease (PD) is a common and disabling symptom characterized by a sudden inability to initiate or continue walking, degrading quality of life and increasing fall risk. Existing assessments rely on subjective questionnaires (e.g., NFOG-Q) and labor-intensive expert video annotation of FOG-provoking tests, which limit reliability, scalability, and ecological validity. Automated detection using wearable sensors and machine learning has emerged, but prior work often used small datasets and multiple sensors, reported metrics ill-suited to imbalanced classes, and lacked validation in real-world conditions. This study aimed to accelerate the development of accurate, objective FOG detection from a single lower-back inertial sensor by organizing a global, three-month machine-learning challenge with three goals: (1) achieve high performance in classifying FOG (start hesitation, turning, walking) during FOG-provoking tests; (2) benchmark algorithms against gold-standard video annotations; and (3) apply the top models to continuous daily-living data to explore time-of-day patterns and differences between freezers and non-freezers.
Literature Review
Prior approaches to FOG detection progressed from threshold-based methods to traditional ML (SVM, random forests) and more recently to deep learning (CNNs, RNNs, transformers, autoencoders). Reported results have been promising but limited by small sample sizes, overfitting, class imbalance, minimal reporting of precision/recall, and use of multiple sensors that reduce practical applicability. Few studies analyzed unsupervised daily-living data, yielding limited and inconsistent outcomes. Standard classification metrics like accuracy and ROC AUC can be misleading for the highly imbalanced FOG detection problem; precision-recall and event-level measures are more informative. The field lacks large, heterogeneous benchmarks and objective, widely applicable single-sensor solutions validated against gold-standard human annotations and in real-world settings.
Methodology
Design: A 3-month open Kaggle competition challenged participants to detect and classify FOG episodes (three classes: Start Hesitation, Turn, Walk) from single lower-back 3D accelerometer signals. The dataset comprised two labeled sources (tDCS FOG and DeFOG) annotated by expert video reviewers, and an unlabeled daily-living dataset for post-competition exploratory analyses.
Data and sensors: Lower-back inertial sensors (APDM Opal at 128 Hz; Axivity AX3/AX6 at 100 Hz) recorded ~92 hours combined, ~64 hours of which were video-annotated, totaling 4,818 FOG events (665.3 min). Labeled classes were Start Hesitation, Turn, and Walk; some DeFOG series also carried a 'notype' label for FOG events whose class could not be determined. Representative signals are shown in Fig. 4; demographics are summarized in Table 4.
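Because the two sources were recorded at different rates (128 Hz and 100 Hz), a model consuming both must handle the mismatch somehow; a minimal sketch of one plausible preprocessing step, harmonizing everything to 100 Hz, is shown below. This is an assumption for illustration, not the competition's official pipeline.

```python
# Sketch (assumed preprocessing, not the official pipeline): resample the
# 128 Hz APDM stream to the 100 Hz Axivity rate so one model can consume both.
import numpy as np
from scipy.signal import resample_poly

def to_100hz(acc: np.ndarray, source_hz: int) -> np.ndarray:
    """Resample a (n_samples, 3) accelerometer array to 100 Hz.

    128 -> 100 Hz corresponds to a rational rate change of 25/32.
    """
    if source_hz == 100:
        return acc
    if source_hz == 128:
        # Polyphase resampling along the time axis, all channels at once.
        return resample_poly(acc, up=25, down=32, axis=0)
    raise ValueError(f"unexpected sample rate: {source_hz}")

# Example: 10 s of synthetic 128 Hz data -> 1,000 samples at 100 Hz.
acc_128 = np.random.randn(1280, 3)
print(to_100hz(acc_128, 128).shape)  # (1000, 3)
```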
Competition splits and scoring: Test data were randomly split into a public (n=26 patients, 945 FOG episodes) and a private (hidden) test set (n=14 patients, 391 validated FOG episodes). Teams submitted per-sample confidence scores for each class. The evaluation metric was mean average precision (mAP), averaged over the three classes, computed only on valid task segments for DeFOG. Leaderboards reflected public (during competition) and private (final) mAP.
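A minimal sketch of the class-averaged average-precision metric described above, using scikit-learn; array layout and class names are illustrative, not the official competition code.

```python
# Macro mAP over the three FOG classes, as used for the leaderboards.
import numpy as np
from sklearn.metrics import average_precision_score

CLASSES = ["StartHesitation", "Turn", "Walking"]

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true: (n_samples, 3) binary labels; y_score: (n_samples, 3) confidences.

    Average precision is computed per class, then averaged across classes.
    """
    aps = [average_precision_score(y_true[:, k], y_score[:, k])
           for k in range(len(CLASSES))]
    return float(np.mean(aps))
```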
Post-competition analyses on test sets: For the top five submissions, precision-recall and ROC curves (AUC), and operating-point metrics (F1, accuracy, precision, recall, specificity) were computed at the precision-recall point closest to (1,1). Agreement with gold-standard expert measures was assessed via ICCs for percent time frozen (%TF), number of FOG episodes, and total FOG duration. After discovering inadvertent subject overlap between training and test sets, sensitivity analyses re-computed metrics excluding overlapping subjects (5 private, 7 public).
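The operating point "closest to (1,1)" can be made concrete as the threshold minimizing the Euclidean distance from the precision-recall curve to the ideal corner; the sketch below shows one way to compute it (an assumption about the exact implementation).

```python
# Select the threshold whose (precision, recall) lies nearest the ideal (1, 1).
import numpy as np
from sklearn.metrics import precision_recall_curve

def operating_point(y_true: np.ndarray, y_score: np.ndarray):
    """Return (threshold, precision, recall) closest to (1, 1)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have len(thresholds) + 1 entries; drop the final
    # (p=1, r=0) endpoint so indices align with thresholds.
    dist = np.hypot(1.0 - precision[:-1], 1.0 - recall[:-1])
    i = int(np.argmin(dist))
    return thresholds[i], precision[i], recall[i]

# Binarizing scores at that threshold then yields F1, accuracy, precision,
# recall, and specificity via standard confusion-matrix formulas.
```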
Daily-living exploratory analyses: Top teams ran their models on week-long 24/7 real-world recordings, focusing on walking bouts (detected with an established gait-bout algorithm) plus adjacent 5 s windows. Predictions from the 1st, 3rd, and 5th place models were also combined in a joint ensemble (FOG if detected by at least one model). Hourly and daily %TF were computed as total FOG time divided by total walking-bout time. Group comparisons between freezers (n=45) and non-freezers (n=19) used Mann–Whitney U tests with Benjamini, Krieger, and Yekutieli correction. Within-freezer diurnal differences used Friedman tests with post-hoc multiple-comparisons control. Stability across days used ICC(A,k) over the first 6 days. Class distribution differences between real-world and FOG-provoking data used Wilcoxon signed-ranks tests.
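An illustrative sketch (with assumed array shapes) of the joint ensemble and the hourly percent-time-frozen statistic described above:

```python
# Joint ensemble and hourly %TF on daily-living data (illustrative only).
import numpy as np

def joint_fog(pred_1st, pred_3rd, pred_5th):
    """Combine per-sample binary FOG detections: FOG if any model fires."""
    return pred_1st | pred_3rd | pred_5th

def hourly_percent_tf(fog: np.ndarray, walking: np.ndarray, hour: np.ndarray):
    """%TF per hour of day = FOG time / walking-bout time within that hour.

    fog, walking: boolean per-sample arrays; hour: integer 0-23 per sample.
    """
    out = np.full(24, np.nan)
    for h in range(24):
        in_hour = (hour == h) & walking
        if in_hour.any():
            out[h] = 100.0 * fog[in_hour].mean()
    return out
```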
Key Findings
- Participation and data: 10,133 registrations; 1,379 teams from 83 countries; 24,862 submissions. Labeled data contained >90 h recordings and ~4,818 FOG episodes.
- Competition performance: Private-test mAP for the top five models: 0.514, 0.451, 0.436, 0.417, 0.390. ROC AUC for overall FOG detection exceeded 0.90 for all winners. Precision-recall analysis showed the best performance for Turn FOG, with trade-offs between Walking and Start Hesitation detection.
- Operating-point metrics (all FOG, private set): Accuracy 0.88–0.92; recall 0.72–0.79; specificity >0.90; precision 0.74–0.84; F1 0.73–0.81. Walking FOG remained challenging; Start Hesitation generally better than Walking for some models; Turn class dominated precision.
- Agreement with gold standards (ICCs):
• %TF: Excellent on private set (e.g., 1st: 0.949 [0.85–0.98]; 2nd: 0.934; 3rd: 0.942); very good when combining private+public (0.852–0.898), all p<0.001.
• Number of episodes: On private set, good for 1st, 2nd, 5th (ICC >0.75, p<0.001), moderate for 3rd (0.717), poor for 4th (0.093). On private+public, all ICCs <0.6 for episode counts.
• Total FOG duration: Highest agreement—ICCs >0.90 for all models (private and private+public), up to 0.991.
- Overlap sensitivity: Excluding overlapping subjects yielded small changes (<5%) in many key measures.
- Daily-living exploratory results:
• Using the 1st-place model, hourly %TF in freezers vs non-freezers showed visual peaks around 7 a.m. and 10 p.m., but no individual hour between 7:00 and 22:00 reached significance with that single model.
• Using the joint model (1st+3rd+5th), %TF differed significantly between freezers and non-freezers at nine daytime hours and two nighttime hours after multiple-comparisons correction. Freezers showed significant diurnal variation vs a night reference (Friedman p<0.001; post-hoc p<0.02).
• %TF stability across days: ICC 0.95 (0.92–0.97) for freezers; 0.90 (0.82–0.96) for non-freezers (both p<0.01). Between-group daily %TF effect size = 0.7 (moderate).
• Class distribution: Turning FOG most common in both contexts, more so in daily-living (94.7%±5.0) vs FOG tests (81.7%±28.7; p=0.002); Walking less common in daily-living (5.3%±5.0 vs 17.6%±27.6; p=0.002); Start Hesitation rare in daily-living (<0.01% vs 0.7%±4.3; p=0.051).
Discussion
The global challenge substantially advanced single-sensor automated FOG detection, achieving high agreement with expert-derived gold standards for %TF and FOG duration and competitive precision/F1 compared with prior studies. Counting discrete episodes was less reliable, likely due to fragmentation of events—post-processing to merge adjacent detections could improve this. Inter-rater-level performance relative to human experts suggests these ML approaches can augment or partially replace time-consuming video annotation.
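One simple form such post-processing could take is merging detections separated by short gaps before counting episodes; the sketch below illustrates the idea, with the gap threshold an assumed parameter rather than a value from the study.

```python
# Hypothetical post-processing: merge fragmented detections into episodes.
def merge_episodes(episodes, max_gap_s=1.0):
    """episodes: sorted list of (start_s, end_s); merge gaps <= max_gap_s."""
    merged = []
    for start, end in episodes:
        if merged and start - merged[-1][1] <= max_gap_s:
            # Close enough to the previous episode: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_episodes([(0.0, 2.0), (2.5, 4.0), (10.0, 12.0)]))
# [(0.0, 4.0), (10.0, 12.0)] -- two episodes instead of three
```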
Ensemble architectures predominated; all top models used GRU/LSTM components to leverage temporal context, and the 1st and 3rd place models incorporated transformers, indicating benefits of self-attention and longer-context segmentation over short-window classification. Using short sequences for training but longer sequences for inference may capture subtle pre-FOG changes, though gait initiation lacks pre-context and may require tailored approaches. Despite strong performance for Turn FOG, Walking and Start Hesitation remain challenging and warrant further model refinement.
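To make the architectural pattern concrete, here is a minimal sketch of the kind of recurrent sequence labeler the winners favored: a generic bidirectional GRU emitting per-sample class logits. This is an illustration of the pattern, not any team's actual model.

```python
# Generic GRU-based per-sample FOG classifier (PyTorch, illustrative).
import torch
import torch.nn as nn

class FOGSequenceLabeler(nn.Module):
    def __init__(self, n_features=3, hidden=128, n_classes=3):
        super().__init__()
        # A bidirectional GRU gives each timestep past and future context.
        self.gru = nn.GRU(n_features, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (batch, time, 3) accelerometer
        h, _ = self.gru(x)           # (batch, time, 2*hidden)
        return self.head(h)          # per-sample class logits

model = FOGSequenceLabeler()
logits = model(torch.randn(4, 1000, 3))  # e.g. 10 s windows at 100 Hz
print(logits.shape)                      # torch.Size([4, 1000, 3])
```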
Dataset imbalance and the natural rarity of FOG complicate evaluation; precision-recall metrics and mAP mitigated imbalance effects better than accuracy/AUC alone. While single-sensor approaches offer practicality and potential for real-world deployment, multi-sensor modalities (e.g., plantar pressure, EMG, EEG, HRV) may further enhance detection at the expense of complexity. Preliminary real-world application suggests time-of-day effects and a predominance of Turning FOG in daily living, but ground-truth validation in free-living conditions is needed before clinical adoption.
Conclusion
A large-scale machine-learning contest produced high-performing, single lower-back sensor-based FOG detection models that align well with gold-standard measures and reduce reliance on expert video labeling. Preliminary application to week-long free-living data revealed reproducible time-of-day patterns and reinforced that turning-related FOG predominates in daily life. Future work should: (1) improve detection of Walking and Start Hesitation FOG; (2) refine episode counting via post-processing; (3) validate real-world performance against ground-truth labels; (4) explore context-aware and multimodal enhancements while preserving practicality; and (5) leverage the released dataset and code as a benchmark for continued method development. Competitions of this kind can rapidly mobilize AI expertise to address complex medical challenges.
Limitations
- Unintentional overlap of some subjects between training and test sets; sensitivity analyses excluding overlaps showed small changes (<5%) but residual bias cannot be fully excluded.
- Daily-living analyses lacked ground-truth labels; findings are preliminary and may reflect false positives (e.g., peaks in non-freezers during periods of greater walking activity).
- Class imbalance (few FOG vs non-FOG samples; Turn FOG predominant) may bias model training and evaluation despite mAP and PR-curve usage.
- Lower performance for Walking and Start Hesitation classes limits class-specific applications.
- Single-sensor approach favors practicality but may underperform compared with multimodal systems; real-time implementation constraints (compute, latency) were not addressed.
- Episode count estimation was less robust, potentially due to fragmented detections requiring post-processing.