A machine learning contest enhances automated freezing of gait detection and reveals time-of-day effects

Medicine and Health

A. Salomon, E. Gazit, et al.

This study organized a machine-learning contest to improve automated detection of freezing of gait (FOG) in Parkinson's disease, attracting 1,379 teams that submitted 24,862 solutions. The winning algorithms detected FOG with high accuracy and revealed new patterns in when FOG occurs during daily life. The research team, which includes Amit Salomon and Leslie C. Kirsch, shows how an open machine-learning competition can be applied to a difficult clinical measurement problem.

Introduction
Freezing of gait (FOG), a debilitating symptom affecting 38–65% of Parkinson's disease (PD) patients, manifests as a sudden inability to initiate or continue walking. The lack of a widely applicable, objective FOG detection method hinders research and treatment. Current methods, such as self-report questionnaires and visual observation, are subjective and of limited reliability, leading to significant discrepancies in FOG prevalence estimates. FOG-provoking stress tests offer objective measures but do not fully reflect how often and how severely FOG occurs in daily life. Accurate, objective assessment is crucial for understanding FOG and advancing treatment. Wearable devices and advances in data science offer the potential for automatic FOG detection using inertial sensors and machine learning. Previous studies of automatic detection have shown promising results, but they are limited by small sample sizes, incomplete reporting of performance metrics, and inconsistent results in unsupervised, daily-living settings. To overcome these limitations, this research focused on developing a reliable, cost-effective, and widely applicable automatic FOG detection method using a single inertial measurement unit (IMU).
Literature Review
Several studies have explored automatic FOG detection using wearable sensors and various machine learning techniques. Initially, simple threshold-based approaches were used, followed by traditional machine learning algorithms such as support vector machines and random forests. More recently, deep learning models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, and autoencoders, have gained traction. However, challenges persist, including limited generalizability, overfitting, and limited precision. Many studies validated their methods on small samples and reported little about crucial metrics such as precision and recall. These metrics matter especially in FOG detection because of its inherent class imbalance: the positive class, FOG episodes, is significantly underrepresented. The use of multiple sensors further complicates real-world applicability because it reduces adaptability and compliance. Few studies have attempted to automatically detect FOG in unsupervised, daily-living settings, and the results have been inconsistent.
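The point about class imbalance can be made concrete with a small sketch. The numbers below are a made-up toy example, not data from the study, and the function names are hypothetical. It shows that a detector which never predicts FOG can still score high accuracy while its precision and recall are zero.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = FOG)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy data (assumption): 100 windows, only 5 of which contain FOG.
y_true = [1] * 5 + [0] * 95
never_fog = [0] * 100  # a trivial detector that never flags FOG

print(summarize(y_true, never_fog))
```

Here the trivial detector is correct on 95 of 100 windows yet finds none of the FOG windows, which is why accuracy alone says little about a FOG detector.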
Methodology
To address these challenges, a three-month machine-learning contest, offering a $100,000 prize, was organized. The contest aimed to foster the development of advanced, automated machine-learning algorithms for FOG detection using data from a single lower-back sensor. The competition ran on an open-access platform and provided a labeled dataset comprising 3D acceleration data annotated by experts from video recordings. The contest drew 10,133 registrations, and 1,379 teams from 83 countries submitted a total of 24,862 solutions. The submissions were ranked by mean average precision (mAP) across three FOG classes (start hesitation, FOG during turns, FOG during walking), a metric less sensitive to class imbalance. A public test set allowed for interim evaluation during the competition, while a hidden private test set was used for the final ranking. Post-competition analyses included precision-recall and ROC curves, F1 scores, accuracy, precision, recall, and specificity, as well as intraclass correlation coefficients (ICCs) to assess agreement between model estimates and gold-standard measures from video annotations. A subsequent exploratory analysis applied the winning models to unsupervised, 24/7 daily-living data collected over seven days from 45 individuals with PD and FOG and 19 without. Statistical analyses included Mann-Whitney U tests with Benjamini, Krieger, and Yekutieli correction for multiple comparisons and Friedman's test for repeated measurements.
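As an illustration of the ranking metric, the following is a minimal sketch of mean average precision over per-class predictions. It uses the common non-interpolated definition of average precision (the mean of precision at the rank of each true positive). The contest's exact implementation is not described here, so this definition, the function names, the class keys, and the toy scores are all assumptions.

```python
def average_precision(labels, scores):
    """Non-interpolated average precision for one class.

    labels: list of 0/1 ground-truth values (1 = this FOG class is present)
    scores: list of model confidence scores, higher means more likely positive
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")
    # Rank samples from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / n_pos


def mean_average_precision(labels_by_class, scores_by_class):
    """Unweighted mean of per-class average precision."""
    aps = [
        average_precision(labels_by_class[name], scores_by_class[name])
        for name in labels_by_class
    ]
    return sum(aps) / len(aps)


# Toy example (hypothetical values) for the three FOG classes.
labels = {
    "start_hesitation": [0, 0, 1, 0],
    "turn": [1, 0, 1, 0],
    "walking": [1, 1, 0, 0],
}
scores = {
    "start_hesitation": [0.2, 0.6, 0.5, 0.1],
    "turn": [0.9, 0.8, 0.7, 0.1],
    "walking": [0.9, 0.8, 0.3, 0.1],
}
print(mean_average_precision(labels, scores))
```

Because each class contributes equally to the mean, a rare class such as start hesitation weighs as much as a common one, which is one reason the metric is less dominated by the majority class than plain accuracy.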
Key Findings
The top five models achieved mAP scores of 0.514, 0.451, 0.436, 0.417, and 0.390, respectively, on the private test set. Precision-recall curves showed generally high AUC values (above 0.9) for overall FOG detection, with variations across different FOG classes. The models exhibited good accuracy (0.88–0.92) and high specificity (>0.9) for all FOG classes, though recall was lower (0.72–0.79). ICCs between model estimates and gold-standard measures showed excellent agreement for percentage of time frozen (%TF) and total FOG duration (ICCs > 0.87), indicating that the models accurately captured both the share of time spent freezing and the overall freezing duration. However, agreement on the number of FOG episodes was less robust. After subjects who unintentionally appeared in both the training and testing datasets were removed, the key performance measures changed minimally, indicating good generalizability. The exploratory analysis of daily-living data revealed significant time-of-day effects in the freezer group, with two distinct peaks in %TF around 7:00 a.m. and 10:00 p.m. Daily %TF was consistent across days in both freezers and non-freezers. Turning FOG was the most common class in both daily-living and FOG-provoking test data, while start hesitation FOG was the least common. %TF differed significantly among severe freezers, moderate freezers, and non-freezers.
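The three outcome measures discussed above (%TF, total FOG duration, and number of episodes) can all be derived from a per-sample binary prediction stream. The sketch below shows one way to do so. The function name, the sampling-rate parameter `fs_hz`, and the toy inputs are assumptions for illustration and are not taken from the study.

```python
def fog_summary(pred, fs_hz):
    """Summarize a per-sample binary FOG prediction stream (1 = frozen).

    pred: list of 0/1 predictions, one per sensor sample
    fs_hz: sampling rate in Hz (hypothetical parameter)
    """
    frozen_samples = 0
    episodes = 0
    prev = 0
    for p in pred:
        if p == 1:
            frozen_samples += 1
            if prev == 0:
                episodes += 1  # a 0 -> 1 transition starts a new episode
        prev = p
    pct_tf = 100.0 * frozen_samples / len(pred) if pred else 0.0
    return {
        "pct_time_frozen": pct_tf,
        "total_fog_duration_s": frozen_samples / fs_hz,
        "n_episodes": episodes,
    }


# Toy stream sampled at 1 Hz: one long freeze, as annotated.
reference = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# Same stream with a single sample misclassified in the middle of the freeze.
predicted = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(fog_summary(reference, fs_hz=1))
print(fog_summary(predicted, fs_hz=1))
```

In this toy case one misclassified sample changes %TF only from 50% to 40% but doubles the episode count from 1 to 2. This is one plausible reason, offered here as an illustration rather than as the study's explanation, why episode counts could agree less well with video annotation than duration-based measures.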
Discussion
The results demonstrate that machine learning can significantly improve the accuracy and efficiency of FOG detection compared to traditional methods. The winning models achieved comparable or superior performance to previous studies while requiring only a single lower-back IMU, making them more practical for real-world applications. The high accuracy and precision, particularly for the 'all FOG' classification and turning FOG, suggest that these models could replace or augment manual video annotation by experts. The identified time-of-day effects are a novel observation that warrants further investigation to understand their underlying mechanisms and implications. The use of ensemble-based architectures, particularly those incorporating GRU, LSTM, and transformer networks, contributed to the superior performance and generalizability of the winning models. The finding that longer inference input sizes may improve classification suggests that subtle motor alterations precede FOG by a longer duration than previously considered. Limitations include the lack of ground truth for the daily-living data, so those findings should be viewed as preliminary, and the remaining difficulty of accurately detecting start hesitation FOG.
Conclusion
This machine learning competition significantly advanced the state of the art in automated FOG detection. The winning models demonstrated high accuracy and agreement with gold-standard measures, paving the way for practical, real-world applications. The discovery of time-of-day effects in FOG occurrence is a notable finding that requires further investigation. Future research should focus on improving the detection of specific FOG classes, particularly start hesitation FOG, and on validating these models against ground-truth daily-living data. The competition platform and dataset serve as valuable resources for ongoing research in this field.
Limitations
The daily-living analysis was limited by the lack of ground truth data, so its results should be considered preliminary. There was some unintentional overlap between the training and private test sets, though sensitivity analyses showed minimal impact on overall findings. The models showed varying performance across different FOG classes, with start hesitation being particularly challenging. The competition data are skewed towards a particular FOG-provoking protocol, which might not fully capture all real-world FOG variations.