
Automating General Movements Assessment with quantitative deep learning to facilitate early screening of cerebral palsy
Q. Gao, S. Yao, et al.
This study by Qiang Gao, Siqiong Yao, Yuan Tian, Chuncao Zhang, Tingting Zhao, Dan Wu, Guangjun Yu, and Hui Lu presents a deep learning-based motor assessment model (MAM) that automates the General Movements Assessment for early screening of cerebral palsy, achieving high accuracy in internal and external validation.
Introduction
The Prechtl General Movements Assessment (GMA) evaluates general movements (GMs) in infants to assess nervous system integrity and detect motor abnormalities. At 9–20 weeks corrected age, infants exhibit fidgety movements (FMs), whose absence is a strong predictor of cerebral palsy (CP), a prevalent childhood motor disability. Although GMA achieves high sensitivity (98%) and specificity (94%) for CP prediction, its reliance on trained, certified experts limits widespread early screening. Advances in artificial intelligence have enabled sensor- and video-based approaches to analyze infant movements, with deep learning and pose estimation improving feature extraction in noisy environments. However, existing automated GMA methods are predominantly qualitative, providing only final classifications without quantitative metrics or interpretability, and often emphasize specific body parts contrary to GMA’s gestalt principle. This study introduces a quantitative, interpretable deep learning model, MAM, leveraging 3D pose estimation, distance-based motion representation, and multi-instance learning (MIL) to predict GMs at the FMs stage, identify FMs clips within videos, and quantify GMA via FMs frequency to facilitate early CP screening.
Literature Review
Prior work assessing infant neurodevelopment used wearable sensors or motion capture to collect precise kinematic data for predicting GMs and CP with machine learning. While accurate, such devices can alter natural infant movement and burden clinical workflows. Non-intrusive video-based methods (e.g., background subtraction, optical flow) capture motion changes across frames; with deep learning and pose estimation, these approaches better handle background interference and extract spatiotemporal features from joint coordinates. Automated GMA at the FMs stage has emphasized qualitative spatiotemporal models yielding categorical outcomes but lacking objective quantitative measures. Attempts at quantitative GMA have analyzed movement direction, magnitude, speed, and acceleration across body parts, yet often underperform qualitative methods, violate GMA’s holistic (gestalt) principles by overemphasizing parts, and provide limited interpretability. Recent deep learning approaches have begun to address these challenges but have not reliably localized FMs within videos or delivered robust quantitative metrics aligned with clinical practice. The present work builds on these insights by using 3D pose estimation and a Transformer-based MIL framework to both classify and localize FMs and to introduce a clinically consistent quantitative measure (FMs frequency).
Methodology
Study design and cohorts: Three cohorts were assembled at Shanghai Children’s Hospital under IRB approval with parental consent. Cohort 1 (initial n=1204) and Cohort 2 (initial n=283) were used for internal cross-validation and external validation, respectively; Cohort 3 (n=298, drawn from infants excluded from Cohort 1) was used for pretraining. Inclusion followed GMA recording standards at 9–20 weeks corrected age; exclusion criteria were out-of-range age, missing basic characteristics, lack of a supine position, crying or not being awake, video duration under 1 min, and abnormal FMs. Abnormal FMs were rare and were excluded because of their limited predictive value. Normal FMs comprised continuous and intermittent FMs; the risk group comprised sporadic and absent FMs. After filtering, Cohort 1 retained 906 infants (691 normal; 215 risk) and Cohort 2 retained 221 infants (173 normal; 48 risk). Cohort 3 retained 243 infants for pretraining after 55 were excluded; its videos were annotated into 1,586 FMs and 4,100 non-FMs clips.
Video characteristics: Cohort 1: median duration 25.5 s (range approx. 119–653 s), mostly 25 fps, resolutions mainly 1920×1080. Cohort 2: median duration 27.5 s (range approx. 181–556 s), 25 fps, resolutions mainly 1920×1080. Cohort 3: median duration 99 s (range approx. 168–409 s), mostly 25 fps, varied resolutions. Baseline characteristics (sex, gestational age, birth weight, corrected age) did not differ significantly between the normal and risk groups (p>0.01).
Annotation: Two certified GMA experts (>5 years of experience) independently labeled the videos; disagreements were resolved by consensus (inter-rater kappa=0.947). For FMs localization, videos were segmented into 9.6 s clips with a step of 0.8 s; clips in which the FMs proportion exceeded 0.5 were labeled FMs.
Model architecture (MAM): A multi-instance, multimodal Transformer-based framework with three branches: a Ref Branch, a Main Branch, and an Info Branch.
- 3D pose estimation and preprocessing: VideoPose3D was fine-tuned for supine infants; five manually annotated frames per Cohort 3 video were used to fine-tune the HRNet component within VideoPose3D. Videos were standardized to 25 fps, and the 3D coordinates of 17 joints were smoothed (moving-average window of 5) and normalized.
- Input construction: Motion is represented by distance matrices of Euclidean distances between pairs of joints across the spatial dimensions, stacked over time into T×C×V tensors that emphasize overall motion patterns and inter-joint coordination (a construction sketch follows this section).
- MIL definitions: Each full video is a “bag” that is split into fixed-length “instances” (clip length 240 frames, overlap 90 frames); see the splitting sketch below.
- Ref Branch: Pretrains the spatiotemporal Transformer and a clip-level classifier on the annotated FMs/non-FMs clips from Cohort 3, using cross-entropy and a Triplet loss (margin 0.4) to separate FMs from non-FMs embeddings.
- Main Branch: Processes whole videos, splits them into instances, infers instance-level FMs probabilities with the shared Transformer and classifier, and aggregates them through attention-based fusion to predict the bag-level normal-vs-risk probability. A Closeness loss drives instance embeddings toward the FMs/non-FMs centers learned in the Ref Branch, improving instance discrimination (a loss and fusion sketch follows below).
- Info Branch: Processes basic characteristics (sex, gestational age, birth weight, corrected age) through a fully connected network to predict the normal-vs-risk probability.
- Fusion: The Main and Info Branch outputs are combined for the final prediction; the total loss is the sum of the Triplet loss, the Closeness loss, and two cross-entropy losses.
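To make the distance-based motion representation concrete, here is a minimal NumPy sketch assuming a per-frame pairwise Euclidean distance matrix over the 17 smoothed 3D joints. The function names and the exact mapping of these matrices onto the paper's T×C×V tensor layout are assumptions, not the authors' implementation.

```python
import numpy as np

def moving_average(pose, window=5):
    """Temporal smoothing of 3D joint coordinates; pose has shape (T, V, 3)."""
    kernel = np.ones(window) / window
    # Smooth each joint/axis trajectory independently along the time axis.
    return np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), 0, pose)

def distance_matrices(pose):
    """Per-frame pairwise Euclidean distance matrices.

    pose: (T, V, 3) smoothed, normalized 3D joints (V = 17).
    Returns (T, V, V); how these are arranged into the paper's
    T x C x V tensor is an assumption of this sketch.
    """
    diff = pose[:, :, None, :] - pose[:, None, :, :]   # (T, V, V, 3)
    return np.linalg.norm(diff, axis=-1)               # (T, V, V)

# Example: a random 240-frame clip of 17 joints.
clip = np.random.randn(240, 17, 3).astype(np.float32)
dm = distance_matrices(moving_average(clip))
print(dm.shape)  # (240, 17, 17)
```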
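The bag-to-instance splitting used in the MIL setup, and the majority rule used to label annotated clips, can be illustrated as follows. This is a sketch with hypothetical helper names, assuming a stride of length minus overlap (240 − 90 = 150 frames) and a per-frame FMs mask derived from the expert annotations.

```python
import numpy as np

def split_into_instances(video_feats, length=240, overlap=90):
    """Split a bag (whole-video feature tensor, time-first) into fixed-length
    MIL instances of `length` frames that overlap by `overlap` frames."""
    step = length - overlap
    starts = range(0, max(len(video_feats) - length, 0) + 1, step)
    return np.stack([video_feats[s:s + length] for s in starts])

def label_clip(fms_mask, start, length=240):
    """Label a clip as FMs (1) when more than half of its frames fall inside
    expert-annotated FMs intervals; fms_mask is a per-frame 0/1 array."""
    window = fms_mask[start:start + length]
    return int(window.mean() > 0.5)

# Example: a 2,500-frame video (100 s at 25 fps) of 17x17 distance matrices.
feats = np.random.randn(2500, 17, 17).astype(np.float32)
instances = split_into_instances(feats)
print(instances.shape)  # (16, 240, 17, 17)
```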
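The losses and attention-based aggregation described for the Ref and Main Branches might look roughly like the PyTorch sketch below. Only the Triplet margin of 0.4 comes from the text; the mean-squared form of the Closeness loss, the use of instance pseudo-labels to pick a center, and the attention pooling head are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Ref Branch: Triplet loss separating FMs from non-FMs clip embeddings.
triplet = nn.TripletMarginLoss(margin=0.4)

def closeness_loss(instance_emb, centers, targets):
    """Pull instance embeddings toward the FMs / non-FMs centers learned in
    the Ref Branch (exact formulation assumed here).

    instance_emb: (N, D); centers: (2, D); targets: long tensor (N,) with
    0 = non-FMs, 1 = FMs (e.g., instance-level pseudo-labels).
    """
    return F.mse_loss(instance_emb, centers[targets])

class AttentionFusion(nn.Module):
    """Attention-based MIL pooling: weight instance embeddings and predict
    the bag-level normal-vs-risk logits."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, 2)

    def forward(self, instance_emb):                        # (N, D)
        w = torch.softmax(self.attn(instance_emb), dim=0)   # (N, 1) attention weights
        bag = (w * instance_emb).sum(dim=0)                 # (D,) bag embedding
        return self.head(bag)                               # bag-level logits
```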
Training: Batch size 64 (half pretraining clips; half instances from one normal and one risk bag), 300 epochs, SGD optimizer with an initial learning rate of 0.001 and a CyclicLR scheduler (a configuration sketch follows below). Evaluation: Internal cross-validation on Cohort 1 and external validation on Cohort 2; metrics included AUC, accuracy, sensitivity, specificity, PPV, and NPV. Statistical analysis: Two-tailed tests (alpha=0.01); chi-square tests for categorical variables, t-tests or Mann–Whitney tests for continuous variables, and Cohen’s kappa for agreement. Quantitative GMA: Defined as the FMs frequency, i.e., the proportion of FMs clips per video derived from MAM’s clip-level predictions (sketched below).
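The reported optimizer settings could be wired up as in the sketch below. The momentum, max_lr, cycle length, and the toy model are placeholder assumptions not given in the summary, and the loss is a stand-in for the combined Triplet, Closeness, and cross-entropy objective.

```python
import torch

# Placeholder model; the real MAM uses a spatiotemporal Transformer.
model = torch.nn.Linear(17 * 17, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)   # momentum assumed
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=0.001, max_lr=0.01, step_size_up=200)              # max_lr, cycle assumed

for epoch in range(300):
    # One sketch step per epoch; the real batches mix 32 pretraining clips
    # with instances from one normal and one risk bag.
    x = torch.randn(64, 17 * 17)
    y = torch.randint(0, 2, (64,))
    loss = torch.nn.functional.cross_entropy(model(x), y)  # stand-in for the combined loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```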
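The FMs frequency measure itself reduces to a simple proportion over clip-level predictions. A minimal sketch, assuming a 0.5 decision threshold on the clip probabilities:

```python
import numpy as np

def fms_frequency(clip_probs, threshold=0.5):
    """Quantitative GMA as defined above: the proportion of a video's clips
    that the model identifies as FMs (0.5 threshold assumed)."""
    clip_probs = np.asarray(clip_probs)
    return float((clip_probs > threshold).mean())

# Example: clip-level FMs probabilities for one video.
print(fms_frequency([0.9, 0.7, 0.2, 0.8, 0.1]))  # 0.6
```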
Key Findings
- Cohorts and data: After filtering, the internal dataset contained 906 infants (691 normal; 215 risk) and the external dataset 221 infants (173 normal; 48 risk). Cohort 3 provided 1,586 FMs and 4,100 non-FMs clips for pretraining.
- Internal cross-validation performance: AUC 0.934 ± 0.014; PPV 0.526 ± 0.031; NPV 0.900 ± 0.006.
- External validation performance: AUC 0.967 ± 0.005; accuracy 0.934 ± 0.003; sensitivity 0.925 ± 0.004; specificity 0.936 ± 0.009; PPV 0.802 ± 0.022; NPV 0.978 ± 0.008.
- Comparison with state-of-the-art methods (same data and tuning): MAM outperformed EML, STAM, and WO-GMA on AUC, accuracy, sensitivity, specificity, and PPV; all methods had high NPV, with MAM the best. Representative reported metrics include AUC up to 0.973 ± 0.007, accuracy 0.938 ± 0.002, sensitivity 0.939 ± 0.021, specificity 0.934 ± 0.014, PPV 0.826 ± 0.031, and NPV 0.968 ± 0.006.
- Ablation (Info Branch): Removing the Info Branch (MAM.w/o.info) slightly decreased several metrics, but the differences were not statistically significant (p>0.01). SHAP analysis indicated that video features dominate the predictions, while basic characteristics made small, directionally unclear contributions.
- Input and pose estimation choices: Distance-matrix inputs outperformed coordinates, velocities, accelerations, and their combinations; 3D pose estimation outperformed 2D pose under the same input construction.
- FMs identification and agreement: MAM localized FMs clips within videos with substantial agreement with GMA experts (median kappa approximately 0.601–0.620). Removing the Closeness loss and/or the Ref Branch reduced concordance (kappa ~0.224, falling to ~0.064 without both).
- Quantitative GMA (FMs frequency): In external validation, the median FMs frequency was 0.553 (Q1–Q3: 0.412–0.706) in the normal group versus 0.135 (Q1–Q3: 0.082–0.215) in the risk group (p<0.01). Classifying normal vs risk by FMs frequency achieved an AUC of 0.956 (95% CI: 0.924–0.989).
- Assistance to GMA beginners: With MAM assistance, the diagnostic accuracy of three beginners improved from 0.846, 0.869, and 0.837 to 0.950, 0.973, and 0.959, respectively (average increase 0.112).
Discussion
MAM, a multimodal Transformer-based motor assessment model leveraging MIL, 3D pose estimation, and distance-matrix motion representations, achieved state-of-the-art performance in predicting GMs at the FMs stage and in localizing FMs within videos. By integrating a Ref Branch with Triplet loss and a Closeness loss in the Main Branch, MAM effectively aligns instance embeddings with FMs/non-FMs prototypes, enhancing clip-level identification. This localization supports clinical interpretability, aligning with GMA practice where perception of FMs is central and absence of FMs strongly predicts CP. Multimodal fusion with basic infant characteristics provided marginal incremental benefit, consistent with clinical assessment practices that largely rely on age-specific movement observation. Crucially, MAM enables quantitative GMA via FMs frequency, delivering objective, gestalt-consistent measurements that correlate with expert assessments and discriminate normal vs risk infants with high accuracy (AUC 0.956). The identified high-probability FMs clips also serve as educational cues, significantly improving the accuracy of GMA beginners. Collectively, these results suggest that MAM can enhance early CP screening workflows by providing accurate, interpretable, and quantifiable assessments from routine videos without intrusive sensors.
Conclusion
This study presents MAM, a multi-instance, multimodal Transformer framework that automates GMA at the FMs stage using 3D pose-based distance representations and interpretable FMs localization. MAM surpasses existing methods on internal and external validations, quantitatively distinguishes normal from risk infants via FMs frequency, and aids training of less experienced raters. By adhering to GMA’s gestalt principles and offering transparent, quantitative outputs, MAM has potential to streamline universal early CP screening and advance video-based quantitative diagnostics.