
Machine learning for accurate estimation of fetal gestational age based on ultrasound images
L. H. Lee, E. Bradburn, et al.
This study, by authors including Lok Hin Lee and Elizabeth Bradburn, uses machine learning to estimate gestational age from the appearance of standard ultrasound images. The reported model is more accurate than biometry-based dating in late pregnancy, particularly beyond 32 weeks.
Introduction
The study addresses the challenge of accurately estimating gestational age (GA), particularly in settings where many women first present for antenatal care after 14 weeks’ gestation and where recall of the last menstrual period (LMP) is unreliable. Ultrasound biometry is accurate in the first trimester (via crown–rump length, CRL), but its accuracy declines in the second and third trimesters because it assumes average fetal size and does not account for increasing biological variability in size or for growth aberrations, that is, fetuses that are small or large for gestational age (SGA/LGA). The result is prediction intervals exceeding ±2 weeks after 32 weeks and systematic errors in growth-restricted or large fetuses. The authors hypothesize that machine learning applied to the appearance (not size) of standard ultrasound planes can capture gestational-age–related maturation features, enabling accurate GA estimation in the second and third trimesters without biometry or scale information. Using standardized, prospectively collected datasets with first-trimester dating as ground truth, they aim to train and validate such models across diverse populations.
Literature Review
Prior work in ultrasound and machine learning includes applications in image registration, classification, and regression. Existing automated methods that derive GA from biometry inherit the same limitations as clinical measurements. Some ML attempts have focused on a single standard plane (e.g., fetal head) or cine loops, limiting generalizability. Biometry-based dating methods (Hadlock; INTERGROWTH-21st) perform well early but degrade substantially beyond 32 weeks due to increased biological variability and growth pathologies. The present work explores multi-plane, appearance-only models to overcome size-related biases and improve late-pregnancy GA estimation.
Methodology
Study design: A machine-learning approach was developed to estimate GA from ultrasound image appearance alone using standard planes: head circumference (HC; axial at thalami), abdominal circumference (AC; axial), and femur length (FL; longitudinal). All calibration markings, scalebars, and measurement annotations were removed prior to modeling.
Datasets: Training, internal validation, and internal testing used the INTERGROWTH-21st Fetal Growth Longitudinal Study (FGLS; 2009–2014; eight sites across Brazil, China, India, Italy, Kenya, Oman, UK, USA), comprising low-risk singleton pregnancies with rigorous first-trimester dating (certain LMP corroborated by CRL, discrepancy ≤7 days). External validation used the INTERBIO-21st Fetal Study (2012–2019; six sites in Brazil, Kenya, Pakistan, South Africa, Thailand, UK), a higher-risk, heterogeneous cohort dated by CRL before 14 weeks. Data were longitudinal, with an even GA distribution across 13+0 to 42+0 weeks. FGLS provided 293,811 images from 4,233 pregnancies, split by fetus into 75% training (219,974 images), 15% validation (44,173 images), and 10% test (29,664 images). External validation used 94,832 images from 2,443 pregnancies.
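A per-fetus split like the one described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_fetus`, the record layout, and the toy identifiers are all hypothetical, and the sketch allocates fractions of fetuses, whereas the percentages reported in the study are accompanied by image counts.

```python
import random
from collections import defaultdict

def split_by_fetus(image_records, fractions=(0.75, 0.15, 0.10), seed=0):
    """Split (fetus_id, image_id) records so each fetus lands in exactly one subset."""
    by_fetus = defaultdict(list)
    for fetus_id, image_id in image_records:
        by_fetus[fetus_id].append(image_id)

    # Shuffle fetus identifiers, not images, so no fetus straddles two subsets.
    fetus_ids = sorted(by_fetus)
    random.Random(seed).shuffle(fetus_ids)

    n = len(fetus_ids)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    groups = {
        "train": fetus_ids[:n_train],
        "val": fetus_ids[n_train:n_train + n_val],
        "test": fetus_ids[n_train + n_val:],
    }
    return {name: {fid: by_fetus[fid] for fid in ids} for name, ids in groups.items()}

# Hypothetical toy data: 20 fetuses with 3 images each.
records = [(f"fetus{f:02d}", f"img{f:02d}_{i}") for f in range(20) for i in range(3)]
splits = split_by_fetus(records)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting on the fetus identifier matters because the data are longitudinal: images of the same fetus at different visits are correlated, so letting them appear in both training and test sets would inflate test accuracy.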
Preprocessing: All scale and measurement information was removed via template matching and bilinear inpainting of annotations. Images were resized to 224×224 pixels with bilinear interpolation and intensity-normalized to zero mean and unit standard deviation. A sonographer validated a subset of images (n=100 per plane) after resizing.
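The resizing and normalization steps can be illustrated with a small pure-Python sketch. Several details are assumptions rather than facts from the study: the half-pixel coordinate convention, the choice to normalize each image separately instead of using dataset-wide statistics, and the function names `resize_bilinear` and `normalize`. Annotation removal (template matching and inpainting) is not shown.

```python
import statistics

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D grayscale image stored as a list of rows."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        # Map the output pixel centre back into input coordinates, clamped to the image.
        y = min(max((r + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = min(max((c + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

def normalize(img):
    """Shift and scale intensities to zero mean and unit standard deviation."""
    flat = [v for row in img for v in row]
    mean = statistics.fmean(flat)
    sd = statistics.pstdev(flat) or 1.0  # guard against constant images
    return [[(v - mean) / sd for v in row] for row in img]

# Hypothetical 4x6 gradient image standing in for an ultrasound frame.
toy = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
prepared = normalize(resize_bilinear(toy, 224, 224))
print(len(prepared), len(prepared[0]))  # 224 224
```

Because every image is forced to the same 224×224 grid regardless of its original field of view, absolute size information is not preserved, which is consistent with the appearance-only design.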
Model architecture and training: A ResNet-50 backbone (a residual network with skip connections) was used. Single-plane models (HC-only, AC-only, FL-only) were pre-trained with Consistent Rank Logits (CORAL), an ordinal-regression formulation that preserves the ordering of gestational weeks while providing stable classification-style training (GA binned to integer weeks). A MultiPlane model concatenated the final-layer outputs of the three plane-specific networks (initialized with the pre-trained weights) and was fine-tuned using multiple planes from the same fetus, trained end-to-end as a regression with L1 loss to reduce sensitivity to outliers. Training, validation, and testing were performed on the FGLS split with strict per-fetus separation; the INTERBIO-21st dataset was reserved exclusively for external testing. Implementation used PyTorch 1.1.0 on an NVIDIA Tesla V100. Approximate training time was 48 h per single-plane model and 12 h for MultiPlane fine-tuning. Inference speed averaged 39 frames/s (95% CI: 35–43), sufficient for real-time use.
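To make the ordinal pre-training step more concrete, the sketch below shows the rank-consistent encoding generally associated with CORAL: an integer week label becomes a set of binary "is GA greater than week k?" targets, the loss is a sum of binary cross-entropies, and the predicted week is recovered by counting exceeded thresholds. The week range of 13 to 42, the function names, and the toy logits are assumptions for illustration; the study's actual implementation is not described at this level of detail.

```python
import math

def coral_targets(week, min_week, max_week):
    """Encode an integer GA week as K-1 binary 'is GA greater than threshold?' targets."""
    return [1 if week > threshold else 0 for threshold in range(min_week, max_week)]

def coral_loss(logits, targets):
    """Sum of binary cross-entropies over the K-1 threshold tasks."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss

def coral_decode(logits, min_week):
    """Predicted week = lowest week plus the number of thresholds judged exceeded."""
    return min_week + sum(1 for z in logits if z > 0.0)  # sigmoid(z) > 0.5 iff z > 0

# Assumed label range: integer weeks 13..42, i.e. 30 ranks and 29 thresholds.
MIN_WEEK, MAX_WEEK = 13, 42
targets = coral_targets(20, MIN_WEEK, MAX_WEEK)

# Toy logits: one shared score minus an increasing per-threshold offset, so the
# logits decrease monotonically and the binary decisions stay rank-consistent.
shared_score = 6.5
logits = [shared_score - k for k in range(MAX_WEEK - MIN_WEEK)]

print(sum(targets), coral_decode(logits, MIN_WEEK))  # 7 20
```

Unlike plain multi-class classification, this encoding penalizes a prediction of 21 weeks less than a prediction of 35 weeks when the truth is 20 weeks, since fewer threshold tasks are answered incorrectly.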
Evaluation: Performance was assessed by mean absolute error (MAE, days), 95% CI for MAE, R^2, and the proportions of estimates within ±7 and ±14 days of ground truth (CRL-based or certain LMP corroborated by CRL). Analyses were stratified by GA window (14–19+6, 20–25+6, 26–31+6, 32–37+6, ≥38 weeks), trimester (second, 18+0–27+6 weeks, vs. third, 28+0–42+0 weeks), cohort (overall, SGA, LGA), and external study site. Saliency mapping was used to verify that the models focused on fetal anatomical regions.
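The summary metrics named above can be computed as in the following sketch. The prediction and ground-truth values are invented toy numbers, and two choices are assumptions because the source does not specify them: the 95% CI uses a normal approximation, and R^2 is computed as the coefficient of determination.

```python
import math
import statistics

def evaluate(pred_days, true_days):
    """Return MAE, an approximate 95% CI, R^2, and the share within +/-7 and +/-14 days."""
    errors = [abs(p - t) for p, t in zip(pred_days, true_days)]
    n = len(errors)
    mae = statistics.fmean(errors)
    # Normal-approximation interval for the mean absolute error (an assumption).
    half_width = 1.96 * statistics.stdev(errors) / math.sqrt(n)
    mean_true = statistics.fmean(true_days)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred_days, true_days))
    ss_tot = sum((t - mean_true) ** 2 for t in true_days)
    return {
        "mae": mae,
        "mae_ci95": (mae - half_width, mae + half_width),
        "r2": 1.0 - ss_res / ss_tot,
        "within_7": sum(e <= 7 for e in errors) / n,
        "within_14": sum(e <= 14 for e in errors) / n,
    }

# Hypothetical ground-truth and predicted gestational ages, in days.
true_days = [100, 150, 200, 250, 280]
pred_days = [103, 148, 205, 240, 281]
metrics = evaluate(pred_days, true_days)
print(round(metrics["mae"], 2), metrics["within_7"], metrics["within_14"])  # 4.2 0.8 1.0
```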
Key Findings
- MultiPlane vs single-plane: MultiPlane outperformed HC-only, AC-only, and FL-only by approximately 1, 2, and 3 days in MAE, respectively, across gestation. Saliency maps indicated reliance on fetal anatomical features.
- Internal validation (INTERGROWTH-21st): Across 13+0–42+0 weeks, MultiPlane MAE = 3.5 days (95% CI 3.4–3.7); 90.7% within ±7 days; R^2 ≈ 0.99. Second trimester (18+0–27+6): MAE = 3.0 (2.9–3.2); 94.5% within ±7 days; R^2 = 0.96. Third trimester (28+0–42+0): MAE = 4.3 (4.1–4.5); 85.6% within ±7 days; R^2 = 0.94.
- External validation (INTERBIO-21st): Across 13+0–42+0 weeks, MultiPlane MAE = 4.1 (4.0–4.2); 85.1% within ±7 days; R^2 = 0.99. Second trimester: MAE = 3.7 (3.6–3.9); 88.1% within ±7 days; R^2 = 0.94. Third trimester: MAE = 5.0 (4.8–5.1); 78.1% within ±7 days; R^2 = 0.90.
- Comparison with biometry-based GA dating: Early gestation performance was comparable, but beyond 32 weeks MultiPlane was substantially more accurate. External validation examples: 32–37+6 weeks MAE (days): Hadlock 8.7 (8.4–8.9), INTERGROWTH-21st 7.9 (7.7–8.2), MultiPlane 5.2 (5.0–5.4); ≥38 weeks: Hadlock 13.9 (13.2–14.5), INTERGROWTH-21st 10.6 (9.7–11.4), MultiPlane 7.5 (6.4–8.7).
- Robustness in growth-restricted and large fetuses: In INTERBIO-21st, overall MAE for SGA = 3.7 (3.5–3.9) days and LGA = 4.7 (4.3–5.1) days (second and third trimesters). Beyond 32 weeks, MultiPlane MAE vs Hadlock: SGA 4.7 vs 7.4 days; LGA 5.3 vs 10.6 days.
- Site-wise external validation: MultiPlane performance was consistent across sites (overall MAE 4.1 days; site-specific MAEs within ±0.5 days of the pooled value), indicating robust generalization across diverse settings.
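The stratified figures in this list depend on assigning each scan to a GA window by its ground-truth age. A minimal sketch of that grouping is shown below, using the "weeks+days" notation from the text; the helper names and the toy values are hypothetical and do not reproduce any result from the study.

```python
import math
from collections import defaultdict
from statistics import fmean

def to_days(weeks, days=0):
    """Convert 'weeks+days' notation (e.g. 37+6) to days."""
    return 7 * weeks + days

# GA windows from the evaluation design: (label, first day, last day), inclusive.
WINDOWS = [
    ("14-19+6", to_days(14), to_days(19, 6)),
    ("20-25+6", to_days(20), to_days(25, 6)),
    ("26-31+6", to_days(26), to_days(31, 6)),
    ("32-37+6", to_days(32), to_days(37, 6)),
    (">=38", to_days(38), math.inf),
]

def window_of(ga_days):
    """Return the window label for a ground-truth GA, or None if before 14 weeks."""
    for label, first, last in WINDOWS:
        if first <= ga_days <= last:
            return label
    return None

def mae_by_window(pred_days, true_days):
    """Group absolute errors by the window of the ground-truth GA and average them."""
    grouped = defaultdict(list)
    for p, t in zip(pred_days, true_days):
        label = window_of(t)
        if label is not None:
            grouped[label].append(abs(p - t))
    return {label: fmean(errs) for label, errs in grouped.items()}

# Hypothetical scans: ground truth in weeks+days, predictions offset by a few days.
true_days = [to_days(15, 3), to_days(22), to_days(33, 1), to_days(36), to_days(39, 2)]
offsets = [2, -3, 4, -6, 8]
pred_days = [t + o for t, o in zip(true_days, offsets)]
print(mae_by_window(pred_days, true_days))
```

Binning on the ground-truth age rather than the predicted age keeps the window assignment independent of the model being evaluated, so the same scans are compared across methods.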
Discussion
The models demonstrate that GA can be accurately inferred from the appearance of standard ultrasound planes without any reliance on scale or biometry. This addresses the core limitation of late-pregnancy biometry (its assumption of average size and its sensitivity to growth aberrations) by leveraging maturation-related image features. MultiPlane improved accuracy particularly after 32 weeks, where clinical biometry performance degrades most. Validation on a large, prospectively collected external dataset, including higher-risk pregnancies and multiple international sites, supports generalizability and clinical relevance. The approach maintained accuracy in SGA and LGA fetuses, where biometry often misestimates GA, thus reducing gestation-dependent bias and potentially improving clinical decisions related to preterm birth assessment and growth surveillance. The system operates in real time and uses routinely acquired planes, which facilitates integration into standard workflows and could improve care in low- and middle-income countries (LMICs), where late presentation is common. Compared with prior ML methods limited to a single plane or requiring video and/or biometry, the MultiPlane, appearance-only model provides superior accuracy in later gestation.
Conclusion
This work establishes that multi-plane, appearance-based machine learning can accurately estimate gestational age in the second and third trimesters without biometry or scale information. The MultiPlane model consistently outperforms standard biometry-based methods in late pregnancy and remains accurate in SGA and LGA fetuses, offering a more reliable alternative for late presenters and resource-limited settings. Future work should assess domain adaptation across different ultrasound machine vendors, extend evaluation to specific fetal anomalies and broader pathologies, integrate automated standard plane detection to ease acquisition, and explore deployment on low-cost point-of-care devices for wider LMIC implementation.
Limitations
- Domain shift: All images were acquired on the same ultrasound machine type across sites, so performance on images from other vendors may require fine-tuning or domain adaptation.
- Dependency on standard planes: The method requires correctly acquired HC, AC, and FL planes, necessitating trained sonographers and potentially introducing operator variability.
- Fetal anomalies: Although abnormalities were not excluded, performance in specific anomalies (e.g., an achondroplasia case with an error of more than 3 weeks at 36 weeks) warrants further study.
- Implementation constraints in LMICs: Availability of trained personnel and barriers such as device cost, maintenance, and repair could affect deployment.
- Training data composition: Original training emphasized a low-risk cohort; although externally validated on a higher-risk population, further validation in varied clinical contexts and machine settings is desirable.