Deep representation learning identifies associations between physical activity and sleep patterns during pregnancy and prematurity

N. G. Ravindra, C. Espinosa, et al.

Neal G. Ravindra and colleagues study preterm birth (PTB) using wearable device data and deep learning. The study links deviations in sleep and activity patterns during pregnancy to PTB and points to possible low-cost interventions, especially in underserved regions.

Introduction
The study addresses an urgent public health problem: preterm birth (PTB) is the largest cause of under-five mortality. Many predictive models for PTB exist, but effective, scalable, and low-risk interventions remain scarce. Wearable devices can objectively capture physical activity and sleep, behaviors linked to stress, inflammation, and pregnancy outcomes. Integrating wearables into clinical practice is nonetheless hindered by the continuous, long, low-dimensional data they produce, which are difficult to analyze, and by the inaccuracy of self-reported questionnaires. Prior work indicates pregnancy-related circadian disruption and suggests a relationship between sleep/activity patterns and adverse outcomes, but the extent and dynamics of these changes during pregnancy are insufficiently characterized. The authors hypothesize that deep learning on wearable-derived actigraphy can model pregnancy progression (using gestational age, GA, as a surrogate) and that deviations from a learned normative trajectory of sleep and activity are associated with increased PTB risk. The aim is to develop a general analytical pipeline that monitors pregnancy from wearables, interprets the underlying behavioral signals, and links them to clinical outcomes, enabling low-cost interventions that are particularly relevant for low- and middle-income countries (LMICs).
Literature Review
Prior research has focused on PTB prediction using clinical, biological, genetic, sociodemographic, and environmental data; however, interventions are limited. Wearables have been shown to estimate sleep quality and stress-related variables but suffer from analytical challenges and have not been extensively applied to pregnancy monitoring. Existing actigraphy analysis commonly uses non-parametric summary metrics (e.g., interdaily stability, intradaily variability, relative amplitude) or proprietary toolboxes, which provide limited resolution and often fail to differentiate PTB risk groups robustly. Evidence indicates pregnancy induces circadian changes and that sleep/activity disruptions may affect maternal–fetal health and timing of birth. Recent advances in time-series deep learning (e.g., ResNet/InceptionTime) and feature attribution provide opportunities to move beyond black-box models to interpretable pipelines suitable for clinical insights.
Methodology
Design and cohort: A cohort of 1083 pregnant individuals contributed 2305 one-week actigraphy segments, each aligned to a gestational age (GA) measurement established predominantly by last menstrual period (LMP) and ultrasound before 14 weeks. Electronic health record (EHR) metadata included demographics, comorbidities, pregnancy outcomes, and multiple survey instruments (e.g., PSQI, KPAS, Epworth Sleepiness Scale). Data were split by patient into training (658 patients; 1399 samples), validation (100; 216), and test (325; 690) sets. A sensitivity analysis stratified spontaneous vs. iatrogenic (induction indicated) births.
Data acquisition and preprocessing: Actigraphy (MotionWatch8) captured accelerometer counts and ambient light at 1-min resolution. Raw time series were log10(x+1) transformed, and each sequence was standardized to one full week (10,080 minutes) after discarding the first recording day to avoid padding misalignment. Automatic sleep–wake annotation used the Oakley algorithm (threshold θ=80) via pyActigraphy. Quality filters excluded sequences with missing light, minute gaps, short recordings, or metadata mismatches. Non-parametric actigraphy metrics (IS, IV, RA, etc.) were computed for comparison.
Model architecture (series2signal): The authors adapted an InceptionTime/ResNet-inspired 1D CNN for time-series regression to predict GA from multivariate inputs (activity, light). Modifications included replacing batch normalization with layer normalization within the inception blocks and adding a non-linear prediction head that yields both a scalar GA prediction and a fixed-length latent embedding (128-D) summarizing the week. The pipeline integrates: (1) supervised regression to GA; (2) a data augmentation framework; (3) post-hoc model error analysis; (4) gradient-based feature attribution; (5) a model utility module leveraging learned embeddings for ancillary tasks.
Training and augmentation: Optimization used Adam with an MSE loss plus small L1 and weight decay penalties (λ1≈1e−6; λ2≈0.001).
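As a rough illustration of the penalized objective just described, the sketch below adds an L1 term and a squared-L2 term to the mean squared error. Treating weight decay as a squared-L2 penalty on the weights is an assumption (the summary does not say how it was implemented), and the function name `penalized_mse` and the flat `weights` list are hypothetical stand-ins for a real model's parameters.

```python
def penalized_mse(preds, targets, weights, l1=1e-6, l2=1e-3):
    """MSE plus L1 and squared-L2 penalties on a flat list of weights.

    Assumption: weight decay is modeled here as l2 * sum(w^2); a real
    optimizer may apply it differently (e.g., inside the update step).
    """
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return mse + l1_term + l2_term


# Toy call: predicted vs. actual GA (weeks) and three hypothetical weights.
loss = penalized_mse([20.0, 31.0], [22.0, 30.0], [0.5, -1.0, 2.0])
```

With penalties this small, the MSE term dominates the loss; the penalties mainly act as a mild regularizer against overfitting.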
A learning rate scheduler reduced the learning rate on plateau (patience 10). To mitigate overfitting with limited data, a novel augmentation scheme selected one transform per mini-batch or per epoch from jittering (σ=0.03), scaling (α~N(1,0.2)), window warping (10% window, 2x time warp), and window slicing (crop 90% then resize); rotation was excluded because it degraded performance. At inference, outputs were averaged across augmentation queries. Comparative baselines included classical ML (kNN, Random Forest, Gradient Boosting, TimeSeriesForest) and DL (GRU, VGG-1D, InceptionTime) models with standardized evaluation.
Error analysis: After training, predictions were compared to actual GA to compute the error (predicted GA − actual GA). Samples were grouped into three modes: higher-than-actual GA, lower-than-actual GA, and low-error (|error|<10 weeks; threshold chosen to balance group sizes). An automated tabular association function (handling mixed continuous/categorical data) and a correlation-network approach identified metadata variables that differed across error groups. Permutation tests (≥1000 iterations) compared observed vs. expected prevalence within error modes to assess robustness and generalization to the test set.
Interpretability: Gradient-based feature attribution (integrated gradients; absolute scores) quantified timepoint-level importance, summarized over sleep vs. wake and time-of-day/week strata. Statistical tests (Mann–Whitney U, Spearman) assessed differences in feature importance and error across groups (error modes; PTB+/PTB−; time segments). The ability of importance and of error to discriminate sleep from wake was evaluated via logistic regression AU-ROC.
Embeddings and ancillary tasks: The 128-D embeddings were used with lightweight kNN classifiers/regressors to predict a broad set of metadata variables, and the results were compared against using raw preprocessed actigraphy (with DTW and unsupervised clustering baselines).
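To make the embedding-based ancillary prediction concrete, here is a minimal k-nearest-neighbor classifier over fixed-length vectors. It is only a sketch: the Euclidean distance, the value of `k`, the majority vote, and the name `knn_predict` are assumptions, and the toy vectors are 2-D placeholders rather than real 128-D embeddings.

```python
import math
from collections import Counter


def knn_predict(train_embeddings, train_labels, query, k=3):
    """Return the majority label among the k nearest training embeddings."""
    # Rank training samples by Euclidean distance to the query vector.
    order = sorted(
        range(len(train_embeddings)),
        key=lambda i: math.dist(train_embeddings[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D "embeddings" with a hypothetical binary metadata label.
embeddings = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["no", "no", "no", "yes", "yes", "yes"]
prediction = knn_predict(embeddings, labels, [0.2, 0.1])
```

The same function can be pointed at raw (flattened) actigraphy instead of embeddings, which is the kind of comparison the summary describes.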
UMAP visualizations and graph-based community detection were used to assess phenotypic clustering and semantic enrichment.
Evaluation metrics: Primary regression metrics were MAE and Spearman’s ρ. Classification tasks used AU-ROC/AU-PRC (with adjustments for class imbalance); clustering was evaluated via semantic enrichment. Multiple testing corrections (Bonferroni) were applied where appropriate; non-parametric tests included Mann–Whitney U, Fisher’s exact, Kruskal–Wallis, χ², and permutation tests.
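The four augmentations named in the methodology (jittering, scaling, window warping, window slicing) can be sketched for a single-channel series as follows. This is an illustrative reading, not the authors' code: linear interpolation for resizing, treating 0.2 as a standard deviation, choosing one transform per call, and all function names are assumptions.

```python
import random


def resample(series, new_len):
    """Linearly interpolate a series to new_len points."""
    if new_len == 1:
        return [series[0]]
    step = (len(series) - 1) / (new_len - 1)
    out = []
    for j in range(new_len):
        pos = j * step
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def jitter(series, rng, sigma=0.03):
    """Add independent Gaussian noise to every time point."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def scaling(series, rng, sigma=0.2):
    """Multiply the whole series by one factor drawn around 1."""
    alpha = rng.gauss(1.0, sigma)
    return [alpha * x for x in series]


def window_slice(series, rng, keep=0.9):
    """Crop a random contiguous 90% window, then resize to the original length."""
    n = len(series)
    width = max(2, int(n * keep))
    start = rng.randint(0, n - width)
    return resample(series[start:start + width], n)


def window_warp(series, rng, window=0.1, factor=2.0):
    """Stretch a random 10% window by 2x, then resize to the original length."""
    n = len(series)
    width = max(2, int(n * window))
    start = rng.randint(0, n - width)
    stretched = resample(series[start:start + width], int(width * factor))
    return resample(series[:start] + stretched + series[start + width:], n)


AUGMENTATIONS = [jitter, scaling, window_slice, window_warp]


def augment(series, rng):
    """Apply one randomly chosen transform (assumed selection rule)."""
    return rng.choice(AUGMENTATIONS)(series, rng)


rng = random.Random(0)
week = [float(i % 10) for i in range(100)]  # short stand-in for 10,080 minutes
augmented = augment(week, rng)
```

Every transform returns a series of the original length, so augmented samples can be fed to a model that expects a fixed one-week input.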
Key Findings
- Model performance: series2signal achieved the lowest mean absolute error among the compared models (MAE ≈ 7.52–7.60 weeks), with Spearman’s ρ ≈ 0.45 (P < 0.001) on the held-out test set, outperforming seven ML/DL baselines (e.g., a 1.36-week MAE improvement over InceptionTime; P = 7.39e−13). TimeSeriesForest, while competitive, had lower correlation (ρ ≈ 0.31 vs 0.40–0.45 for series2signal).
- Standard actigraphy metrics and surveys: Non-parametric metrics (IS, IV, RA) showed minimal or no significant differences between PTB+ and PTB− and weak correlations with GA, underscoring the need for learned representations.
- Error modes and PTB risk: Deviations between model-predicted GA and actual GA were strongly associated with PTB.
  • Lower-than-actual GA: 0.52× the expected prevalence of PTB (P = 1.01e−67, permutation test).
  • Higher-than-actual GA: 1.44× the expected prevalence of PTB (P = 2.82e−39, permutation test).
  • On the test set, similar enrichment/depletion persisted (e.g., ≈1.5× enrichment and ≈2.0× depletion; P < 0.001).
- Circadian rhythmicity link: Model error correlated negatively with interdaily stability (IS), indicating that the model assigns a more advanced GA when daily rhythms are less precise (P = 0.043 as reported in the abstract, with significant correlations also found in the detailed analyses).
- Sleep-specific importance: Feature attribution emphasized sleep periods more when the model predicted higher-than-actual GA than when it predicted lower-than-actual GA (P = 1.01e−21). Importance scores discriminated sleep from wake far better than error did (AU-ROC ≈ 0.94 using importance vs ≈ 0.47–0.53 using error).
- Consistency across delivery types: Trends held for spontaneous and iatrogenic births; enrichment of PTB in higher-than-actual error modes and depletion in lower-than-actual modes remained significant by permutation testing.
- Embedding utility: Learned 128-D embeddings modestly but significantly improved predictability for ancillary metadata tasks versus raw actigraphy (average predictability 0.46 vs 0.45; P = 0.01), especially for social determinants, sleep quality, comorbidities, and stress; raw data favored explicit activity-related targets. Embedding-based clustering revealed semantically meaningful groups (e.g., depression-enriched, comorbidity-enriched, and healthier clusters).
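The error-mode findings rest on two steps: assigning each sample to a mode from its signed error, and comparing PTB prevalence within a mode against what would be expected. The sketch below is one plausible reading under stated assumptions: "expected prevalence" is taken to be the overall cohort prevalence, the permutation test shuffles PTB labels and is two-sided around a ratio of 1, and all function names and toy data are hypothetical.

```python
import random


def error_mode(predicted_ga, actual_ga, threshold=10.0):
    """Assign a sample to an error mode from its signed error (weeks)."""
    error = predicted_ga - actual_ga
    if abs(error) < threshold:
        return "low-error"
    return "higher-than-actual" if error > 0 else "lower-than-actual"


def enrichment(modes, ptb, mode):
    """PTB prevalence within `mode` divided by overall prevalence.

    Assumes the mode is non-empty and at least one PTB case exists.
    """
    in_mode = [p for m, p in zip(modes, ptb) if m == mode]
    return (sum(in_mode) / len(in_mode)) / (sum(ptb) / len(ptb))


def permutation_p_value(modes, ptb, mode, n_iter=1000, seed=0):
    """Share of label shuffles whose ratio is at least as far from 1."""
    rng = random.Random(seed)
    observed = enrichment(modes, ptb, mode)
    shuffled = list(ptb)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        null = enrichment(modes, shuffled, mode)
        if abs(null - 1.0) >= abs(observed - 1.0):
            extreme += 1
    return (extreme + 1) / (n_iter + 1)


# Made-up data: 12 samples, PTB coded as 1, non-PTB as 0.
modes = ["higher-than-actual"] * 4 + ["low-error"] * 4 + ["lower-than-actual"] * 4
ptb = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ratio = enrichment(modes, ptb, "higher-than-actual")
p_value = permutation_p_value(modes, ptb, "higher-than-actual")
```

A ratio above 1 corresponds to enrichment of PTB in that mode and a ratio below 1 to depletion, which is how the 1.44× and 0.52× figures above can be read.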
Discussion
The findings support the hypothesis that a deep learning model trained solely on one week of wearable-derived actigraphy can monitor pregnancy progression and that deviations from the learned behavioral “clock” of pregnancy relate to adverse outcomes, particularly PTB. The model’s error relative to actual GA operationalizes behavioral deviations: overestimation of GA signals disrupted sleep/activity patterns and higher PTB prevalence, whereas underestimation aligns with healthier patterns and lower PTB prevalence. The negative correlation between model error and interdaily stability, alongside higher attribution to sleep during overestimation, links circadian disruption—especially during sleep—to elevated risk. This framework thus provides an interpretable, behaviorally grounded signal from inexpensive, passively collected data. Clinically, combining prediction with interpretability can guide decision support, enabling targeted counseling and behavioral interventions (e.g., sleep hygiene, activity regularity) and prioritizing individuals for closer monitoring. The learned embeddings further enable rapid phenotyping and ancillary predictions relevant to maternal health, suggesting broader utility for wearables in obstetrics.
Conclusion
This work introduces series2signal, a deep learning and interpretability pipeline that models gestational age from wearable actigraphy and uses deviations between predicted and actual GA as an actionable indicator of prematurity risk. The approach outperforms multiple ML/DL baselines, links error modes to PTB prevalence, highlights sleep-period disruptions as key drivers, and provides meaningful embeddings for ancillary tasks and patient phenotyping. The study underscores the potential for scalable, low-cost monitoring and behavior-focused interventions to mitigate PTB risk, particularly in LMICs. Future directions include randomized trials testing sleep/activity interventions, more sophisticated sleep detection and disentangling sleep vs. activity effects, larger and more diverse cohorts for external validation, improved augmentation (including latent-space approaches), integration of non-pregnant data to enhance later-GA accuracy, and coordinated biospecimen collection to connect behavioral signals with biological mechanisms.
Limitations
- Dataset size and generalizability: Although substantial for wearables, the cohort (N=1083; 2305 samples) is modest for obstetrics, potentially limiting generalizability and external transportability (particularly outside the US). The model is best suited to comparing pregnancies within the cohort.
- Sleep detection simplicity: Sleep–wake annotation used a rule-based algorithm; more advanced methods may capture subtler sleep disruptions affecting PTB risk.
- Behavioral disentanglement: The GA “clock” combines sleep and activity effects; explicit decomposition was not performed.
- Data quality and span: Analyses required full-week recordings; measurements with device non-wear were excluded, possibly biasing samples. The pipeline currently assumes 8 days of recording to construct a clean 7-day window.
- Augmentation and modeling scope: Augmentations operated in the time/frequency domain; latent space augmentations may further improve generalization. Incorporating non-pregnant data as a source for augmentation may improve accuracy at later GA.
- External validation and causality: Findings are associative. Randomized controlled trials and targeted biological sampling are needed to establish causation and biological underpinnings.
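The eight-day requirement noted under data quality and span follows from the windowing step in the methodology: the first recording day is discarded and the next 10,080 minutes are kept. A minimal sketch, assuming the week is taken immediately after the first day and that too-short recordings are simply rejected (the function name and return convention are hypothetical):

```python
import math

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY  # 10,080 one-minute epochs


def preprocess_recording(counts):
    """Drop the first day, keep one full week, apply log10(x + 1).

    Returns None when the recording is shorter than eight days.
    """
    if len(counts) < MINUTES_PER_DAY + MINUTES_PER_WEEK:
        return None
    week = counts[MINUTES_PER_DAY:MINUTES_PER_DAY + MINUTES_PER_WEEK]
    return [math.log10(x + 1) for x in week]


# Toy recording: eight days of a constant count of 9 per minute.
window = preprocess_recording([9] * (8 * MINUTES_PER_DAY))
```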