
Self-supervised learning of accelerometer data provides new insights for sleep and its association with mortality
H. Yuan, T. Plekhanova, et al.
This study by Hang Yuan, Tatiana Plekhanova, and colleagues evaluates how accurately wrist-worn accelerometers can classify sleep stages and examines how device-measured sleep duration and sleep efficiency relate to mortality risk. Short sleep duration was associated with higher mortality risk regardless of sleep efficiency.
Introduction
The study addresses the challenge of accurately measuring sleep stages (wake, NREM, REM) outside laboratory settings. While polysomnography is the gold standard, it is impractical at scale due to cost and complexity. Large-scale use of wrist-worn accelerometers is feasible, but existing algorithms are often proprietary, based on hand-crafted features, and validated in small cohorts, raising concerns about generalizability. The research aims to: (1) determine how accurately wrist-worn accelerometers can classify sleep stages by leveraging self-supervised deep learning trained against polysomnography, and (2) investigate the association of device-measured overnight sleep duration and sleep efficiency with all-cause mortality in a large population cohort (UK Biobank). This work is important for scalable sleep phenotyping in epidemiology and potential clinical applications.
Literature Review
The paper notes that prior non-PSG sleep classification techniques largely rely on heuristic, hand-crafted features (e.g., device angle) and are commonly proprietary with validation in small samples, limiting generalizability. Sleep diaries, frequently used in epidemiologic studies, correlate poorly with device-based measures. While actigraphy-based sleep monitoring is popular, existing algorithms may not fully exploit raw accelerometry signals. The authors position data-driven deep learning—especially self-supervised representation learning—as advantageous for extracting richer features from raw signals. They compare their approach against baseline models using hand-crafted features and reference prior work on activity/sleep phenotyping in large cohorts and device validation literature.
Methodology
Model development and labels: A deep recurrent neural network (SleepNet) was developed to classify 30-second windows of tri-axial wrist accelerometer data into sleep stages. Polysomnography (PSG) labels were scored per American Academy of Sleep Medicine (AASM) protocol into five categories: wake, REM, and NREM stages I–III. For evaluation, NREM stages were collapsed into a single NREM class, yielding a three-class task (wake/REM/NREM); a two-class task (sleep vs wake) was also assessed. In total, 1,157,913 30-second sleep windows were used to train the network.
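The label collapsing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the label strings ("N1", "N2", "N3", and so on) are hypothetical names for the five AASM categories.

```python
# Hypothetical label names for the five AASM categories (not from the paper).
FIVE_TO_THREE = {
    "wake": "wake",
    "REM": "REM",
    "N1": "NREM",
    "N2": "NREM",
    "N3": "NREM",
}


def to_three_class(label: str) -> str:
    """Collapse a five-category PSG label to wake/REM/NREM."""
    return FIVE_TO_THREE[label]


def to_two_class(label: str) -> str:
    """Collapse a five-category PSG label to sleep/wake."""
    return "wake" if to_three_class(label) == "wake" else "sleep"


# One label per 30-second window.
epochs = ["wake", "N1", "N2", "N3", "REM", "wake"]
three = [to_three_class(e) for e in epochs]
two = [to_two_class(e) for e in epochs]
```

Routing the two-class mapping through the three-class one means an unknown label raises a KeyError in both tasks instead of silently counting as sleep.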
Architecture and training: SleepNet comprises (1) a 1D convolutional ResNet-17 v2 feature extractor, (2) a bidirectional LSTM for temporal dependency modeling, and (3) fully connected layers for stage prediction. The model was pre-trained using multi-task self-supervised learning on approximately 700,000 person-days of unlabelled UK Biobank accelerometry to learn motion dynamics by discriminating spatiotemporal transformations. This pretraining aimed to improve downstream classification. The feature extractor was then fine-tuned with PSG-aligned accelerometer data.
Cohorts and validation: Internal validation used subject-wise five-fold cross-validation across Raine Gen2, Raine Gen1, and Newcastle cohorts. External validation used UK Biobank and Leicester cohorts. After preprocessing, 1,113 participants were included in internal validation and 53 participants in external validation. Baseline models using hand-crafted features were implemented for comparison.
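Subject-wise splitting means that all windows from one participant fall in the same fold, so no participant appears in both training and test data. The sketch below illustrates the idea under stated assumptions: the subject IDs, the seed, and the round-robin fold assignment are illustrative choices, not details taken from the paper.

```python
import random


def subject_wise_folds(subject_ids, n_folds=5, seed=0):
    """Assign each unique subject to exactly one fold (round-robin after shuffling)."""
    unique = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    return {subject: i % n_folds for i, subject in enumerate(unique)}


# Toy data: one subject ID per 30-second window (hypothetical IDs).
window_subjects = ["s1", "s1", "s2", "s3", "s3", "s3", "s4", "s5", "s6", "s6"]
fold_of = subject_wise_folds(window_subjects)

for k in range(5):
    test_idx = [i for i, s in enumerate(window_subjects) if fold_of[s] == k]
    train_idx = [i for i, s in enumerate(window_subjects) if fold_of[s] != k]
    train_subjects = {window_subjects[i] for i in train_idx}
    test_subjects = {window_subjects[i] for i in test_idx}
    # No participant contributes windows to both sides of the split.
    assert train_subjects.isdisjoint(test_subjects)
```

Splitting by window instead of by subject would let the model see other windows from the same night during training, which tends to inflate validation scores.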
Devices and preprocessing: Accelerometry was collected with ActiGraph GT3X, ActivPAL X32, and GENEActiv Original devices (previously shown to have high agreement in sedentary and sleep-related estimates). Data were processed using the Biobank Accelerometer Analysis Tool. Signals were resampled to 10 Hz, clipped to ±3 g, and segmented into consecutive 30-second windows. Non-wear was defined as stationary periods (x/y/z ≤ 1 g) exceeding 60 minutes. Recordings that could not be parsed, had unrealistically high values (>200 mg), or were poorly calibrated were excluded.
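The range limiting and windowing steps can be illustrated as follows. This is a sketch only: it assumes the signal has already been resampled to 10 Hz, treats the ±3 g limit as clipping, and drops a trailing partial window, none of which is confirmed by the summary. The actual processing was done with the Biobank Accelerometer Analysis Tool.

```python
SAMPLE_RATE_HZ = 10
WINDOW_SECONDS = 30
LIMIT_G = 3.0
SAMPLES_PER_WINDOW = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 300 samples per window


def clip_sample(sample, limit=LIMIT_G):
    """Limit each axis of an (x, y, z) sample to the range [-limit, +limit] g."""
    return tuple(max(-limit, min(limit, axis)) for axis in sample)


def segment(samples, window=SAMPLES_PER_WINDOW):
    """Clip, then split into consecutive non-overlapping windows.

    Assumption: a trailing partial window is discarded.
    """
    clipped = [clip_sample(s) for s in samples]
    n_full = len(clipped) // window
    return [clipped[i * window:(i + 1) * window] for i in range(n_full)]


# 65 seconds of synthetic data: one out-of-range spike, then a still device.
signal = [(5.2, -4.0, 1.0)] + [(0.0, 0.0, 1.0)] * 649
windows = segment(signal)
```

With 650 samples, two full 30-second windows are produced and the last 50 samples are dropped.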
Time-in-bed and sleep window detection in UK Biobank: Because UK Biobank lacked concurrent sleep diaries, a random forest model with Hidden Markov Model smoothing was trained to detect time-in-bed (TIB) using diary-annotated free-living data from 152 participants, achieving >90% precision and recall. In free-living application, TIB windows within 60 minutes of each other were merged. SleepNet was then applied to the longest TIB window in each noon-to-noon day to estimate overnight sleep duration.
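The merging and selection logic can be sketched as below. It assumes the time-in-bed detector has already produced (start, end) intervals for one noon-to-noon day, and it treats a gap of exactly 60 minutes as mergeable; both points are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

MERGE_GAP = timedelta(minutes=60)


def merge_windows(windows, max_gap=MERGE_GAP):
    """Merge (start, end) intervals whose gap is at most max_gap."""
    merged = []
    for start, end in sorted(windows):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def longest_window(windows):
    """Return the interval with the greatest duration."""
    return max(windows, key=lambda w: w[1] - w[0])


day = datetime(2020, 1, 1)
detected = [
    (day + timedelta(hours=14), day + timedelta(hours=14, minutes=40)),  # nap, 14:00 to 14:40
    (day + timedelta(hours=22, minutes=30), day + timedelta(hours=26)),  # 22:30 to 02:00
    (day + timedelta(hours=26, minutes=20), day + timedelta(hours=31)),  # 02:20 to 07:00
]
merged = merge_windows(detected)
overnight = longest_window(merged)
```

Here the two night-time intervals are separated by 20 minutes and become one window, while the afternoon nap stays separate and is not selected as the overnight window.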
Wear-time requirements and exclusions: Simulations on participants with complete 7-day wear data indicated that stable weekly sleep estimates (ICC > 0.75) required ≥22 h/day wear for at least 3 days. To minimize confounding from circadian disruption, shift workers and participants with daylight saving time effects during wear were excluded. Before deploying to UK Biobank for association analyses, participants with self-reported or hospital-ascertained sleep disorders were excluded, yielding 66,262 participants.
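The wear-time criterion amounts to a simple per-participant filter. A minimal sketch, assuming daily wear time is already available in hours (the function name and input format are hypothetical):

```python
MIN_WEAR_HOURS = 22.0
MIN_VALID_DAYS = 3


def is_eligible(daily_wear_hours):
    """True if at least MIN_VALID_DAYS days reach MIN_WEAR_HOURS of wear."""
    valid_days = sum(1 for hours in daily_wear_hours if hours >= MIN_WEAR_HOURS)
    return valid_days >= MIN_VALID_DAYS


kept = is_eligible([24.0, 23.5, 22.0, 10.0, 0.0, 5.0, 21.9])     # three valid days
dropped = is_eligible([24.0, 23.0, 21.9, 21.0, 12.0, 0.0, 0.0])  # two valid days
```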
Health association analysis: Associations of overnight sleep duration and sleep efficiency with incident all-cause mortality were estimated using Cox proportional hazards models with age as the timescale. Outcomes were ascertained from national death registries (NHS Digital for England and Wales; NHS Central Register/National Records of Scotland for Scotland). Participants were censored at UK Biobank’s mortality data censoring dates (2021-09-30 England/Wales; 2021-10-31 Scotland). Main models adjusted for sex, ethnicity, Townsend Deprivation Index (quartiles), educational qualifications, smoking status, alcohol consumption (Never, <3 times/week, 3+ times/week), and overall activity (milli-g). The median sleep efficiency (85%) was used to stratify low vs high efficiency groups. Descriptive analyses also considered age, sex, physical activity level, chronotype, insomnia symptoms, and self-rated health.
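For reference, the general form of a Cox proportional hazards model with age as the timescale is given below. The notation is an assumption chosen for illustration and is not taken from the paper: a is attained age, h_0 is the unspecified baseline hazard, x_i is the covariate vector for participant i (sleep category plus the adjustment variables listed above), and the hazard ratio for covariate j is the exponentiated coefficient.

```latex
h_i(a) = h_0(a)\,\exp\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}_i\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

Using age as the timescale means participants are compared with others of the same attained age, so age is accounted for through the baseline hazard and is not entered as a covariate.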
Key Findings
Model agreement with PSG: On internal validation, mean bias (Bland–Altman) was 9.9 min for total sleep duration (95% LoA: −100.5 to 120.4 min), −24.4 min for REM (95% LoA: −136.7 to 87.8 min), and 34.4 min for NREM (95% LoA: −106.4 to 175.1 min). On external validation, mean bias was 48.2 min for total sleep duration (95% LoA: −50.3 to 146.8 min), −17.1 min for REM (95% LoA: −56.7 to 91.0 min), and 31.1 min for NREM (95% LoA: −67.3 to 129.5 min). The model tended to underestimate REM and short sleep and overestimate NREM.
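The Bland–Altman quantities quoted above follow the standard definition: the mean of the per-participant differences, and limits of agreement at the mean ± 1.96 standard deviations. The sketch below uses made-up durations and assumes the difference is device minus PSG, which matches the signs in the text (negative bias for REM, which is underestimated, and positive for NREM).

```python
from statistics import mean, stdev


def bland_altman(device_minutes, psg_minutes):
    """Return (bias, lower LoA, upper LoA) for paired duration estimates."""
    diffs = [d - p for d, p in zip(device_minutes, psg_minutes)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Synthetic total sleep durations in minutes (not study data).
device = [420, 390, 450, 400]
psg = [400, 380, 410, 410]
bias, lower, upper = bland_altman(device, psg)  # bias is 15 minutes here
```

By construction, the two limits are equidistant from the mean bias.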
Classification performance: Pooled internal validation F1: 0.75 ± 0.10 (sleep/wake, two-class) and 0.57 ± 0.11 (wake/REM/NREM, three-class). External validation F1: 0.66 ± 0.12 (two-class) and 0.49 ± 0.10 (three-class). The self-supervised SleepNet outperformed baseline models using hand-crafted features.
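As a reminder of how an F1 score is built from window-level predictions, the sketch below computes per-class F1 and its unweighted (macro) average on a toy sequence. Whether the paper's reported values are macro-averaged is an assumption here, and the labels are synthetic.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


# Synthetic PSG labels and predictions, one per 30-second window.
y_true = ["wake", "NREM", "NREM", "REM", "REM", "wake"]
y_pred = ["wake", "NREM", "REM", "REM", "NREM", "NREM"]
score = macro_f1(y_true, y_pred, ["wake", "REM", "NREM"])
```

Macro averaging gives each stage equal weight, so a rarer stage such as REM influences the score as much as NREM does.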
UK Biobank descriptive sleep metrics (n=66,262): Mean overnight sleep 6.8 ± 0.9 h/day; NREM 5.3 ± 0.9 h/day; REM 1.5 ± 0.6 h/day; sleep efficiency 81.5 ± 8.6%. Patterns varied by age, sex, activity levels, weekday/weekend, season, and health indicators, as detailed in the paper's tables.
Mortality associations: Over 452,652 person-years with 1,644 deaths, short sleepers (<6 h) had higher all-cause mortality risk than those with normal sleep duration (6 to 7.9 h): HR 1.36 (95% CI: 1.18–1.58) in low sleep efficiency (<85%) and HR 1.29 (95% CI: 1.04–1.61) in high sleep efficiency (≥85%). Risk decreased approximately linearly with increasing sleep efficiency. A non-linear association was observed for overnight sleep duration; longer sleepers were not at higher risk than the reference category. Adjustment for BMI slightly attenuated associations.
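For orientation, the follow-up figures above imply a crude (unadjusted) death rate and a mean follow-up time. These derived numbers are simple arithmetic on the reported totals; they are not results stated in the paper and should not be confused with the adjusted hazard ratios.

```python
DEATHS = 1_644
PERSON_YEARS = 452_652
PARTICIPANTS = 66_262

# Crude all-cause death rate per 1,000 person-years (roughly 3.6).
crude_rate_per_1000 = DEATHS / PERSON_YEARS * 1_000

# Mean follow-up per participant in years (about 6.8).
mean_follow_up_years = PERSON_YEARS / PARTICIPANTS
```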
Subgroup sleep architecture findings: Participants with a morning chronotype had an earlier sleep opportunity (by about 1 hour) than those with an evening chronotype. More physically active participants had slightly shorter sleep (~30 min). Participants with self-reported insomnia symptoms tended to have longer overnight sleep but lower sleep efficiency and a higher average probability of REM during the overnight window. Sleep architecture was similar on weekdays and weekends, with a slight phase delay at weekends.
Discussion
The study demonstrates that a self-supervised deep learning model applied to wrist accelerometry can classify sleep stages with fair-to-moderate agreement against PSG, addressing the first research aim. Agreement metrics and F1 scores were acceptable for large-scale phenotyping, though performance declined on external validation. Deploying the model to the UK Biobank enabled detailed, objective estimation of sleep duration and efficiency at scale, surpassing the granularity of self-reports.
For the second aim, device-derived overnight sleep duration and sleep efficiency were prospectively associated with all-cause mortality. Short sleep (<6 h) was consistently linked to higher mortality risk across both low and high sleep efficiency strata, indicating that short overnight sleep is associated with higher risk irrespective of sleep continuity. Increasing sleep efficiency was associated with lower risk, and longer sleep did not show excess risk compared to the reference group in this analysis. These findings reinforce the importance of sufficient overnight sleep duration and good sleep continuity for health and support the use of wearable-derived sleep phenotypes in epidemiologic and potentially clinical risk stratification.
Subgroup analyses illustrated plausible sleep–wake patterns across chronotype, activity level, and insomnia symptoms, lending face validity to the model-derived sleep architecture in free-living settings.
Conclusion
This work introduces SleepNet, a self-supervised deep recurrent neural network for sleep staging from wrist accelerometry, showing fair-to-moderate agreement with PSG and outperforming hand-crafted feature baselines. Applied at scale in the UK Biobank, the approach provided fine-grained estimates of overnight sleep duration and efficiency and revealed that short overnight sleep is associated with higher all-cause mortality regardless of sleep efficiency, while higher sleep efficiency is linked to lower risk. The study demonstrates the feasibility and value of large-scale, device-based sleep phenotyping for population health research. Future work could enhance staging accuracy, especially for REM and NREM differentiation, integrate multimodal sensors, expand validation across diverse cohorts and devices, and investigate cause-specific mortality and morbidity outcomes.
Limitations
- Model bias and class-specific errors: The model tends to underestimate REM and short sleep and overestimate NREM. Three-class F1 scores were modest, especially on external validation, indicating limited stage discrimination compared to PSG.
- External validation size and performance: External validation included 53 participants and showed reduced agreement relative to internal validation, potentially limiting generalizability.
- Lack of concurrent sleep diaries in UK Biobank: Time-in-bed and sleep windows were inferred via a random forest model with HMM smoothing rather than collected directly, introducing potential misclassification.
- Cohort exclusions: Shift workers, participants with daylight saving time effects during wear, and those with sleep disorders were excluded, which may limit applicability to populations with circadian disruption or sleep pathology.
- Device-only modality: Reliance on accelerometry alone (without additional physiological signals) may constrain staging accuracy compared to PSG.
- Wear-time requirements: Stable estimates required high wear time (≥22 h/day for ≥3 days), which may not always be achievable in all studies.