Accuracy of 11 Wearable, Nearable, and Airable Consumer Sleep Trackers: Prospective Multicenter Validation Study

T. Lee, Y. Cho, et al.

Consumer sleep trackers (CSTs) were compared head-to-head with in-lab polysomnography in a multicenter study of 75 participants using 11 popular devices spanning wearables, nearables, and airables. Results showed wide variation in sleep-stage agreement (macro F1 0.26-0.69), characteristic device-class biases (wearables in sleep efficiency; nearables in sleep latency), and specific strengths across trackers, with some demonstrating substantial concordance and potential clinical applicability. Authors: Taeyoung Lee, Younghoon Cho, Kwang Su Cha, Jinhwan Jung, Jungim Cho, Hyunggug Kim, Daewoo Kim, Joonki Hong, Dongheon Lee, Moonsik Keum, Clete A Kushida, In-Young Yoon, Jeong-Whun Kim.
Introduction

With increasing recognition of sleep's importance to health, interest in monitoring sleep with consumer sleep trackers (CSTs) has surged. Traditional in-lab polysomnography (PSG) is accurate but cumbersome, whereas CSTs, leveraging sensors and algorithms, offer convenient home monitoring. The study categorizes CSTs by modality: wearables (smartwatches and rings using PPG and accelerometers), nearables (radar or mattress-pad devices), and airables (smartphone-based apps using microphones or environmental sensors). Prior validation studies either lacked PSG comparisons or were limited to a small set of devices and single-center samples. Research questions: How accurately do diverse, widely used CSTs (wearable, nearable, airable) agree with PSG on sleep-stage classification and sleep measures? What device-type trends and demographic factors affect performance? Purpose: to conduct a multicenter, simultaneous validation of 11 CSTs against PSG and provide comprehensive, objective performance insights.

Literature Review

The paper situates CST validation within the existing literature, noting: (1) some studies used proxies such as EEG headbands or sleep diaries rather than PSG, limiting conclusions on agreement; (2) Chinoy et al compared seven CSTs with PSG, but that study was conducted at a single institution and omitted many widely used devices; (3) prior reviews highlighted limitations in airable apps' agreement with PSG; and (4) earlier generations of wearables (e.g., Fitbit, Oura) showed mixed validity, with newer devices integrating multi-sensor and deep learning approaches that promise improved staging. The authors emphasize the need for standardized validation frameworks and transparency, given the increasing use of opaque deep learning algorithms.

Methodology

Design: Prospective cross-sectional multicenter validation at two South Korean institutions: Seoul National University Bundang Hospital (SNUBH; tertiary) and Clionic Lifecare Clinic (CLC; primary).

Participants: N=75 adults (19–70 years) with subjective sleep discomfort; exclusions included uncontrolled acute respiratory conditions. Recruitment: 37 from SNUBH (patients scheduled for PSG for sleep disorders) and 38 from CLC (online recruitment). Demographics: 52% male (39/75), mean age 43.59±14.10 y, mean BMI 23.90±4.07 kg/m²; notable center differences in time in bed, total sleep time, WASO, and AHI.

Devices: 11 CSTs: Wearables (Google Pixel Watch, Galaxy Watch 5, Fitbit Sense 2, Apple Watch 8, Oura Ring 3); Nearables (Withings Sleep Tracking Mat, Google Nest Hub 2, Amazon Halo Rise); Airables (SleepRoutine, SleepScore, Pillow on iPhone 12 or Galaxy S21). Selection based on market popularity/availability. Devices were updated to latest software as of March 1, 2023; auto-updates thereafter disabled. Participants were trained; wearable fit ensured.

Experimental setup: To avoid biosignal interference, participants were randomized into two multi-tracker groups (A and B), each comprising noninterfering combinations; nearable radar devices (Google Nest Hub 2 vs Amazon Halo Rise) split across groups; max two watches worn (one per wrist). Oura Ring appeared in both groups. SleepRoutine and SleepScore split across iOS/Android; Pillow used on iOS only. Each subject underwent overnight PSG per AASM guidelines in controlled sleep labs; two technicians scored independently with physician review.

Ethics: IRB approvals SNUBH B-2302-908-301 and CLC P01-202302-01-048. Informed consent obtained.

Data preprocessing: Raw sleep stage outputs were extracted from device apps/portals or via manufacturer (SleepRoutine). Stages standardized: wake=0, light=1, deep=2, REM=3 (Apple Watch’s “core” mapped to light). Time synchronization performed relative to PSG: device epochs before PSG start discarded; if device started after PSG, pre-start samples labeled wake; endpoints aligned to PSG end to equalize time in bed. To address epoch boundary misalignment, all signals (including PSG) were resampled to 1-second resolution for second-by-second comparison.
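A minimal sketch of this alignment step, assuming hypothetical inputs (per-epoch device stage labels, the device epoch length in seconds, and start times expressed in seconds relative to the PSG start); the stage encoding follows the paper's convention, with Apple Watch's "core" mapped to light, and padding the tail as wake is an assumption:

```python
import numpy as np

# Stage encoding used in the paper: wake=0, light=1, deep=2, REM=3.
# Apple Watch's "core" stage is mapped to light.
STAGE_CODE = {"wake": 0, "light": 1, "core": 1, "deep": 2, "rem": 3}

def align_to_psg(device_stages, device_epoch_s, device_start_s, psg_duration_s):
    """Align one device's staged timeline to the PSG window at 1 s resolution.

    device_stages : stage labels, one per device epoch (hypothetical input)
    device_epoch_s: device epoch length in seconds (e.g., 30)
    device_start_s: device start time in seconds relative to PSG start
    psg_duration_s: PSG recording length in seconds
    """
    # Upsample device epochs to 1 s samples for second-by-second comparison.
    samples = np.repeat([STAGE_CODE[s] for s in device_stages], device_epoch_s)

    if device_start_s < 0:
        # Discard device samples recorded before the PSG started.
        samples = samples[-device_start_s:]
    elif device_start_s > 0:
        # Label the gap before the device started as wake.
        samples = np.concatenate([np.zeros(device_start_s, dtype=int), samples])

    # Truncate (or pad as wake; an assumption) so both timelines end with
    # the PSG, equalizing time in bed.
    if len(samples) >= psg_duration_s:
        return samples[:psg_duration_s]
    return np.concatenate(
        [samples, np.zeros(psg_duration_s - len(samples), dtype=int)])
```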

Statistical analysis and metrics: Two-sample t tests for demographics/sleep measures (P<.05). Agreement assessed via accuracy, sensitivity, specificity, F1; class imbalance addressed using macro F1, weighted F1, and Cohen’s kappa. Sleep measure agreement evaluated with mean bias and Bland–Altman plots, with Pearson correlation for proportional bias. Python 3.9.16 with scikit-learn, matplotlib, and scipy used.
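A minimal sketch of the epoch-level agreement metrics with scikit-learn, assuming psg and device are the aligned 1-second stage arrays produced above (hypothetical names):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def agreement_metrics(psg, device):
    """Epoch-level agreement between aligned PSG and device stage arrays."""
    return {
        "accuracy":    accuracy_score(psg, device),
        # Macro F1 averages per-stage F1 equally, countering class imbalance.
        "macro_f1":    f1_score(psg, device, average="macro"),
        # Weighted F1 weights each stage's F1 by its prevalence.
        "weighted_f1": f1_score(psg, device, average="weighted"),
        # Cohen's kappa corrects raw agreement for chance.
        "kappa":       cohen_kappa_score(psg, device),
    }
```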

Key Findings

Sample and data volume: 75 participants; total CST sleep session time 3890 hours; PSG time 543 hours; ~353 hours per CST on average; total of 349,114 epochs compared.

Overall epoch-by-epoch (4-stage) agreement: Substantial variation across devices. Highest macro F1=0.6863 (SleepRoutine, airable; accuracy 0.7106, weighted F1 0.7166, κ=0.5565). Next: Amazon Halo Rise (nearable) macro F1=0.6242 (accuracy 0.6634, weighted F1 0.6706, κ=0.4807). Wearables with moderate κ: Google Pixel Watch macro F1=0.5669 (accuracy 0.6355, κ=0.4044), Galaxy Watch 5 macro F1=0.5761 (accuracy 0.6494, κ=0.4177), Fitbit Sense 2 macro F1=0.5814 (accuracy 0.6464, κ=0.4185). Apple Watch 8 macro F1=0.4910 (κ=0.2976). Oura Ring 3 macro F1=0.5186 (κ=0.3492). Withings Sleep Tracking Mat macro F1=0.4496 (κ=0.2455). Google Nest Hub 2 macro F1=0.3009 (κ=0.0644). SleepScore macro F1=0.4049 (κ=0.2065). Lowest: Pillow macro F1=0.2588 (κ=0.0741). Average macro F1 similar across centers (SNUBH 0.4973; CLC 0.4876).

Stage-specific performance: SleepRoutine led wake (F1=0.7065) and REM (F1=0.7596) by notable margins; Amazon Halo Rise was second best in both. Deep sleep best with wearables: Google Pixel Watch F1=0.5922 (highest), Fitbit Sense 2 F1=0.5564. Light sleep showed clustered high performance among Pixel Watch, Galaxy Watch 5, Fitbit Sense 2, Amazon Halo Rise, and SleepRoutine (F1≈0.7142–0.7436).

Confusion patterns: Across devices, there was a general bias toward predicting light sleep. Wearables often misclassified wake as light; nearables frequently misclassified REM as light; airables showed more confusion between light and deep. Pillow exhibited a strong deep-stage bias, labeling 59% of epochs as deep versus 10.8% deep by PSG. Google Nest Hub 2 had the largest light-stage bias among devices.
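Misclassification patterns like these can be read off a row-normalized confusion matrix; a minimal sketch under the same stage encoding (hypothetical inputs):

```python
from sklearn.metrics import confusion_matrix

def stage_confusion(psg, device):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    1 s samples that PSG scored as stage i and the device labeled stage j,
    exposing biases such as wake->light or REM->light."""
    cm = confusion_matrix(psg, device, labels=[0, 1, 2, 3]).astype(float)
    return cm / cm.sum(axis=1, keepdims=True)
```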

Sleep measures (Bland–Altman): PSG sleep efficiency (SE) means ranged from 77.57% to 86.05%; device SE bias ranged from −3.49 pp (Amazon Halo Rise) to +12.80 pp (Google Pixel Watch). PSG sleep latency ranged from 10.80 to 19.80 min; device bias from −0.81 min (Apple Watch 8) to +39.42 min (Google Nest Hub 2). PSG REM latency ranged from 87.00 to 112.20 min; device bias from −49.89 min (Amazon Halo Rise) to +65.29 min (Google Pixel Watch). Smallest bias per metric: SE, Galaxy Watch 5 (−0.4 pp); sleep latency, Apple Watch 8 (−0.81 min); REM latency, SleepRoutine (1.85 min). Proportional bias: wearables tended toward negative proportional bias in SE, and nearables toward positive proportional bias in sleep latency; the Oura Ring and SleepRoutine showed no proportional bias across measures.
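A minimal sketch of the Bland–Altman bias and proportional-bias computation for one sleep measure, assuming per-participant PSG and device values (hypothetical inputs; the device-minus-PSG sign convention is an assumption):

```python
import numpy as np
from scipy import stats

def bland_altman(psg_vals, device_vals):
    """Mean bias, 95% limits of agreement, and a proportional-bias test."""
    psg_vals, device_vals = np.asarray(psg_vals), np.asarray(device_vals)
    diff = device_vals - psg_vals        # device minus reference (assumption)
    mean = (device_vals + psg_vals) / 2
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    # Proportional bias: Pearson correlation between difference and magnitude.
    r, p = stats.pearsonr(mean, diff)
    return {"bias": bias,
            "limits": (bias - half_width, bias + half_width),
            "proportional_bias_r": r, "p_value": p}
```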

Subgroup analyses (macro F1): BMI ≤25 vs >25: 0.5043 vs 0.4790 (difference 0.0253). Sleep efficiency ≤85% vs >85%: 0.4757 vs 0.4902 (difference 0.0145). AHI ≤15 vs >15: 0.4905 vs 0.5024 (difference 0.0119). Sex: males 0.4926, females 0.4932 (difference 0.0006). Largest device-specific subgroup differences: SleepScore by AHI (Δ=0.0929), Google Pixel Watch by sleep efficiency (Δ=0.1067), Galaxy Watch 5 by BMI (Δ=0.0785), SleepScore by sex (Δ=0.0872).
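One plausible way to reproduce these subgroup deltas is to stratify per-participant macro F1 by a demographic flag (a sketch with hypothetical inputs; the paper may instead pool epochs across participants):

```python
import numpy as np
from sklearn.metrics import f1_score

def subgroup_delta(pairs, in_group):
    """pairs: list of (psg, device) aligned stage arrays, one per participant.
    in_group: boolean flag per participant (e.g., BMI > 25)."""
    scores = np.array([f1_score(p, d, average="macro") for p, d in pairs])
    in_group = np.asarray(in_group, dtype=bool)
    a, b = scores[in_group].mean(), scores[~in_group].mean()
    return a, b, abs(a - b)  # subgroup means and their difference
```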

Discussion

The study addressed its primary objective by simultaneously validating a broad slate of 11 widely used or newly released CSTs against in-lab PSG across two institutions. Findings show that certain CSTs can achieve moderate agreement with PSG, particularly SleepRoutine (airable), Amazon Halo Rise (nearable), and leading wearables (Google Pixel Watch, Galaxy Watch 5, Fitbit Sense 2), while others showed only fair to slight agreement. Device-type patterns were evident: wearables generally overestimated sleep (misclassifying wake as light), contributing to proportional bias in sleep efficiency; nearables tended to overestimate sleep latency, likely due to reliance on motion/radar signals during prolonged sleep initiation attempts; airables displayed heterogeneous behavior based on sensor type and algorithms, with audio-based SleepRoutine excelling in wake and REM detection. REM was comparatively well detected by top devices, likely due to characteristic autonomic and physiological signatures that multiple sensor modalities can capture. Confusion matrices highlighted common misclassification trends per device type, informing potential algorithmic improvements.

Beyond stage classification, sleep measure analyses demonstrated distinct bias profiles per device type and identified top-performing devices for specific metrics (e.g., Galaxy Watch 5 for sleep efficiency, Apple Watch 8 for sleep latency, SleepRoutine for REM latency). Subgroup effects were modest overall, with slightly reduced performance in higher BMI and lower sleep efficiency groups, and minimal sex differences, suggesting broad applicability but with room for improved robustness in specific populations.

The study underscores the need for standardized validation and greater data transparency in CST development, particularly for deep learning–based systems. Cost considerations show airables as the most economical (subscription-based), nearables mid-range, and wearables higher cost but multifunctional, aiding informed selection based on user priorities. Overall, results provide practical guidance for users and developers on device strengths and limitations and highlight areas for future enhancement.

Conclusion

This multicenter, simultaneous validation of 11 consumer sleep trackers against PSG provides comprehensive evidence on device accuracy across sleep stages and measures. Several CSTs demonstrated moderate agreement with PSG, with device-type trends indicating strengths (e.g., wearables for deep sleep, audio-based airable for wake/REM) and characteristic biases (e.g., wearables’ sleep efficiency proportional bias, nearables’ sleep latency bias). The study informs personalized device selection for sleep monitoring and guides developers on algorithmic improvements. Future research should validate in diverse, multiracial populations and home environments, explore multi-sensor fusion to improve staging accuracy, and promote standardized validation protocols and transparency in training/validation data.

Limitations

Data collection rates varied between institutions due to operational issues (battery/account management, human error), causing data omissions. Participant demographics and operational processes differed across centers (e.g., time in bed, total sleep time), with slightly earlier awakenings at CLC. The study population was exclusively Korean, limiting generalizability across races/ethnicities; broader, multiethnic home-environment validations are needed.
