
Development of digital measures for nighttime scratch and sleep using wrist-worn wearable devices
N. Mahadevan, Y. Christakis, et al.
This study addresses nighttime scratching and sleep disturbance in patients with atopic dermatitis. A team of researchers from Pfizer and the University of Rochester presents a method that uses wrist-worn accelerometer data to objectively measure scratching and sleep, and reports its agreement with polysomnography and annotated video reference data.
Introduction
Atopic dermatitis (AD) is characterized by chronic pruritus, which often intensifies at night, prompting scratching that exacerbates inflammation and contributes to an itch-scratch cycle. Nocturnal itch and resultant scratching disrupt sleep, degrading quality of life for patients and caregivers. Traditional assessments rely on clinician-reported outcomes (e.g., body surface area and lesion severity) and patient-reported outcomes (PROs), which can be subjective, influenced by mood, and lack granularity outside clinical settings. There is a need for objective, continuous measures that reflect real-world symptom burden and enable reliable evaluation of interventions. Wrist-worn accelerometers have been used to measure sleep-wake patterns over extended periods, offering a practical alternative to polysomnography (PSG) for monitoring sleep quantity and circadian rhythms, though they cannot reliably stage sleep. Recent efforts have also applied accelerometry and machine learning to detect nighttime scratching. However, many existing methods rely on simulated scratch, lack automated sleep context detection, or are not optimized for unsupervised, free-living monitoring. This study aims to develop and analytically validate a hierarchical, device-agnostic method that automatically identifies sleep periods, classifies sleep/wake and scratch movements, and derives nightly digital endpoints for sleep quantity and nighttime scratching from wrist accelerometry, comparing outputs with PSG and annotated thermal video reference data.
Literature Review
Prior work in actigraphy has established wrist accelerometers as useful for long-term monitoring of sleep/wake cycles and changes in sleep quantity, though PSG remains the gold standard. Early scratch detection approaches used hand-crafted features and unsupervised (k-means) or simple supervised models (logistic regression) to distinguish simulated scratch from other movements, achieving high sensitivity but limited generalizability to free-living conditions due to reliance on simulated data and absence of sleep context detection. More recent methods using RNNs trained on real annotated overnight data improved continuous detection but lacked interpretability and did not segment sleep periods, risking false positives in free-living settings. A smartwatch-based heuristic approach demonstrated feasibility but had small validation samples and lacked automatic sleep detection. Collectively, the literature underscores the need for validated, interpretable, and scalable methods that integrate sleep context and scratch detection for unsupervised, daily-life monitoring.
Methodology
Study population and design: Forty-five AD patients (age 12–63 years) were recruited per Hanifin and Rajka criteria with ISGA ≥2 and BSA ≥5%; ppNRS ≥3 and SPS ≥1 at screening. Participants completed four nights of monitoring: two in-clinic sleep lab nights (with thermal videography both nights; limited PSG on the second night) and two at-home nights. Due to missing data (accelerometry, video malfunctions, or misalignment), 33 participants were included for algorithm development/validation (age: 31.1 ± 15.8 years; 30.3% male). Ethics approval and informed consent were obtained.
Instrumentation: Participants wore GeneActiv Original devices on both wrists, logging triaxial accelerometry (100 Hz), ambient light (100 Hz), and near-body temperature (0.334 Hz). Devices were worn continuously starting at least 3 hours prior to the first in-clinic night. PSG (second in-clinic night) included EEG (C3, C4, Occipital), EOG (2), and facial EMG (2) scored in 30-s epochs. Thermal videography was recorded at 60 Hz for the overnight period.
Analytical framework: A hierarchical pipeline processes raw accelerometry to derive nightly sleep and scratch endpoints. Context detection includes non-wear detection, total sleep opportunity (TSO) window detection (largest nightly period wherein sleep is intended), and hand movement detection within TSO. Symptom estimation includes sleep/wake classification and scratch classification, followed by endpoint computation.
Sleep module: Accelerometry was downsampled from 100 Hz to 20 Hz. Data were segmented into daily windows (12:00 PM–12:00 PM). Days with <6 h of data were excluded. Non-wear was identified via near-body temperature: processed temperature signals (5-s rolling median, consecutive 5-s average, 5-min rolling median) with values <25 °C flagged as non-wear. Candidate TSO periods were identified heuristically using changes in arm angle derived from accelerometer axes; non-wear periods were excluded. The longest remaining candidate per day was selected as TSO. Sleep/wake was classified per-minute using an open-source activity index as a proxy for activity counts, followed by Webster’s rescoring rules to improve specificity. Sleep endpoints computed within TSO included TST, percent time asleep (PTA), wake after sleep onset (WASO), sleep onset latency (SOL), and number of wake bouts (NWB).
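As an illustration of the final step, the sleep endpoints can be sketched from a per-minute sleep/wake sequence covering one TSO window. The function name `sleep_endpoints` and the exact endpoint definitions in the docstring are assumptions made for this example; the summary does not spell out how the study defined each endpoint.

```python
from typing import Dict, List


def sleep_endpoints(is_asleep: List[bool]) -> Dict[str, float]:
    """Summarize one TSO window of per-minute sleep/wake labels.

    Assumed definitions (illustrative, not taken from the study):
      TST  = minutes scored as sleep
      PTA  = TST as a percentage of the TSO length
      SOL  = minutes from TSO start to the first sleep minute
      WASO = wake minutes between the first and last sleep minute
      NWB  = number of separate wake runs within that same span
    """
    tso_minutes = len(is_asleep)
    tst = sum(is_asleep)
    if tst == 0:
        # No sleep detected: the whole window counts as onset latency.
        return {"TST": 0, "PTA": 0.0, "SOL": tso_minutes, "WASO": 0, "NWB": 0}

    first = is_asleep.index(True)
    last = tso_minutes - 1 - is_asleep[::-1].index(True)
    core = is_asleep[first:last + 1]  # span from sleep onset to final awakening

    waso = len(core) - sum(core)
    # A wake bout starts wherever a sleep minute is followed by a wake minute.
    nwb = sum(1 for prev, cur in zip(core, core[1:]) if prev and not cur)

    return {
        "TST": tst,
        "PTA": 100.0 * tst / tso_minutes,
        "SOL": first,
        "WASO": waso,
        "NWB": nwb,
    }


# Hypothetical 60-minute window: 5 min awake, then sleep with two awakenings.
night = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 20 + [False] * 2 + [True] * 15 + [False] * 5
print(sleep_endpoints(night))
```

Wake minutes before sleep onset and after the final awakening are excluded from WASO in this sketch, which is one common convention but not the only one.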
Scratch module: Within the TSO, accelerometry was segmented into non-overlapping 3-s windows (selected after testing 1–3 s; 3 s balanced resolution and performance). A heuristic hand-movement detector (a threshold of 0.023 on the rolling coefficient of variation, tuned at the 25th percentile) determined whether hand movement was present. Only windows with continuous hand movement were passed to a binary scratch classifier. Training labels were generated from thermal video annotations (two annotators, with disagreements arbitrated) marking scratch vs restless (non-scratch) movements, hand involved, body location, and severity. Annotations ≥3 s were used; longer annotations were split into 3-s windows with 50% overlap. Time alignment between video and accelerometry used a clap synchronization event.
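The windowing and the movement gate can be sketched as follows. This is a simplified illustration: it computes one coefficient of variation per window rather than the rolling version the study describes, and the 20 Hz rate is assumed from the sleep module's downsampling. The helper names (`split_windows`, `has_hand_movement`) are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

FS_HZ = 20             # assumed sampling rate (the sleep module's 20 Hz)
WINDOW_S = 3           # window length quoted in the text
COV_THRESHOLD = 0.023  # threshold value quoted in the text


def split_windows(n_samples: int, fs: int = FS_HZ, window_s: int = WINDOW_S,
                  overlap: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start, stop) sample indices of fixed-length windows.

    overlap=0.0 gives the non-overlapping inference windows;
    overlap=0.5 gives the 50%-overlapping windows used for training labels.
    """
    size = int(fs * window_s)
    step = max(1, int(size * (1.0 - overlap)))
    return [(s, s + size) for s in range(0, n_samples - size + 1, step)]


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the mean (0.0 if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def has_hand_movement(magnitude: Sequence[float],
                      threshold: float = COV_THRESHOLD) -> bool:
    """Flag a window as movement when its variability exceeds the threshold."""
    return coefficient_of_variation(magnitude) > threshold


# Hypothetical magnitude signal: 3 s at rest followed by 3 s of 4 Hz motion.
rest = [1.0] * 60
moving = [1.0 + 0.2 * math.sin(2 * math.pi * 4 * i / FS_HZ) for i in range(60)]
signal = rest + moving
for start, stop in split_windows(len(signal)):
    print(start, stop, has_hand_movement(signal[start:stop]))
```

Gating on a cheap variability statistic means the classifier only ever sees windows that contain movement, which is the role the hand-movement detector plays in the pipeline described above.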
Scratch classifier training: Preprocessing applied a first-order Butterworth high-pass filter (0.25 Hz) to remove gravity, then computed signal vector magnitude (SVM) and principal components (PC1, PC2) to reduce orientation dependence. Thirty-six time/frequency features were extracted per window; class balancing via random sampling was applied before feature selection. Recursive feature elimination with cross-validation (decision tree estimator) selected 26 features. A random forest (50 estimators; larger settings offered no improvement) was trained. Model evaluation used leave-one-subject-out validation.
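The first two preprocessing steps can be sketched in plain Python. The filter below is a first-order Butterworth high-pass derived with the bilinear transform and applied in a single forward pass; whether the study filtered forward-only or zero-phase is not stated, and the 20 Hz sampling rate is an assumption. The principal-component projection and feature extraction are omitted.

```python
import math
from typing import List, Sequence


def highpass_first_order(x: Sequence[float], cutoff_hz: float = 0.25,
                         fs_hz: float = 20.0) -> List[float]:
    """First-order Butterworth high-pass (bilinear transform), forward pass only.

    Difference equation: y[n] = b0 * (x[n] - x[n-1]) - a1 * y[n-1]
    """
    k = math.tan(math.pi * cutoff_hz / fs_hz)  # pre-warped cutoff
    b0 = 1.0 / (1.0 + k)
    a1 = (k - 1.0) / (k + 1.0)

    y: List[float] = []
    x_prev = 0.0  # zero initial state, so the output starts with a transient
    y_prev = 0.0
    for sample in x:
        out = b0 * (sample - x_prev) - a1 * y_prev
        y.append(out)
        x_prev, y_prev = sample, out
    return y


def vector_magnitude(ax: Sequence[float], ay: Sequence[float],
                     az: Sequence[float]) -> List[float]:
    """Signal vector magnitude: Euclidean norm of the three axes per sample."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(ax, ay, az)]


# A constant 1 g offset (gravity on one axis) decays to ~0 after filtering.
gravity_only = [1.0] * 400
filtered = highpass_first_order(gravity_only)
print(abs(filtered[-1]) < 1e-6)
```

Removing the near-constant gravity component before taking the magnitude leaves a signal driven mostly by movement, which is why the vector magnitude is less sensitive to how the device sits on the wrist.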
Statistical analysis: Epoch-level performance for sleep (30-s PSG comparison, visit 2) was evaluated separately for each wrist. Scratch epoch-level performance used leave-one-subject-out across both in-clinic nights, pooling wrists. Metrics included accuracy, sensitivity, specificity, F1, and AUC-ROC. SHAP was used to interpret feature importance for scratch classification. Endpoint-level agreement was assessed via Pearson correlations (with p-values) and Bland–Altman mean bias and limits of agreement. Sleep endpoints were averaged across wrists; scratch endpoints were summed across wrists. Scratch endpoints were log-transformed (log(x+1)) due to right skew. Associations of scratch endpoints with WASO and TST were also analyzed.
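The endpoint-level agreement statistics can be sketched as below. Two details are assumptions: the difference is taken as reference minus prediction (inferred from the way the summary describes a positive bias as underestimation), and the limits of agreement use the conventional mean ± 1.96 standard deviations, which the summary does not state explicitly. P-values are not computed here.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bland_altman(reference: Sequence[float],
                 predicted: Sequence[float]) -> Dict[str, float]:
    """Mean bias and 95% limits of agreement (assumed: reference - predicted)."""
    diffs = [r - p for r, p in zip(reference, predicted)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of differences
    return {"bias": bias, "lower": bias - spread, "upper": bias + spread}


def log1p_all(values: Sequence[float]) -> List[float]:
    """log(x + 1) transform, as applied to the right-skewed scratch endpoints."""
    return [math.log1p(v) for v in values]


# Hypothetical nightly scratch durations in seconds (not study data).
video = [120.0, 30.0, 600.0, 0.0, 250.0]
wrist = [90.0, 45.0, 480.0, 10.0, 300.0]
print(pearson_r(log1p_all(video), log1p_all(wrist)))
print(bland_altman(log1p_all(video), log1p_all(wrist)))
```

The log(x + 1) form keeps nights with zero scratching in the analysis, since log(0) would otherwise be undefined.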
Key Findings
- Participants: 33 AD patients included for analysis (age 31.1 ± 15.8 years; 30.3% male). Most had bilateral wrist data.
- Sleep epoch-level performance (vs PSG, visit 2): accuracy 0.85 ± 0.09 (left), 0.85 ± 0.10 (right); sensitivity 0.95 ± 0.08 (left), 0.95 ± 0.07 (right); specificity 0.44 ± 0.24 (left), 0.44 ± 0.23 (right); F1 0.90 ± 0.07 (left), 0.90 ± 0.09 (right).
- Scratch epoch-level performance (leave-one-subject-out): accuracy 0.73 ± 0.09; sensitivity 0.61 ± 0.15; specificity 0.80 ± 0.10; PPV 0.73 ± 0.17; NPV 0.68 ± 0.17; F1 0.66 ± 0.15; AUC 0.81. The threshold-dependent metrics use a classification threshold of 0.5. 3-s windows outperformed 2-s windows.
- Scratch classifier robustness across severity: specificity comparable across ISGA categories (mild 0.83, moderate 0.80, severe 0.80); higher sensitivity in severe AD (0.83) vs mild (0.54) and moderate (0.66). No clear sex differences.
- Endpoint-level agreement (N=32 for sleep, N=25 for scratch):
• TSO vs PSG: r = 0.72, p < 0.001; mean bias +29.66 min (underestimation of duration by ~29.7 min); limits of agreement −88.63 to 147.94 min.
• TST vs PSG: r = 0.76, p < 0.001; mean bias −24.19 min (overestimation by ~24.2 min); limits −148.14 to 99.75 min.
• PTA vs PSG: r = 0.41, p = 0.019; mean bias −10.17 percentage points; limits −38.89 to 18.55.
• Total scratch counts (log-transformed) vs video: r = 0.63, p < 0.001; bias 0.13; limits −1.16 to 1.41.
• Total scratch duration (log-transformed) vs video: r = 0.82, p < 0.001; bias 0.71; limits −0.22 to 1.63.
- Left vs right wrist sleep endpoints showed high agreement (TSO r=0.83, TST r=0.93, PTA r=0.94; all p<0.001).
- Associations: predicted scratch endpoints correlated strongly with WASO (events r=0.90; duration r=0.82; both p<0.001) and showed weak/non-significant association with TST (events r=0.30; duration r=−0.31).
- SHAP feature analysis for scratch: measures of signal periodicity (dominant frequency, mean cross rate) and smoothness (SPARC, jerk) were most influential; higher mean cross rate and dominant frequency increased scratch likelihood, while smoother movements (higher SPARC, lower jerk) decreased it.
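For readers less familiar with the epoch-level metrics reported in this list, the sketch below derives them from paired reference and predicted labels. It is a generic illustration with a hypothetical function name, not the study's evaluation code. For the sleep results the positive class is sleep, so specificity reflects how well wake epochs are recognized; for the scratch results the positive class is scratch.

```python
from typing import Dict, Sequence


def epoch_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Binary classification metrics from per-epoch reference and predicted labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical labels for ten epochs (True = positive class).
truth = [True, True, True, True, False, False, False, False, False, False]
pred = [True, True, True, False, True, False, False, False, False, False]
print(epoch_metrics(truth, pred))
```

When one class dominates, as sleep epochs do overnight, accuracy and F1 can stay high even if the minority class is detected poorly, which is the pattern visible in the sleep results above.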
Discussion
The hierarchical, device-agnostic pipeline integrates automatic sleep context detection (TSO) with sleep/wake classification and scratch detection to produce nightly digital endpoints from wrist accelerometry. Accurate TSO detection is crucial for bounding nighttime analysis. The detected TSO was on average ~30 minutes shorter than the TSO derived from PSG lights-off/lights-on times; this gap is consistent with known challenges in defining TSO from behavior, and the detected window may reflect real participant activity more accurately than fixed lab times. Epoch-level sleep classification performance (~85% accuracy; high sensitivity but lower specificity for wake) is comparable to prior actigraphy literature. The scratch module, by constraining analysis to hand movement within TSO and using interpretable features, achieved moderate-to-strong agreement with video-derived endpoints, particularly for total scratch duration, and performance comparable to RNN-based methods while offering better interpretability. Feature importance analyses suggest scratching is characterized by more rapid, periodic, and less smooth movements than restless motion. Clinically, strong correlations between scratch endpoints and WASO, but not TST, suggest scratching contributes to sleep fragmentation rather than reduced sleep duration. The approach is designed for scalability in clinical studies: it operates on sample-level data, minimizes dependence on device orientation via SVM/PCs, uses modest sampling rates to preserve battery life, and supports single- or dual-wrist deployments. However, potential error propagation across hierarchical stages, lower sensitivity for low-intensity scratching, and the need for validation in free-living, longer-term settings remain important considerations.
Conclusion
This work presents a validated, hierarchical method to continuously and objectively quantify nighttime scratch and sleep from wrist-worn accelerometer data. The pipeline automatically detects TSO, classifies sleep/wake and scratch events, and produces interpretable digital endpoints that show good agreement with PSG and video annotations. The approach is device-agnostic, scalable, and suitable for longitudinal monitoring in real-world settings, supporting its use in clinical research and potentially in disease management. Future research should: (1) validate performance and generalizability in extended free-living deployments; (2) evaluate responsiveness and sensitivity to clinically meaningful changes and treatment effects; (3) improve detection of low-intensity scratching and reduce false positives from restless movements; (4) explore multimodal sensing (e.g., EMG, acoustic, vibration) and alternative wear locations (e.g., ring sensors) to overcome accelerometry limitations; and (5) refine TSO detection and non-wear detection across diverse devices.
Limitations
- Sleep/wake detection exhibited low specificity for wake, consistent with actigraphy limitations and the relatively small proportion of wake epochs.
- TSO detection showed an average underestimation (~29.7 min) relative to PSG lights-off/lights-on times, reflecting challenges in aligning behavioral sleep intention with lab timings.
- Scratch detection sensitivity was lower for low-intensity, slower, smoother movements, suggesting a ceiling on wrist-accelerometry sensitivity for subtle scratching.
- Potential false positives due to restless, non-scratch movements may affect scratch endpoint reliability in AD populations with increased nocturnal restlessness.
- Hierarchical design introduces risk of error propagation (e.g., mis-detected TSO affecting downstream scratch/sleep estimates).
- Validation was primarily in-clinic for algorithm development; generalizability to longer free-living monitoring requires further evaluation.
- Non-wear detection relied on near-body temperature, limiting applicability to devices with suitable sensors or requiring alternative non-wear detection methods.
- PSG was limited (no respiration/limb movement/oximetry), and some data exclusions reduced sample sizes for certain analyses.