Self-supervised learning for human activity recognition using 700,000 person-days of wearable data

Computer Science

H. Yuan, S. Chan, et al.

Discover how Hang Yuan, Shing Chan, Andrew P. Creagh, and their colleagues are revolutionizing human activity recognition with self-supervised learning techniques applied to a massive dataset from the UK Biobank. Their models not only achieve remarkable accuracy but also generalize across various environments and devices, paving the way for advancements in fields with limited labeled data.

~3 min • Beginner • English
Introduction
Wearable sensors are widely used for fitness, wellness, remote monitoring, clinical trials, population health studies, and personalised medicine. Accurate human activity recognition (HAR) from wrist-worn accelerometers depends on reliable algorithms, yet progress has been limited by the scarcity of large, diverse labelled datasets. Unlike computer vision and natural language processing, which have benefited from massive datasets, HAR research often relies on small, scripted, lab-based datasets, which confounds the evaluation of deep learning methods and often yields only limited gains over simpler models. To overcome this limitation, the study leverages the UK Biobank accelerometer dataset, comprising over 100,000 participants who each contributed 7 days of free-living, 24/7 motion data (>700,000 person-days in total). The research question is whether self-supervised learning (SSL), applied at this scale to free-living data, can learn generalisable representations that improve downstream HAR performance across diverse external datasets, devices, and populations, while reducing reliance on labelled data.
Literature Review
Prior work on SSL for sensor-based HAR has explored multi-task self-supervision, masked reconstruction, contrastive learning, and bootstrapping. A recent comprehensive benchmark concluded that multi-task self-supervision can learn the most generic representations applicable to different downstream tasks. However, existing efforts typically pre-trained and fine-tuned on the same datasets or relied on relatively small cohorts (around n=100), limiting generalisability. Some empirical studies suggested deep learning models such as DeepConvLSTM did not significantly outperform simpler feature-based methods on small datasets. The current study builds on these findings by applying multi-task SSL at unprecedented scale using UK Biobank free-living data, addressing domain shift and task shift across multiple external benchmarks. The chosen pretext tasks (arrow of time, permutation, time warping) were originally introduced as data augmentations and later assessed in multi-task SSL, with the goal of learning motion-dynamics-relevant features for HAR.
Methodology
Data and preprocessing: Tri-axial wrist-worn accelerometer data, originally sampled at high rates (e.g., ~100 Hz), were resampled to 30 Hz and segmented into fixed 10-second windows treated as independent inputs. This windowing and sampling rate covers the frequency range of human movement (<10 Hz) while exceeding the Nyquist rate. Consistent preprocessing was applied to all downstream benchmarks to ensure fair comparisons.

Datasets: Unlabelled pre-training used the UK Biobank accelerometer dataset (>100,000 participants, 7 days of wear per participant; ~700,000 person-days; free-living). Eight external labelled datasets were used for downstream evaluation, spanning 600 to ~600,000 samples, 4–18 activity classes, five device brands, multiple wear locations, varied populations (age, sex, health status), and settings (free-living, semi-free-living, lab): Capture-24, Rowlands, WISDM, MJFF-LR, REALWORLD, Opportunity, PAMAP2, and ADL. In very small datasets (<10 subjects), classes absent for some subjects were removed where necessary to keep subject-wise cross-validation comparable.

Self-supervised (pretext) tasks: Three binary tasks were used in a multi-task SSL framework: (1) Arrow of Time (AoT): reverse the time axis of the window; (2) Permutation: split the window into four chunks (each at least 10 samples long) and randomly reorder them; (3) Time Warping (TW): locally stretch or compress segments to simulate slow or fast motion. Each task predicts whether its transformation was applied, and the per-task cross-entropy losses were weighted equally in the multi-task objective (the transformations are illustrated in the code sketch later in this section).

Weighted sampling: Free-living data contain long low-movement periods that are barely altered by the transformations and are therefore uninformative for the pretext tasks. To improve stability and convergence, windows were sampled during SSL in proportion to their standard deviation, up-weighting higher-movement segments.

Network architecture and training: A 1D ResNet-V2 with 18 layers (approximately 10 million parameters) served as the shared feature extractor, producing a 1024-dimensional feature vector. For SSL, a separate softmax head was attached per pretext task; for downstream HAR, a fully connected layer (size 512) and a softmax readout were appended. At each SSL training iteration, up to four UK Biobank subjects were loaded; for each subject, one day was sampled and 1,500 ten-second windows were drawn from it, forming a batch of up to 6,000 windows. Random axis swaps and rotations were applied to embed invariance to device orientation. Optimisation used Adam (learning rate 1e-3) with linear learning-rate scaling for large batches and a five-epoch burn-in. Training was distributed across four NVIDIA Tesla V100-SXM2 GPUs (32 GB) and took ~420 GPU-hours for ~20 epochs. An 8:2 train/test split was used for the SSL experiments.

Downstream evaluation and baselines: Two fine-tuning strategies were evaluated: (1) fine-tune all layers; (2) freeze the feature extractor and fine-tune only the classifier (FC layers); a minimal sketch of both strategies appears at the end of this section. Baselines included training the same deep network from scratch and a strong random forest built on established time-series features. Datasets with <10 subjects used leave-one-subject-out cross-validation; datasets with ≥10 subjects used five-fold subject-wise cross-validation. Each split used 7:1:2 train/validation/test ratios. Early stopping with a patience of 5 epochs was applied; for non-convergent runs, the best result across three random seeds was reported.
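To make the pretext tasks concrete, below is a minimal NumPy sketch of the three transformations and of the standard-deviation-based sampling weights, assuming 10-second windows of shape (timesteps, 3) at 30 Hz (i.e., 300 × 3 arrays). The function names, the warping parameters (sigma, knots), and the 50% transformation probability are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

RNG = np.random.default_rng(42)

def arrow_of_time(window: np.ndarray) -> np.ndarray:
    """Reverse the time axis of a (timesteps, 3) accelerometer window."""
    return window[::-1].copy()

def permute(window: np.ndarray, n_chunks: int = 4, min_len: int = 10) -> np.ndarray:
    """Split the window into n_chunks segments (each >= min_len samples)
    and shuffle their order."""
    n = len(window)
    # rejection-sample cut points until every chunk meets the minimum length
    while True:
        cuts = np.sort(RNG.choice(np.arange(1, n), size=n_chunks - 1, replace=False))
        bounds = np.concatenate(([0], cuts, [n]))
        if np.all(np.diff(bounds) >= min_len):
            break
    chunks = [window[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]
    return np.concatenate([chunks[i] for i in RNG.permutation(n_chunks)])

def time_warp(window: np.ndarray, sigma: float = 0.2, knots: int = 4) -> np.ndarray:
    """Locally stretch/compress the window by resampling along a smoothly
    perturbed time axis (simulating slower or faster motion)."""
    n = len(window)
    # smooth random warping curve built from a few perturbed knots
    knot_x = np.linspace(0, n - 1, knots + 2)
    knot_y = knot_x * (1.0 + RNG.normal(0.0, sigma, size=knots + 2))
    warped_t = np.clip(np.interp(np.arange(n), knot_x, knot_y), 0, n - 1)
    return np.stack([np.interp(warped_t, np.arange(n), window[:, ax])
                     for ax in range(window.shape[1])], axis=1)

def make_pretext_sample(window: np.ndarray):
    """Apply each transformation with probability 0.5 and return the
    transformed window plus the three binary pretext labels (AoT, permutation, TW)."""
    labels = RNG.integers(0, 2, size=3)
    x = window
    if labels[0]:
        x = arrow_of_time(x)
    if labels[1]:
        x = permute(x)
    if labels[2]:
        x = time_warp(x)
    return x, labels

def sampling_weights(windows: np.ndarray) -> np.ndarray:
    """Weight each window (shape: num_windows x timesteps x 3) by its overall
    standard deviation so that high-movement segments are drawn more often."""
    sd = windows.std(axis=(1, 2))
    return sd / sd.sum()
```

In this sketch, each 10-second window yields three binary labels that the shared encoder's per-task softmax heads are trained to predict, and sampling_weights implements the movement-proportional sampling described above.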
A unified implementation ensured consistent preprocessing, training, and evaluation across datasets.

Transfer learning comparisons: In addition to SSL pre-training on UK Biobank, supervised pre-training was performed on the two largest labelled datasets (Capture-24 and Rowlands) and then fine-tuned on the other labelled datasets, to compare supervised against self-supervised pre-training.

Ablation studies: A labelled-data ablation varied the number of labelled subjects during fine-tuning (Capture-24 and Rowlands) to assess performance in limited-data regimes. An unlabelled-data ablation varied the number of pre-training subjects from 100 to 100,000 (log-scale increments) and, with 10,000 pre-training subjects fixed, varied the per-subject data ratio from 0.25 to 1.0 to quantify the impact of unlabelled data volume and density.

Representation analysis and explainability: UMAP projections of raw inputs, untrained features, and SSL features (without fine-tuning) were used to visualise clustering and separability across activity classes and intensities. Layer-wise relevance propagation and other XAI methods were applied to transformed versus original signals to identify which segments drove the pretext-task predictions, with sample-masking experiments evaluating faithfulness on 1,000 out-of-sample UK Biobank subjects.
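As a complement, here is a minimal PyTorch sketch of the two downstream fine-tuning strategies described above (fine-tune all layers vs freeze the feature extractor and train only the classifier), assuming the SSL-pre-trained backbone is available as an nn.Module that outputs a 1024-dimensional feature vector. The class and function names (HARClassifier, build_finetune_optimizer), the ReLU between the FC layers, and the learning rate are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class HARClassifier(nn.Module):
    """Pre-trained feature extractor + FC(512) + softmax readout for a
    downstream HAR dataset with n_classes activities."""
    def __init__(self, feature_extractor: nn.Module, n_classes: int,
                 feat_dim: int = 1024):
        super().__init__()
        self.feature_extractor = feature_extractor  # e.g. the SSL-pre-trained 1D ResNet
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_classes),  # logits; softmax is applied inside the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.feature_extractor(x))

def build_finetune_optimizer(model: HARClassifier, freeze_extractor: bool,
                             lr: float = 1e-3) -> torch.optim.Optimizer:
    """Strategy (1): fine-tune all layers. Strategy (2): freeze the feature
    extractor and train only the classifier head."""
    if freeze_extractor:
        for p in model.feature_extractor.parameters():
            p.requires_grad = False
        params = model.classifier.parameters()
    else:
        params = model.parameters()
    return torch.optim.Adam(params, lr=lr)
```

Calling build_finetune_optimizer(model, freeze_extractor=True) reproduces strategy (2), while freeze_extractor=False leaves all parameters trainable as in strategy (1); either optimizer can then be used with a standard cross-entropy training loop.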
Key Findings
- SSL pre-training on UK Biobank improved downstream HAR across all eight benchmark datasets, outperforming both training from scratch and strong random forest baselines. Reported relative F1 improvements (fine-tuning all layers vs training from scratch) ranged from 2.5% to 130.9% (median 24.4%); the study also reports a median relative F1 improvement of 18.4% vs training from scratch and 8.5% vs the random forest.
- Dataset-specific F1 improvements (fine-tuning all layers vs training from scratch): Capture-24 +2.5%; Rowlands +14.4%; WISDM +18.4%; MJFF-LR +130.9%; REALWORLD +12.3%; Opportunity +55.4%; PAMAP2 +30.4%; ADL +100.0%.
- The random forest outperformed deep models trained from scratch on all datasets except the largest labelled set (Capture-24), but SSL pre-trained models surpassed both baselines on every dataset. Fine-tuning all layers consistently outperformed freezing the feature extractor and training only the classifier.
- Weighted sampling during SSL substantially improved convergence and accuracy on the individual pretext tasks; without it, AoT performance remained near chance and permutation dropped by ~10 percentage points.
- Multi-task SSL configurations performed similarly on the larger downstream datasets (Capture-24, Rowlands) but showed more variance on a smaller dataset (Opportunity). Training the tasks jointly was selected for subsequent experiments to encourage more general representations.
- SSL pre-training sometimes outperformed supervised pre-training on labelled source datasets (e.g., Rowlands or Capture-24 as source), and SSL on the much larger, more diverse UK Biobank yielded the best overall transfer performance across targets.
- Data scaling effects: increasing the number of unlabelled pre-training subjects improved downstream F1 approximately linearly on a log scale, with the strongest gains on the smallest dataset (Opportunity). Reducing per-subject unlabelled data from 100% to 25% (with 10,000 subjects) decreased downstream F1 by less than 10%.
- Labelled data effects: SSL pre-trained models maintained strong performance even with few labelled subjects, whereas fully supervised deep models and random forests were more sensitive to labelled sample size; F1 gains were roughly linear with added subjects, with larger increments at lower subject counts.
- Representation analysis showed that SSL features had superior intra-class compactness and inter-class separability (e.g., distinguishing walking/stairs from sitting/typing, and capturing activity-intensity gradients). XAI analyses indicated that relevance aligned with natural motion dynamics (e.g., swings in tennis) while de-emphasising stationary periods and flagging unrealistic augmented dynamics.
Discussion
The study demonstrates that large-scale SSL on free-living wrist accelerometer data learns generalisable, transferable representations that consistently improve HAR across diverse external datasets, devices, environments, and populations. This addresses the central challenge of limited labelled data by enabling strong performance with fewer labels, especially valuable in clinical and population health contexts where annotation is costly. Compared to prior approaches trained on small or homogeneous datasets, pre-training on UK Biobank’s vast and diverse free-living data yields broader coverage of real-world activities and better robustness to domain and task shifts. Notably, SSL pre-training can match or surpass supervised pre-training, echoing trends observed in other modalities, and suggests that representation quality benefits more from data diversity and scale than from labels alone in this setting. The findings support the concept of a foundational HAR model that can be fine-tuned to varied downstream tasks, offering a practical and reproducible pathway to state-of-the-art performance. Improved activity measurement can enhance clinical monitoring, epidemiology, and personalised health applications.
Conclusion
The authors developed and released a self-supervised, ResNet-based foundational model for HAR by pre-training on ~700,000 person-days of UK Biobank free-living accelerometer data. The pre-trained model consistently outperforms strong baselines across eight external datasets, with the largest gains in small labelled-data regimes, and generalises across domains, devices, and populations. The unified evaluation framework and open-sourced models enable reproducible, high-performing HAR systems for diverse applications. Future work should expand pre-training data diversity across regions and demographics, incorporate multimodal wearable signals (e.g., ECG), explore newer SSL paradigms and architectures, and investigate compute-optimal trade-offs between model size and data scale.
Limitations
- Pre-training data primarily comprise participants from the UK (largely Caucasian), limiting demographic and geographic diversity; generalisability to under-represented populations requires further validation and more diverse pre-training corpora.
- The current work focuses on unimodal accelerometer data; incorporating additional wearable modalities (e.g., ECG and other biosensors) could broaden the learned representations.
- Alternative SSL methods (e.g., autoencoders, contrastive learning) did not yield high-quality representations on this free-living dataset in preliminary attempts, potentially due to differences between free-living and lab-based dynamics; further comparative analyses are warranted.
- Availability and governance of open benchmark datasets vary; some lacked clear licensing or consent information, posing challenges for reproducibility and data governance in future studies.