A method for intelligent allocation of diagnostic testing by leveraging data from commercial wearable devices: a case study on COVID-19

Medicine and Health

M. M. H. Shandhi, P. J. Cho, et al.

This study presents an Intelligent Testing Allocation (ITA) method designed to make diagnostic testing more efficient during disease outbreaks. Analyzing data from over 15,000 participants, including smartwatch metrics, Md Mobashir Hasan Shandhi and colleagues show that resting heart rate is a more sensitive early indicator of COVID-19 than step count, and that deploying ITA could substantially alleviate testing resource shortages.

Introduction
The COVID-19 pandemic has severely impacted the world, with more than 562 million cases and 6.37 million deaths worldwide and more than 90 million cases and 1 million deaths in the US at the time of writing. Mass surveillance testing is a key tool for monitoring spread, but cost, availability, and practicality impede frequent widespread testing. Test shortages have recurred with new variants (Delta, Omicron) and emerging diseases (e.g., monkeypox), and access disparities exacerbate inequities, particularly in rural areas. Insufficient testing leads to underreporting and misinformed public health responses, underscoring the need to improve surveillance testing. Targeting testing toward individuals more likely to be infected can increase positivity rates and reduce the number of tests needed to capture positives. This study develops and evaluates an Intelligent Testing Allocation (ITA) method leveraging wearable device data and surveys to prioritize diagnostic testing in resource-limited settings.
Literature Review
The paper situates ITA within prior work on using wearables for infectious disease detection and monitoring. Previous studies have shown changes in resting heart rate, steps, and respiratory rate around COVID-19 infection and explored passive detection using smartwatch data (e.g., Mishra et al., Quer et al., Miller et al., Grzędzicka et al., Jain et al.). Most prior evaluations emphasized AUC-ROC despite class imbalance; this work argues AUC-PR is more appropriate for imbalanced datasets. The study also references public health literature on testing shortages, demand surges during variant waves, and disparities in access, as well as methodological guidance on model selection and overfitting.
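To see why the class-imbalance argument favors AUC-PR, consider a minimal sketch, using scikit-learn on a synthetic dataset (an illustration only, not the study's data): with roughly 12% positives, a model can post a healthy AUC-ROC while its AUC-PR stays tethered to the low base rate.

```python
# Minimal sketch: why AUC-PR is emphasized over AUC-ROC under class
# imbalance. Synthetic data for illustration only; not the study's dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Roughly 12% positives, mirroring the positivity rate reported in the paper.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.88],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# AUC-ROC can look strong even when precision among flagged cases is modest;
# AUC-PR (average precision) stays anchored to the positive-class base rate.
print("AUC-ROC:  ", roc_auc_score(y_te, probs))
print("AUC-PR:   ", average_precision_score(y_te, probs))
print("Base rate:", y_te.mean())
```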
Methodology
Study platforms and cohorts: The team developed the CovIdentify platform in April 2020 to collect electronic symptom surveys and to integrate data from commercial wearables (Fitbit, Garmin, Apple Watch). Data from two cohorts were used: the Duke-led CovIdentify cohort and the Stanford MyPHD cohort. A total of 7348 participants e-consented, and 6765 enrolled by completing an onboarding survey. Across studies, 1265 individuals contributed smartwatch data, and subsets with sufficient wearable data were used for modeling. Participants reported diagnostic test results (positive and negative). Data were stored securely and transferred to a protected environment.

Wearable data processing: For high-frequency devices, daily resting heart rate (RHR) was computed as the mean heart rate between midnight and 7 AM during epochs with zero steps, using device-specific data-sufficiency thresholds (e.g., Fitbit ≥2630 HR points, Apple Watch ≥19, Garmin ≥1389, all between midnight and 7 AM). Daily step counts were summed over 24 hours, and only days with both HR and step data were included. Sleep was approximated as inactivity epochs between midnight and 7 AM. Inclusion required sufficient data availability: at least 50% of days with data in both the baseline and detection windows (with device-specific thresholds).

Time windows: The baseline period was 20–60 days prior to the diagnostic test date. The detection period extended up to 21 days prior to the test date, chosen to cover the 2–14-day incubation period plus delays to testing. A grid search over detection-window end dates and lengths identified the optimal feature-extraction window: ending 1 day before the test, with a 5-day length.

Cohorts for modeling: After applying the data thresholds, three cohorts were defined: All-Frequency (AF), n=520 (63 COVID-19 positive, 457 negative); All-High-Frequency (AHF), n=469 (54 positive, 415 negative); and Fitbit-High-Frequency (FHF), n=280 (40 positive, 240 negative). Each cohort was split into a training set and an independent test set (AF: 105 of 520 in test; AHF: 97 of 469; FHF: 63 of 280); test-set positive counts were small (e.g., 7 positives in the FHF test set). Some positives were clinically confirmed; others were self-reported.

Feature engineering: Deviation metrics quantified the change from baseline for each daily value x in the detection window: Delta (x − baseline median), Delta_Normalized (Delta / baseline median), Delta_Standardized (Delta / baseline IQR), and Z-score ((x − baseline mean) / baseline SD). Summary statistics (average, median, max, min, range) of these daily deviations within the detection window were extracted for RHR and steps to form the candidate feature set. Unpaired t-tests with Benjamini–Hochberg correction identified features that differed significantly between positive and negative groups across detection-window combinations; the intersection of significant features across windows (e.g., 5–1 days and 3–1 days before the test) served as model inputs. Across cohorts, 28–31 features overlapped as significant, and the top-ranked features tended to be RHR-based.
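As a concrete illustration of the four deviation metrics, the sketch below computes them for one participant's daily series. It assumes pandas, and all names are illustrative; the paper does not publish this code.

```python
# Illustrative sketch of the paper's four baseline-deviation metrics,
# computed for each daily value in the detection window.
import pandas as pd

def deviation_features(baseline: pd.Series, detection: pd.Series) -> pd.DataFrame:
    """baseline/detection: daily values (e.g., RHR or steps) for one person."""
    b_median = baseline.median()
    b_mean = baseline.mean()
    b_sd = baseline.std()
    b_iqr = baseline.quantile(0.75) - baseline.quantile(0.25)

    delta = detection - b_median
    return pd.DataFrame({
        "delta": delta,                           # raw change from baseline median
        "delta_normalized": delta / b_median,     # scaled by baseline median
        "delta_standardized": delta / b_iqr,      # scaled by baseline IQR
        "z_score": (detection - b_mean) / b_sd,   # standard score vs baseline
    })

def summarize(dev: pd.DataFrame) -> pd.Series:
    """Detection-window summary statistics that become model features."""
    out = dev.agg(["mean", "median", "max", "min"]).T
    out["range"] = dev.max() - dev.min()
    return out.stack()  # one feature per (deviation metric, statistic) pair

# Example: ~40 baseline days of RHR vs the 5-day optimal detection window.
# features = summarize(deviation_features(baseline_rhr, detection_rhr))
```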
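The significance-screening step can be sketched with scipy and statsmodels; the data layout (a feature matrix per detection window plus binary labels) is an assumption for illustration.

```python
# Sketch of the significance-screening step: unpaired t-tests per feature
# with Benjamini-Hochberg FDR correction. The data layout is an assumption.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def significant_features(X: np.ndarray, y: np.ndarray, names, alpha=0.05):
    """X: (n_samples, n_features); y: 1 = COVID-positive, 0 = negative."""
    pos, neg = X[y == 1], X[y == 0]
    pvals = np.array([ttest_ind(pos[:, j], neg[:, j]).pvalue
                      for j in range(X.shape[1])])
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return {name for name, keep in zip(names, reject) if keep}

# The paper intersects the significant sets from different detection windows
# (e.g., 5-1 and 3-1 days before the test) to choose the model inputs:
# inputs = significant_features(X_5to1, y, names) & significant_features(X_3to1, y, names)
```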
Model development: Five classifiers were trained: logistic regression (LR), k-nearest neighbors, support vector machine, random forest, and gradient boosting. Nested stratified cross-validation (an outer 10-fold loop for model selection and an inner 5-fold grid search for hyperparameters) was applied to the training data. Recall was the model-selection metric, chosen to prioritize minimizing false negatives. Reported performance metrics were AUC-ROC and AUC-PR, with the latter emphasized because of class imbalance. Feature importance for LR was assessed via the absolute coefficients of the optimized model trained on the full training set.

Evaluation: Models were evaluated via nested cross-validation on the training sets and on the independent test set of each cohort. Ablation models built from only steps-based or only RHR-based features assessed the relative contribution of each signal. The ITA framework ranks individuals by predicted probability to simulate allocation of a limited number of tests, comparing positivity rates under ITA against a Random Testing Allocation (RTA) null model across testing capacities (e.g., 2.5–30% of the population). Performance for symptomatic versus asymptomatic individuals was examined using self-reported symptoms during the detection window or on the test date.
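A minimal sketch of the nested stratified cross-validation, assuming scikit-learn; the hyperparameter grid and synthetic data are placeholders, not the paper's actual search space.

```python
# Sketch of the nested stratified cross-validation: an inner 5-fold grid
# search tunes hyperparameters on recall; an outer 10-fold loop estimates
# AUC-ROC and AUC-PR. Grid and data below are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

X, y = make_classification(n_samples=400, weights=[0.87], random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="recall",  # selection metric: prioritize minimizing false negatives
    cv=inner,
)

scores = cross_validate(search, X, y, cv=outer,
                        scoring=["roc_auc", "average_precision"])
print("AUC-ROC:", scores["test_roc_auc"].mean())
print("AUC-PR: ", scores["test_average_precision"].mean())
```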
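The allocation comparison itself reduces to ranking by predicted risk; the following sketch (illustrative data and names throughout) contrasts positivity under ITA-style ranking with the random-allocation null.

```python
# Sketch of the ITA-vs-RTA comparison: rank individuals by predicted risk,
# test the top fraction the capacity allows, and compare positivity with
# random allocation. All data and names here are illustrative.
import numpy as np

def ita_positivity(probs, labels, capacity):
    """Positivity rate when testing the top `capacity` fraction by risk."""
    n_tests = max(1, int(round(capacity * len(probs))))
    top = np.argsort(probs)[::-1][:n_tests]  # highest predicted risk first
    return labels[top].mean()

def rta_positivity(labels, capacity, rng, n_draws=1000):
    """Expected positivity under random allocation (the null model)."""
    n_tests = max(1, int(round(capacity * len(labels))))
    return float(np.mean([
        labels[rng.choice(len(labels), n_tests, replace=False)].mean()
        for _ in range(n_draws)
    ]))  # converges to the cohort's base positivity rate

rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.12).astype(int)  # ~12% positives, as in the paper
probs = 0.3 * labels + 0.5 * rng.random(500)   # weakly informative risk scores

for cap in (0.025, 0.05, 0.10, 0.30):
    print(f"capacity {cap:.3f}: ITA {ita_positivity(probs, labels, cap):.2f} "
          f"vs RTA {rta_positivity(labels, cap, rng):.2f}")
```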
Key Findings
- Physiological and behavioral changes: COVID-19-positive individuals showed significant increases in daily RHR during the detection period relative to baseline (mean difference 16.5 ± 4.63 bpm; n=117; p<0.001) and significant decreases in step counts (−854.2 ± 386.5 steps/day; n=125; p<0.001). Deviations from baseline appeared as early as ~13 days before the positive test for RHR and ~10 days before for steps.
- Temporal signal: RHR features became significant earlier in the detection period than steps features, indicating that physiological changes precede activity changes during infection.
- Feature contributions: Across cohorts, 28–31 features were consistently significant; the top 7–9 were RHR-based. Median and mean ΔSteps were consistently among the most significant steps features; top RHR features included maximum ΔRHR and mean/median RHR Z-scores.
- Model performance: Logistic regression outperformed more complex models in cross-validated training across the AF, AHF, and FHF cohorts, suggesting that simpler, explainable models generalize better for this task.
- AUC metrics (FHF cohort, LR): cross-validated AUC-ROC 0.73 ± 0.12 and AUC-PR 0.55 ± 0.21; independent-test AUC-ROC 0.77 and AUC-PR 0.24.
- Steps vs RHR ablations: Steps-only models had higher AUC-ROC than RHR-only models, but RHR-only models had higher AUC-PR. Adding steps to RHR increased AUC-ROC by ~7–11%, while adding RHR to steps improved AUC-PR by ~38–50% in training; similar trends held on test (excluding steps reduced AUC-ROC by 5–11%; excluding RHR reduced AUC-PR by 5–25%). Interpretation: steps improved specificity (ROC), while RHR improved precision (PR).
- ITA vs RTA positivity rates: Baseline RTA positivity was ~11–12%. Under limited testing capacity (10–30% of the population), ITA increased positivity 2–3× in training (21–36%) and 1.5–2.5× in testing (19–29%). In FHF training with 2.5–5% capacity, ITA achieved up to a 6.5× increase (approaching 100% positivity). On independent test sets, the AF and FHF cohorts achieved 1.5–4× increases (e.g., 17–32% vs 11% under RTA), depending on capacity.
- Symptom status: ITA identified both symptomatic and asymptomatic positives. With 30% capacity, ITA-selected tests included 27% asymptomatic positives in cross-validated training and 17% in the independent test subset, indicating utility without reliance on self-reported symptoms.
Discussion
The study demonstrates that passively collected, wearable-derived digital biomarkers can prioritize diagnostic testing to improve efficiency under resource constraints. Physiological changes (RHR) precede behavioral changes (steps), and the two signals contribute complementary gains: steps improve the trade-off between true- and false-positive rates (ROC), whereas RHR improves precision in imbalanced settings (PR). Logistic regression's superior performance suggests that simpler, explainable models are suitable for this task and may translate more readily into practice. By ranking individuals for limited testing slots, ITA increases positivity rates relative to random allocation, potentially capturing the same number of positives with fewer tests, or more positives with the same number of tests. Importantly, ITA identified asymptomatic infections, extending its utility beyond symptom-based screening. These findings support deploying ITA-like systems to improve surveillance in workplaces, schools, and communities, especially during surges and supply shortages.
Conclusion
The paper introduces and validates an Intelligent Testing Allocation framework that leverages wearable-derived RHR and step data to prioritize diagnostic testing for COVID-19. ITA increased testing positivity rates by 1.5–6.5× across testing capacities, with robust cross-validated performance and validation on independent test sets. RHR changes appear earlier than activity changes, and combining the two feature types yields complementary gains across the ROC and PR trade-offs. The approach could extend to other infectious diseases and strengthen pandemic preparedness. Future work should evaluate real-time deployment, explore methods to address class imbalance, integrate more advanced AI (e.g., reinforcement learning, graph methods), and assess effectiveness across diverse devices and populations, including implementation studies comparing ITA-guided testing to standard surveillance.
Limitations
- Real-time deployment was not performed; practical performance remains to be tested.
- The bring-your-own-device design introduces selection bias (affordability, access) and device heterogeneity (sampling rates, differing definitions of RHR/steps), adding noise.
- Missing data due to non-wear and charging (e.g., Apple Watch) can confound estimates; stringent inclusion criteria reduced the available data.
- Diagnostic results and symptoms were self-reported for some participants, with potential misclassification.
- Class imbalance (12–15% positives) limits PR performance; resampling was not effective because the feature distributions overlap.
- Exact infection exposure and symptom-onset dates are uncertain in real-world data; timelines are approximate.
- Only a subset of the larger cohort had sufficient wearable data, potentially limiting generalizability.