Machine learning-based prediction of COVID-19 diagnosis based on symptoms

Y. Zoabi, S. Deri-Rozov, et al.

Discover a groundbreaking machine-learning model that predicts COVID-19 infection with remarkable accuracy using easily accessible features. This innovative research, conducted by Yazeed Zoabi, Shira Deri-Rozov, and Noam Shomron from Tel Aviv University, aims to enhance COVID-19 testing efficiency, especially in resource-limited settings.

Introduction
The study addresses the need for rapid, effective screening of COVID-19 during a global pandemic that strained healthcare systems and limited access to RT-PCR testing, particularly in resource-constrained settings. While prior prediction models incorporated CT scans, lab tests, and clinical symptoms, many relied on hospitalized cohorts and were less applicable to general population screening. Using nationwide data from the Israeli Ministry of Health (MoH) encompassing all RT-PCR–tested individuals during the early pandemic, the authors sought to develop and prospectively evaluate a simple machine-learning model that predicts SARS-CoV-2 positivity using eight binary, easily obtained features. The aim was to aid triage and prioritize testing, mitigating referral bias present in earlier studies and improving public health response where testing resources are limited.
Literature Review
Prior work developed models using imaging (CT), clinical symptoms, laboratory tests, or their integration to predict COVID-19, but most were derived from hospitalized patients, limiting generalizability to community screening. The paper references dashboards tracking COVID-19 in real time, AI models for CT-based diagnosis, and studies on symptom associations with test positivity in healthcare workers. A Cochrane review highlighted limitations in relying solely on signs and symptoms in primary care/outpatient settings. These gaps motivate a population-level, symptom-based, low-resource model for broader screening.
Methodology
- Data source: The publicly released Israeli Ministry of Health (MoH) dataset of all residents tested for SARS-CoV-2 by RT-PCR on nasopharyngeal swabs. Each record includes the test date and result, sex, a binary indicator for age ≥60, self-reported initial clinical symptoms, and whether the individual had known contact with a confirmed case.
- Cohorts: The training-validation set comprised 51,831 individuals (4,769 positives) tested March 22–31, 2020; the prospective test set comprised 47,401 individuals (3,624 positives) tested April 1–7, 2020. The training-validation set was split 4:1 into training and internal-validation subsets.
- Features (all binary): sex, age ≥60, cough, fever, sore throat, shortness of breath, headache, and known contact with a confirmed COVID-19 case.
- Model: A gradient-boosted decision-tree model implemented with LightGBM, which handles missing values natively. Early stopping was applied against the validation set with auROC as the optimization metric (a minimal training sketch follows this list).
- Interpretability: Feature contributions were assessed with SHAP (SHapley Additive exPlanations) values to identify the principal drivers of predictions.
- Evaluation: The model was evaluated prospectively on the held-out test week. The primary metric was auROC; precision-recall curves (PPV vs. sensitivity) were also generated. Sensitivity, specificity, PPV, NPV, FPR, FNR, FDR, accuracy, and their 95% confidence intervals were estimated by bootstrap with 1,000 repetitions.
- Bias mitigation: Because symptom reporting was more comprehensive among RT-PCR positives, the authors distinguished highly biased symptoms (headache, sore throat, shortness of breath) from more balanced ones (cough, fever), and trained and tested a secondary model without the highly biased symptoms to assess robustness. Simulated-missingness experiments additionally blanked 10% and 20% of negative symptom reports at random across all five symptoms in the test set.
- Data and code availability: The translated dataset, code, and hyperparameters are available at https://github.com/nshomron/covidpred.
- Ethics: Exempt from IRB review by the Tel Aviv University review board, as the study uses public MoH data.
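The authors' exact pipeline and hyperparameters live in the covidpred repository; the following is only a minimal sketch of the setup described above, assuming a pandas DataFrame loaded from a hypothetical CSV, with hypothetical column names for the eight binary features and the RT-PCR label.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names; the translated dataset at
# https://github.com/nshomron/covidpred may name these differently.
FEATURES = ["male_sex", "age_60_and_above", "cough", "fever",
            "sore_throat", "shortness_of_breath", "headache",
            "contact_with_confirmed"]
LABEL = "corona_result"  # 1 = positive RT-PCR, 0 = negative

df = pd.read_csv("corona_tested_individuals.csv")  # hypothetical file name

# 4:1 split of the March 22-31 cohort into training and internal validation.
X_train, X_val, y_train, y_val = train_test_split(
    df[FEATURES], df[LABEL], test_size=0.2, random_state=42,
    stratify=df[LABEL])

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {
    "objective": "binary",
    "metric": "auc",   # early stopping optimizes auROC, as in the paper
    "verbosity": -1,
}

# LightGBM treats NaN feature values as missing natively, so blank
# symptom fields need no imputation step.
model = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```

The boosting parameters here are placeholders; the repository lists the values actually used.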
Key Findings
- Prospective test performance: auROC 0.90 (95% CI: 0.892–0.905). Example operating points: sensitivity 87.30% at specificity 71.98%, or sensitivity 85.76% at specificity 79.18% (a bootstrap sketch for such intervals follows this list).
- Precision-recall: auPRC 0.66 (95% CI: 0.647–0.678).
- Feature importance (SHAP): Fever and cough were key predictors; known contact with a confirmed case was highly informative; male sex and age ≥60 also pushed predictions toward higher risk (see the SHAP sketch after this list).
- Robustness to reporting bias: Training and testing after excluding the highly biased symptoms (headache, sore throat, shortness of breath) still yielded auROC 0.862 (95% CI: 0.855–0.870), with a slightly altered SHAP profile.
- Simulated missingness: Randomly substituting 10% and 20% of negative symptom reports with missing values across all five symptoms in the test set left ROC performance largely intact, indicating robustness to incomplete symptom reporting.
- Dataset scale and composition: Training-validation set n = 51,831 (4,769 positives); prospective test set n = 47,401 (3,624 positives). Cough (48.2%), fever (44.5%), and known contact (48.2%) were more prevalent among positives, as were male sex (57.2% vs. 50.1% among negatives) and age ≥60 (19.7% vs. 14.9%).
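For readers who want to reproduce the interpretation step, SHAP's TreeExplainer works directly on LightGBM boosters. This is a minimal sketch continuing from the hypothetical `model` and `X_val` of the earlier training sketch, not the authors' exact plotting code.

```python
import shap

# TreeExplainer computes exact SHAP values for tree ensembles such as
# the LightGBM model trained above.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# Some shap versions return one array per class for binary models;
# keep the positive-class array in that case.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# The summary plot ranks features by mean absolute contribution, which
# is how fever, cough, and known contact emerge as the main drivers.
shap.summary_plot(shap_values, X_val)
```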
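The reported confidence intervals come from bootstrap resampling with 1,000 repetitions. A minimal percentile-bootstrap sketch for auROC, assuming arrays of test labels and predicted scores, could look like this (the paper's exact resampling procedure may differ in detail):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for auROC (1,000 repetitions, as in the paper)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                       # skip single-class resamples
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(np.mean(scores)), (float(lo), float(hi))

# Example (hypothetical variables): y_test are RT-PCR labels for the
# April 1-7 cohort, and scores = model.predict(X_test) are predicted
# probabilities from the LightGBM booster.
# mean_auc, (lo, hi) = bootstrap_auroc_ci(y_test, scores)
```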
Discussion
The model addresses the need for rapid, scalable screening by leveraging eight simple, binary features that can be obtained through brief questioning, enabling triage and prioritization of RT-PCR testing when resources are constrained. Findings confirm the central role of fever, cough, and known exposure in predicting positivity, aligning with known transmissibility and symptom profiles, and corroborate observed sex differences and age-related risk. Despite biases in self-reported symptoms and missing data, the model maintained high discrimination in prospective testing and in stress tests removing biased symptoms or simulating increased missingness. These results suggest applicability to public health screening and resource allocation in general populations, including in low- and middle-income settings, and provide a transparent approach with interpretable feature contributions via SHAP. Continued data sharing and integration of additional relevant symptoms or risk factors could further enhance performance as understanding of COVID-19 evolves.
Conclusion
Using nationwide Israeli MoH data, the authors developed and prospectively validated a simple, interpretable machine-learning model that predicts SARS-CoV-2 RT-PCR positivity from eight binary, easily obtainable features. The model achieved high accuracy and demonstrated robustness to reporting biases and missing data, supporting its utility for screening and prioritizing testing where resources are limited. Future work should incorporate additional predictive symptoms (e.g., anosmia/ageusia) and richer exposure details, validate across diverse populations and time periods, and adapt the framework for future respiratory epidemics.
Limitations
- Reliance on self-reported symptoms introduces reporting bias and potential mislabeling, particularly among RT-PCR–negative individuals.
- Exposure (contact) information lacks granularity; details such as contact duration or indoor/outdoor context were not available.
- Some symptoms found highly predictive in other studies (e.g., loss of smell or taste) were not collected in the dataset.
- Negative symptom entries may reflect unreported values rather than true absence; although the simulated-missingness analyses suggest robustness, real-world missingness patterns may differ (a sketch of this stress test follows this list).
- Symptom-reporting biases differ between positives and negatives; while most tested individuals had indications for testing (reducing referral bias), residual biases cannot be fully excluded.
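To make the missingness stress test concrete, here is a minimal sketch of blanking a random fraction of negative symptom reports in a test DataFrame, continuing the hypothetical column names and `model` from the earlier sketches. Because LightGBM treats NaN as missing, the degraded frame can be scored directly.

```python
import numpy as np
import pandas as pd

# Hypothetical names for the five self-reported symptom columns.
SYMPTOMS = ["cough", "fever", "sore_throat",
            "shortness_of_breath", "headache"]

def blank_negative_reports(df: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Replace a random fraction of negative (0) symptom reports with NaN,
    mimicking the paper's 10%/20% simulated-missingness experiment."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[SYMPTOMS] = out[SYMPTOMS].astype("float64")  # allow NaN in binary columns
    for col in SYMPTOMS:
        neg_idx = out.index[out[col] == 0].to_numpy()
        blank = rng.choice(neg_idx, size=int(frac * len(neg_idx)), replace=False)
        out.loc[blank, col] = np.nan
    return out

# Example (hypothetical X_test): score the degraded test set at both
# missingness levels and recompute auROC on each.
# for frac in (0.10, 0.20):
#     scores = model.predict(blank_negative_reports(X_test, frac))
```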