Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19

Medicine and Health

S. Subudhi, A. Verma, et al.

This study compares 18 machine learning algorithms for predicting COVID-19 outcomes, showing that ensemble models outperform other approaches and identifying key biomarkers for ICU admission and mortality. The research was conducted by Sonu Subudhi, Ashish Verma, Ankit B. Patel, C. Corey Hardin, Melin J. Khandekar, Hang Lee, Dustin McEvoy, Triantafyllos Stylianopoulos, Lance L. Munn, Sayon Dutta, and Rakesh K. Jain.

Introduction
The COVID-19 pandemic created significant strain on healthcare resources, making early identification of patients at risk for ICU admission and mortality critical for triage and treatment decisions. Machine learning (ML) has been used in clinical prediction for conditions like acute kidney injury and septic shock, but few studies have systematically compared ML algorithms for predicting ICU admission and mortality in COVID-19. This study aims to evaluate and compare 18 ML algorithms using data available at initial ED presentation to predict ICU admission within 5 days and mortality within 28 days among COVID-19 patients in a multi-hospital system. The objective is to identify the best-performing algorithms and the most important predictive variables, and to validate model performance on a temporally distinct cohort.
Literature Review
Prior work has applied ML to predict outcomes in hospitalized and outpatient settings, including COVID-19 severity and respiratory decompensation, but most studies focused on limited algorithm sets or phenotype identification rather than a comprehensive algorithm comparison. Earlier COVID-19 studies identified biomarkers (e.g., procalcitonin, renal dysfunction) associated with severity and mortality, but did not systematically benchmark diverse ML approaches. This study addresses that gap by quantitatively comparing 18 algorithms across categories (ensemble, linear, SVM, tree-based, naïve Bayes, Gaussian process, nearest neighbor, discriminant analysis, neural network) for ICU admission and mortality prediction and by interpreting model features using SHAP.
Methodology
Study design: Retrospective multi-hospital cohort from the Mass General Brigham (MGB) system.
Inclusion: Patients positive for SARS-CoV-2 with an ED encounter.
Cohort periods: Development cohort, March–April 2020; temporal validation cohort, May–August 2020.
Outcomes: ICU admission within 5 days of ED arrival; mortality within 28 days of ED arrival.
Data sources: EHR-derived demographics, comorbidities (Charlson index), home medications (e.g., ACEI/ARB), vital signs, first laboratory values after ED admission, admitting service, ventilator use (included only in mortality models), and other clinical variables.
Variable handling: Covariates included age (10-year intervals), race (African American or other), BMI, modified Charlson Comorbidity Index, ACEI/ARB use, hypertension (>140/90 mmHg), and eGFR <60 ml/min as categorical features; laboratory and physiologic measures were treated as continuous features.
Data cleaning: Patients with missing dependent variables (outcomes) or obviously incorrect entries were excluded. Missing predictor values were imputed using k-nearest neighbor imputation.
Class imbalance: The majority class was randomly undersampled to create balanced training datasets; excluded and included majority-class patients were compared to assess bias (no significant differences).
Model development: Eighteen algorithms from nine categories were evaluated. Hyperparameter tuning used grid search (GridSearchCV) with 5-fold stratified cross-validation; the F1 score was the primary selection metric due to class imbalance (see the sketch at the end of this section).
Performance metrics: ROC AUC, PR AUC, F1, recall, precision, balanced accuracy, and Brier score; 95% CIs via the t-distribution (n=5 folds).
Temporal validation: The best-tuned models were evaluated on the temporally distinct May–August 2020 cohort (imbalanced).
Model interpretation: SHAP (TreeSHAP) analysis was performed for Random Forest models to assess variable importance and the directionality of effects.
Cohort characteristics: Development dataset n=3597 (48.7% women; median age 55); ICU admission within 5 days n=486 (13.5%); 28-day mortality n=344 (9.6%). Temporal validation dataset n=1711; ICU admissions n=146 (8.5%); deaths n=78 (4.5%). Group comparisons showed similar distributions of age, gender, and race; BMI differed significantly.
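As a rough illustration of the pipeline described above, the Python sketch below chains k-nearest-neighbor imputation, random undersampling of the majority class, and GridSearchCV with 5-fold stratified cross-validation scored by F1. The file name, outcome column, hyperparameter grid, and the undersampling helper are assumptions for illustration only, not the authors' code.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

def undersample_majority(X, y, seed=0):
    # Randomly drop majority-class rows until both classes are equally represented.
    rng = np.random.default_rng(seed)
    minority = y.value_counts().idxmin()
    min_idx = y[y == minority].index
    maj_idx = rng.choice(y[y != minority].index.to_numpy(), size=len(min_idx), replace=False)
    keep = min_idx.append(pd.Index(maj_idx))
    return X.loc[keep], y.loc[keep]

# Hypothetical development cohort: one row per ED encounter, numeric/one-hot features,
# with a binary 'icu_5day' outcome column.
df = pd.read_csv("cohort_march_april_2020.csv")
y = df.pop("icu_5day")
# Impute missing predictors with k-nearest neighbors, then balance the classes.
X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df), columns=df.columns)
X_bal, y_bal = undersample_majority(X, y)

# Grid search over a small illustrative hyperparameter grid, selecting on mean F1
# across 5 stratified folds.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_bal, y_bal)
print(grid.best_params_, round(grid.best_score_, 3))

The same loop would be repeated for each of the 18 algorithms, with algorithm-specific grids.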
Key Findings
- Cross-validation performance for ICU admission: Ensemble models achieved mean F1 ≥0.80. Specific F1 (95% CI): AdaBoost 0.80 (0.75–0.85), Bagging 0.81 (0.77–0.85), Gradient Boosting 0.81 (0.77–0.85), Random Forest 0.81 (0.78–0.84), XGBoost 0.80 (0.76–0.84), Extra Trees 0.80 (0.76–0.84). Logistic Regression 0.77 (0.73–0.81), Decision Tree 0.78 (0.76–0.80), LDA 0.77 (0.72–0.82), QDA 0.79 (0.78–0.80), MLP 0.77 (0.74–0.80). PassiveAggressive, Perceptron, and LinearSVC performed worse. Ensemble and tree-based categories had significantly higher F1 than most other categories.
- Cross-validation performance for mortality: Ensemble models had mean F1 ≥0.83. AdaBoost 0.84 (0.81–0.87), Bagging 0.83 (0.80–0.86), Gradient Boosting 0.86 (0.84–0.88), Random Forest 0.87 (0.85–0.89), XGBoost 0.85 (0.84–0.86), Extra Trees 0.87 (0.85–0.89). LDA 0.88 (0.86–0.90), QDA 0.81 (0.77–0.85), GaussianNB 0.82 (0.78–0.86), Decision Tree 0.82 (0.77–0.87). Logistic Regression was lower at 0.76 (0.70–0.82); the lowest F1 scores were for PassiveAggressive, Perceptron, LinearSVC, and KNN.
- Brier scores: Ensemble methods showed lower Brier scores (more accurate probability estimates): ICU models had mean Brier <0.15 (except AdaBoost); mortality models had mean Brier <0.12 (except AdaBoost).
- Temporal validation: All models' performance decreased on the temporally distinct cohort, but ensemble-based methods remained best for both ICU admission and mortality in terms of F1 and PR AUC; logistic regression and LDA had comparable F1 for ICU but lower PR AUC than ensembles. Calibration plots indicated general overestimation of risk; ensemble methods (e.g., Random Forest) were closer to the true risk than others.
- Variable importance (SHAP, Random Forest): For ICU admission, higher CRP, LDH, neutrophil percentage, and respiratory rate increased risk; lower oxygen saturation and lower lymphocyte percentage were associated with higher risk. For mortality, ventilator use, eGFR <60 ml/min/1.73 m², higher neutrophil percentage, higher serum potassium, lower lymphocyte percentage, and higher procalcitonin increased risk; respiratory rate and serum anion gap were also important. In temporal validation, ICU predictors remained similar; for mortality, D-dimer and initial oxygen saturation gained importance, while ventilator use decreased in importance (see the sketch after this list).
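To make the reported metrics and SHAP rankings concrete, the sketch below computes per-fold ROC AUC, PR AUC, F1, and Brier score for a Random Forest, then ranks features by mean absolute TreeSHAP value. It reuses the hypothetical X_bal and y_bal objects from the methodology sketch and is an illustration, not the authors' code.

import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, brier_score_loss
from sklearn.model_selection import StratifiedKFold

def cv_metrics(model, X, y, n_splits=5, seed=0):
    # Per-fold metrics matching those reported above (ROC AUC, PR AUC, F1, Brier).
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for tr, te in skf.split(X, y):
        model.fit(X.iloc[tr], y.iloc[tr])
        prob = model.predict_proba(X.iloc[te])[:, 1]
        pred = (prob >= 0.5).astype(int)
        rows.append({
            "roc_auc": roc_auc_score(y.iloc[te], prob),
            "pr_auc": average_precision_score(y.iloc[te], prob),
            "f1": f1_score(y.iloc[te], pred),
            "brier": brier_score_loss(y.iloc[te], prob),
        })
    return rows

rf = RandomForestClassifier(n_estimators=300, random_state=0)
folds = cv_metrics(rf, X_bal, y_bal)  # X_bal, y_bal assumed from the earlier sketch
print({k: round(np.mean([r[k] for r in folds]), 3) for k in folds[0]})

# TreeSHAP on a refit model: mean |SHAP value| per feature gives a rough importance ranking.
rf.fit(X_bal, y_bal)
sv = shap.TreeExplainer(rf).shap_values(X_bal)
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]  # return shape differs across shap versions
importance = np.abs(sv_pos).mean(axis=0)
print(sorted(zip(X_bal.columns, importance), key=lambda t: -t[1])[:10])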
Discussion
This study systematically compared 18 machine learning algorithms for predicting ICU admission and 28-day mortality using data available at ED presentation for COVID-19 patients. Ensemble-based methods consistently outperformed most other algorithm classes in cross-validation and maintained superior relative performance in temporal validation despite an overall performance drop. Feature importance analyses aligned with clinical knowledge—markers of inflammation (CRP, procalcitonin), tissue injury (LDH), hypoxemia (oxygen saturation, respiratory rate), immune response (neutrophil/lymphocyte percentages), and renal dysfunction (eGFR <60 ml/min/1.73 m²) were key predictors. Temporal shifts in mortality feature importance (e.g., increased roles of D-dimer and initial oxygen saturation, decreased role of ventilator use) may reflect evolving clinical management, treatments, and disease dynamics over time. The findings address the research question by identifying ensemble methods as the most effective approach within this setting and by elucidating clinically meaningful predictors to inform decision-making.
Conclusion
A comprehensive benchmarking of 18 ML algorithms showed that ensemble-based methods are most effective for predicting 5-day ICU admission and 28-day mortality in COVID-19 patients at ED presentation. The study identified key predictors—kidney function, lymphocyte and neutrophil percentages, CRP, LDH, oxygen saturation, respiratory rate, procalcitonin, and serum potassium—and demonstrated temporal shifts in mortality predictors. These models can support triage and treatment decisions, but further external validation in broader, multi-regional cohorts and prospective evaluation are needed. Future research should incorporate time-series data, address calibration uniformly, and explore strategies to mitigate performance degradation over time.
Limitations
- Heterogeneous model calibration: some algorithms do not output probabilities, limiting uniform calibration; the Brier score was used as an aggregate error metric but does not ensure good calibration.
- The primary selection metric was F1 (beta=1), which can be less intuitive; alternative F-beta choices may be preferable depending on clinical priorities (see the sketch after this list).
- Missing predictor data were imputed via k-nearest neighbors, which can introduce distortion despite advantages over simpler methods.
- SHAP (TreeSHAP) interpretation requires algorithm-specific adaptations; computational considerations and the generalizability of importance across methods may vary.
- Some laboratory results may be unavailable during the ED encounter, limiting real-time triage utility.
- The lack of pre-ED time-course features limited assessment of symptom trajectory and dynamic risk.
- Temporal validation showed a performance decline, potentially due to changes in management, evolving disease, and imbalanced outcome prevalence.
- Training used random undersampling of the majority class; while efforts were made to assess bias, undersampling can affect representativeness.
- A single health system in Southern New England (including two tertiary centers) limits generalizability; resource availability may differ elsewhere.
- BMI distribution differed between cohorts; residual confounding may persist.
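Two of these points, the choice of F-beta and calibration beyond the Brier score, can be illustrated with a minimal sketch on placeholder predictions; none of the numbers below come from the study.

import numpy as np
from sklearn.metrics import fbeta_score, brier_score_loss
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                              # placeholder labels
y_prob = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 500), 0, 1)   # placeholder risk estimates
y_pred = (y_prob >= 0.5).astype(int)

# beta > 1 weights recall more heavily (missing a high-risk patient is costlier);
# beta < 1 weights precision more heavily.
print("F1:   ", fbeta_score(y_true, y_pred, beta=1.0))
print("F2:   ", fbeta_score(y_true, y_pred, beta=2.0))
print("Brier:", brier_score_loss(y_true, y_prob))

# Calibration curve: observed event rate vs. mean predicted risk in each bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")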
Listen, Learn & Level Up
Over 10,000 hours of research content in 25+ fields, available in 12+ languages.
No more digging through PDFs, just hit play and absorb the world's latest research in your language, on your time.
listen to research audio papers with researchbunny