Medicine and Health

Integrating AI/ML Models for Patient Stratification Leveraging Omics Dataset and Clinical Biomarkers from COVID-19 Patients: A Promising Approach to Personalized Medicine

B. Bello, Y. N. Bundey, et al.

This study, conducted by Babatunde Bello, Yogesh N Bundey, Roshan Bhave, Maksim Khotimchenko, Szczepan W Baran, Kaushik Chakravarty, and Jyotika Varshney, explores how AI and machine learning can stratify COVID-19 patients. By analyzing omics data and clinical biomarkers, the authors developed high-accuracy models to predict severity and survival and identified biomarkers linked to severe cases and survival rates. The research emphasizes the potential of personalized medicine in combating COVID-19 and other viral infections.

Introduction

The surge of COVID-19 patients during the pandemic, together with heterogeneous clinical presentations and variable outcomes, highlighted the need for accurate prognostication to prevent healthcare overload and guide treatment. Disease severity spans asymptomatic to critical illness and can culminate in multi-organ failure. Variants and long COVID added complexity to prognosis. Existing ML models using electronic health record features show promise for predicting severity, progression, and mortality; however, most do not incorporate patient genomics, which likely contributes to variability in disease severity. The study’s objective was to identify the most impactful clinical biomarkers and linked gene networks that contribute to severe COVID-19 and reduced survival, using ML for prediction and explainability, and weighted gene co-expression analysis to connect clinical biomarkers with underlying molecular pathways.

Literature Review

Prior work applied gradient boosting trees and other ML methods to predict COVID-19 progression, recovery, and mortality from demographics and routine labs, and deep learning survival models using baseline clinical features have shown good performance. However, these approaches generally did not integrate genomics, limiting insights into biological pathways driving severity. Other studies have used SHAP for explainability in mortality prediction but with fewer variables than the current analysis. Literature supports the prognostic value of SOFA score, LDH, BUN/creatinine, and BUN/albumin ratios for severity and mortality. The paper positions its contribution as integrating clinical biomarkers with transcriptomics via ML and gene network analysis to provide predictive performance with explainability and biological interpretation.

Methodology

Data sources: Clinical patient-level datasets were identified via systematic searches of PubMed/MEDLINE and Scopus; 22 primary-source studies were curated (Supplementary Data S1). Clinical data (Synapse ID: syn35874390) comprised 581 unique COVID-19-positive patients with 7707 time-stamped data points, including demographics, comorbidities, medications, blood cell counts, biochemical/inflammatory biomarkers, and SOFA scores. RNA-seq whole blood transcriptomes were obtained from GEO (GSE215865), with 1198 samples retained after filtering for metadata completeness.

Clinical data processing and feature engineering: Columns were grouped into descriptor sets; features were normalized and typed (numerical/categorical). Missing data were summarized (Supplementary Table S2). Multiple Imputation by Chained Equations (MICE) was applied for some analyses. Feature selection used automated statistical learning, Recursive Feature Elimination, and Boruta to identify predictive subsets.
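The missingness summary step can be illustrated with a minimal sketch. It assumes patient records are held as dictionaries with None or NaN marking a missing value; the column names and toy values are hypothetical and are not taken from the study's dataset or Supplementary Table S2.

```python
import math


def missing_fraction(rows, columns):
    """Return {column: fraction of rows in which the value is missing}."""
    if not rows:
        raise ValueError("rows must not be empty")

    def is_missing(value):
        # Treat None and float NaN as missing.
        return value is None or (isinstance(value, float) and math.isnan(value))

    n = len(rows)
    return {c: sum(is_missing(r.get(c)) for r in rows) / n for c in columns}


# Hypothetical toy records (illustrative only).
rows = [
    {"ldh": 410.0, "bun": 22.0, "hba1c": None},
    {"ldh": None, "bun": 18.0, "hba1c": None},
    {"ldh": 350.0, "bun": float("nan"), "hba1c": 6.1},
    {"ldh": 290.0, "bun": 30.0, "hba1c": None},
]
summary = missing_fraction(rows, ["ldh", "bun", "hba1c"])
for column, fraction in summary.items():
    print(f"{column}: {fraction:.0%} missing")
```

A summary like this is what would typically inform whether a column is imputed (for example with MICE), passed with missing values to a tree-based model, or set aside.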

Predictive modeling: Two classification tasks were developed: (1) severity (severe vs moderate) and (2) survival (survived vs non-survived). Algorithms included Bayesian ridge, SVM, Random Forest, LightGBM, XGBoost, CatBoost, and multilayer perceptrons. The dataset was split 80/20 into train/test; training used 10-fold stratified cross-validation for hyperparameter tuning. Evaluation metrics were balanced accuracy and ROC-AUC on held-out test sets. For boosted tree models, metrics were reported with and without MICE, as these models can handle missingness internally. An additional experiment trained models on a training subset excluding patients with comorbidities to evaluate generalizability to mixed-comorbidity test sets.
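The two evaluation metrics named above can be sketched in plain Python. This is an illustrative implementation under simple assumptions (binary labels coded 0/1, a 0.5 decision threshold, toy scores), not the study's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for binary labels 0/1."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"class {cls} is absent from y_true")
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as 0.5 (the Mann-Whitney U formulation of ROC-AUC).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy example: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

print(round(balanced_accuracy(y_true, y_pred), 3))
print(round(roc_auc(y_true, scores), 3))
```

Balanced accuracy averages recall over the two classes, so a model cannot score well simply by predicting the majority class, which is why it suits the imbalanced survival task.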

Model explainability and patient stratification: SHAP was used to quantify feature impact and directionality on predictions. Top-20 influential features were identified by mean absolute SHAP values for both outcomes. Biomarker value ranges associated with high feature impact were derived from 5th–95th quantiles within high-severity or non-survival subsets. K-means clustering (optimal k by elbow method) on SHAP impact profiles was performed to explore patient stratification patterns and biomarker-driven clusters.
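The ranking and range-derivation steps can be sketched as follows, assuming SHAP values have already been computed and are available as a patients-by-features matrix. The feature names, SHAP values, and biomarker values below are hypothetical, and the quantile function uses linear interpolation between order statistics, which is one common convention rather than a confirmed detail of the study.

```python
def rank_by_mean_abs(shap_rows, feature_names, top_k):
    """Rank features by mean absolute SHAP value across patients."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


def quantile(values, q):
    """Quantile with linear interpolation between sorted values."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Hypothetical SHAP matrix: one row per patient, one column per feature.
feature_names = ["sofa", "ldh", "age"]
shap_rows = [
    [0.5, -0.2, 0.10],
    [0.7, 0.3, -0.05],
    [-0.6, 0.1, 0.00],
]
top = rank_by_mean_abs(shap_rows, feature_names, top_k=2)

# Hypothetical biomarker values within a high-impact patient subset.
subset_values = [1.0, 2.0, 3.0, 4.0, 5.0]
value_range = (quantile(subset_values, 0.05), quantile(subset_values, 0.95))
print(top, value_range)
```

Taking the mean of absolute values keeps features whose effects push predictions in both directions from cancelling out, while the 5th to 95th quantile range trims extreme values before a biomarker range is reported.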

Omics analysis: Differentially expressed genes (DEGs) between COVID-19-positive patients and controls were identified using limma (p<0.05 and |logFC|≥1). WGCNA was applied to construct weighted gene co-expression networks from the DEGs, define the adjacency and topological overlap matrix (TOM), and identify modules (minimum size 30), which were merged by eigengene similarity. Module-trait correlations were computed against severity, end-organ damage (EOD), comorbidities, and clinical biomarkers (BUN, creatinine, D-dimer, CK, LDH, SOFA). Gene Significance (GS) and Module Membership (MM) were computed; key genes were filtered at GS>0.2 and MM>0.8, with reported examples at the stricter GS>0.3 and MM>0.8. Functional enrichment used GO and KEGG via ShinyGO.
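The two threshold filters described above can be expressed as a short sketch. It assumes the limma results and the WGCNA GS/MM statistics have already been exported into dictionaries; the gene names and numbers are placeholders, and whether the p-value is adjusted is not stated in this summary, so the code simply applies the cutoffs as written.

```python
def select_degs(results, p_max=0.05, min_abs_logfc=1.0):
    """Keep genes with p < p_max and |logFC| >= min_abs_logfc.

    `results` maps gene -> (p_value, log_fold_change).
    """
    return [
        gene
        for gene, (p_value, logfc) in results.items()
        if p_value < p_max and abs(logfc) >= min_abs_logfc
    ]


def select_key_genes(stats, gs_min=0.2, mm_min=0.8):
    """Keep genes with GS > gs_min and MM > mm_min.

    `stats` maps gene -> (gene_significance, module_membership).
    """
    return [
        gene
        for gene, (gs, mm) in stats.items()
        if gs > gs_min and mm > mm_min
    ]


# Placeholder values for illustration only.
limma_results = {
    "GENE_A": (0.001, 2.3),
    "GENE_B": (0.20, 1.8),   # fails the p-value cutoff
    "GENE_C": (0.01, -1.4),  # down-regulated, passes on |logFC|
    "GENE_D": (0.03, 0.4),   # fails the fold-change cutoff
}
module_stats = {
    "GENE_A": (0.35, 0.90),
    "GENE_C": (0.25, 0.70),  # fails the MM cutoff
}

degs = select_degs(limma_results)
key_genes = select_key_genes(module_stats)
print(degs, key_genes)
```

Passing gs_min=0.3 reproduces the stricter cutoff used for the reported example genes.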

Key Findings

Predictive performance (full training): Severity model test balanced accuracy 91.6%, ROC-AUC 98.1%; Survival model test balanced accuracy 99.1%, ROC-AUC 99.9% (both best models LightGBM). Models were robust to missing data due to tree-based handling, supporting clinical deployment without complete biomarker panels.
Predictive performance (training without comorbidities): Severity ROC-AUC 93.5%, balanced accuracy 85.4%; Survival ROC-AUC 87.8%, balanced accuracy 69.8%. Severity prediction generalized reasonably without comorbidity features, whereas survival prediction degraded, indicating survival is more comorbidity-dependent.
SHAP explainability highlighted key clinical drivers:
- Severity: Higher SOFA score, elevated LDH, and higher BUN (especially with lower creatinine or albumin) were associated with severe cases, as were particular ranges of body weight, BMI, and age.
- Survival: Comorbidities (coronary artery disease, diabetes) were strongly associated with lower survival; SOFA, LDH, and BUN were also influential. Methylprednisolone use was associated with severe cases and lower survival in the dataset; this is an observational association, not a causal finding.
- Biomarker ranges with high impact aligned with the literature: SOFA ~4–14 (severity) and ~3–14 (mortality); LDH impactful above roughly 400 U/L; high BUN combined with lower creatinine and/or albumin increased risk; elevated BUN/creatinine and BUN/albumin ratios indicated severity and mortality.
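The BUN/creatinine and BUN/albumin ratios mentioned in the list above can be derived as explicit features with a sketch like the one below. The field names are hypothetical, the inputs are assumed to be in consistent units across patients, and no risk threshold is applied because none is stated here.

```python
def add_ratio_features(record):
    """Return a copy of `record` with BUN-based ratio features added.

    A ratio is None when either input is missing or the denominator is zero.
    """
    out = dict(record)
    ratio_specs = (
        ("bun_creatinine_ratio", "bun", "creatinine"),
        ("bun_albumin_ratio", "bun", "albumin"),
    )
    for name, numerator_key, denominator_key in ratio_specs:
        num = record.get(numerator_key)
        den = record.get(denominator_key)
        if num is None or den is None or den == 0:
            out[name] = None
        else:
            out[name] = num / den
    return out


# Hypothetical patient records (illustrative values only).
complete = add_ratio_features({"bun": 30.0, "creatinine": 1.5, "albumin": 3.0})
partial = add_ratio_features({"bun": 30.0, "creatinine": 1.5, "albumin": None})
print(complete["bun_creatinine_ratio"], complete["bun_albumin_ratio"])
print(partial["bun_albumin_ratio"])
```

Leaving a ratio as None rather than imputing it keeps the missingness visible to downstream models that can handle missing values natively.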
Patient clustering on SHAP profiles revealed biomarker-driven subgroups; one cluster (Cluster 5) had 93.1% severe cases without prominent single-biomarker deviations, suggesting multi-feature interactions.
WGCNA modules:
- MEcyan positively correlated with severity, EOD, and clinical biomarkers (BUN, creatinine, D-dimer, CK, LDH, SOFA); enriched for inflammatory/immune pathways (TNF, cytokine signaling, Toll-like receptor, IL-17). Genes include interleukin receptors (IL8RBP, IL10RB, IL17RA) and TLRs (TLR1/2/4/5/6/8) with MYD88, consistent with hyperinflammatory responses and cytokine storm mechanisms.
- MEdarkred showed moderate associations with comorbidities (CKD, heart failure, liver disease) and enrichment for platelet activation, coagulation, wound healing, hemostasis, and cell motility; exhibited negative correlations with EOD, BUN, SOFA but positive with LDH. High-GS/MM genes linked to clinical biomarkers included P2Y12, ECE1, MSANTD3-TMEFF1, PLEKHA8P1, NUTF2, SAV1, CXCR2P1, MSANTD3; additional notable markers include ADAM9, KIF1B, SLC22A4, MAPK14, MAP2K6, SLC2A3, IL17RA.
Overall, clinical features alone were highly predictive; integration with omics revealed biological pathways underpinning the most impactful biomarkers.

Discussion

High AUCs for both severity and survival confirm that routinely available clinical features can accurately stratify COVID-19 patients, enabling risk-based triage and management even with incomplete lab panels. SHAP explainability clarifies which biomarkers and value ranges drive predictions, offering actionable guidance (e.g., prioritizing patients with elevated SOFA, LDH, and BUN, and those with CAD/diabetes for mortality risk). The diminished survival-model performance without comorbidities underscores the importance of comorbid conditions in outcome risk. By integrating WGCNA with clinical ML, the study links predictive biomarkers to immune and coagulation pathways. MEcyan’s association with inflammatory signaling (TNF, TLRs, IL-17) supports the role of dysregulated innate immunity and cytokine storm in severe disease. MEdarkred’s enrichment in hemostasis and wound healing aligns with thromboinflammatory complications and organ damage seen clinically. These molecular insights support targeted therapeutic strategies such as modulation of cytokine pathways and management of coagulopathy and tissue injury, aiding personalized interventions. Compared with prior studies, this work leverages a larger and more diverse feature set with explicit handling of missingness and model explainability, providing both predictive accuracy and mechanistic interpretation conducive to clinical translation.

Conclusion

The study presents robust AI/ML models that accurately predict COVID-19 severity and survival from clinical biomarkers, and it demonstrates that clinical features alone can be sufficient for high-accuracy stratification (ROC-AUC 98.1% for severity and 99.9% for survival). SHAP-based explainability identifies key biomarkers (SOFA, LDH, BUN, creatinine, albumin) and comorbidities (coronary artery disease, diabetes) influencing outcomes. Gene co-expression modules link these biomarkers to inflammatory (TNF, TLR, IL-17) and hemostatic pathways, highlighting potential drug targets and diagnostic markers. The integrated framework supports personalized medicine by enabling risk stratification and biologically informed therapeutic prioritization, and it can be extended to other viral infections. Future work should include more complete and diverse biomarker panels, explicit ratio features (e.g., BUN/creatinine, BUN/albumin), external validation across institutions, and prospective implementation studies.

Limitations

Sparsity of laboratory biomarker data limited multivariate explainability and may have obscured the impact of features with high missingness (e.g., Hemoglobin A1C, IL-8 >90% missing).
Limited diversity and prevalence of some comorbidities constrained detection of their effects on severity.
Class imbalance (survival) required balanced accuracy metrics and may affect generalizability.
Training on a reduced, comorbidity-free subset (63% less data) led to expected performance drops, particularly for survival.
Observational associations (e.g., with medication use such as methylprednisolone) do not establish causality.
Integration between clinical and omics datasets is cross-sectional; temporal dynamics were not modeled.

