Introduction
The COVID-19 pandemic has significantly strained healthcare systems globally, highlighting the need for accurate patient diagnostics and effective management strategies. COVID-19 symptoms and disease progression vary widely, from asymptomatic infection to critical illness, which complicates treatment selection and prognosis. The emergence of new SARS-CoV-2 variants further complicates the situation by affecting vaccine efficacy and immunity. Existing predictive models for COVID-19 severity and survival often lack genomic information, which is crucial for understanding why disease severity differs between patients. This study addresses that gap by integrating omics datasets and clinical biomarkers into AI/ML models for patient stratification, with the goal of clarifying the biological pathways that affect disease severity and supporting the selection of appropriate treatment options. Its objective is to use ML predictive models to identify the biomarkers most strongly associated with severe COVID-19 and lower survival rates. By analyzing clinical data, including biomarker levels, the researchers assess how these biomarkers influence disease severity and survival, and which of them are most relevant for improving patient care.
Literature Review
The researchers searched the PubMed/MEDLINE and Scopus databases to gather clinical and omics data for COVID-19 patients. Their search strategy combined broad and specific terms related to COVID-19 and biomarkers. Over 300,000 articles were initially identified and then narrowed down to 22 articles providing primary patient data. The review highlights the existing use of AI and ML in COVID-19 research, including predictive models for disease severity and survival, and the application of deep learning combined with electronic health record (EHR) datasets for diagnosis. A noted limitation of previous models is the exclusion of genomic data, which this study aims to address.
Methodology
The study used a dataset of 581 unique patients with 7707 clinical data points, encompassing demographic parameters, comorbidities, blood cell counts, and biochemical and inflammatory biomarkers. The data were obtained from publicly available sources including Synapse and the Gene Expression Omnibus (GEO). Missing values were imputed using Multiple Imputation by Chained Equations (MICE) regression. Two LightGBM classifier models (boosted decision tree architectures) were trained, one for predicting COVID-19 severity and another for predicting survival. Model performance was evaluated using balanced accuracy and ROC-AUC scores. SHAP (SHapley Additive exPlanations) analysis was employed for model explainability, identifying the most impactful biomarkers and their value ranges. Weighted Gene Co-expression Network Analysis (WGCNA) was applied to a gene expression dataset of 1198 samples to identify gene modules correlated with COVID-19 severity, comorbidities, and clinical biomarkers. Functional enrichment analysis was performed using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) to understand the biological pathways involved. The study also analyzed a subset of the data excluding patients with comorbidities to assess the influence of comorbidities on the predictions.
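To make the explainability step more concrete, the following minimal sketch computes exact Shapley values, the quantity that SHAP approximates, for a single sample against a single baseline. It is an illustration under stated assumptions, not the study's pipeline: the function toy_risk, its coefficients, and the feature values are hypothetical and carry no clinical meaning, and the SHAP library's tree explainer uses a faster algorithm and a background distribution rather than one baseline sample.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for sample x relative to a single baseline sample.

    Features outside a coalition are replaced by their baseline value.
    Cost grows exponentially with the number of features, so this is
    only practical for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(features):
    """Hypothetical risk score over (LDH, BUN, albumin); NOT a clinical model."""
    ldh, bun, albumin = features
    score = 0.002 * ldh + 0.01 * bun - 0.1 * albumin
    if ldh > 400 and bun > 30:  # made-up interaction term
        score += 0.5
    return score


patient = [500, 40, 2.5]    # hypothetical values
reference = [200, 15, 4.0]  # hypothetical baseline
phi = shapley_values(toy_risk, patient, reference)
print([round(p, 3) for p in phi])
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the property that lets per-feature contributions be read as shares of one prediction.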
Key Findings
The best performing models achieved a balanced accuracy of 91.6% and an ROC-AUC of 98.1% for severity prediction, and a balanced accuracy of 99.4% and an ROC-AUC of 99.9% for survival prediction. SHAP analysis identified key biomarkers such as SOFA score, lactate dehydrogenase (LDH), blood urea nitrogen (BUN), serum creatinine, and albumin as highly impactful for both severity and survival. Comorbidities like coronary artery disease and diabetes were strongly associated with lower survival but not significantly with severity. The medication methylprednisolone showed a surprising association with both severe cases and lower survival, which requires further investigation. WGCNA identified two gene modules, MEcyan and MEdarkred, that correlated strongly with COVID-19 severity and clinical biomarkers. MEcyan correlated positively with severity and several biomarkers, while MEdarkred correlated with comorbidities. Functional enrichment analysis revealed that MEcyan genes were enriched in pathways related to inflammatory and immune responses (TNF signaling, cytokine signaling, Toll-like receptor signaling, IL-17 signaling), while MEdarkred genes were enriched in pathways associated with platelet activation, blood coagulation, and wound healing. The analysis also identified specific genes within the MEdarkred module (P2Y12, ECE1, MSANTD3-TMEFF1, PLEKHA8P1, NUTF2, SAV1, CXCR2P1, MSANTD3) associated with blood coagulation and wound healing pathways and with clinical biomarkers.
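For readers less familiar with the two reported metrics, here is a minimal standard-library sketch of how balanced accuracy and ROC-AUC can be computed for a binary classifier. The labels and scores are made-up toy values, not data from the study, and the function names are illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so a rare class counts as much as a common one."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This pairwise form is O(P * N), which is
    fine for small examples but slow for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 positives (e.g. severe), 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred), roc_auc(y_true, scores))
```

In this toy case plain accuracy is 0.875 while balanced accuracy is 0.75, which shows why balanced accuracy is the more informative figure when severe or fatal outcomes are a minority of cases.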
Discussion
The high accuracy of the ML models indicates that clinical features are sufficient for predicting COVID-19 severity and survival. The model explainability analysis adds insight into which specific features, and which value ranges of those features, are most impactful, which is crucial for clinicians. The identification of key biomarkers and gene modules associated with disease severity and survival enhances our understanding of the underlying biological mechanisms of COVID-19. These findings can contribute to the development of more targeted therapies and diagnostic tools. The findings are consistent with results from other studies, which supports the identified biomarkers and pathways. However, the study notes that the sparsity of laboratory biomarker data may have limited more granular observations. The lack of diversity in observed comorbidities might also have influenced the observed correlations. The unexpected association between methylprednisolone and severe cases or low survival highlights the need for further investigation. The integration of omics data with clinical biomarkers offers a more comprehensive approach to patient stratification than relying solely on clinical features.
Conclusion
This study demonstrates the effectiveness of an AI/ML-based approach for stratifying COVID-19 patients using clinical biomarkers and omics data, enabling accurate prediction of disease severity and outcomes. The high accuracy achieved (98.1% and 99.9% ROC-AUC for severity and survival models, respectively) underscores the value of this integrated approach. The identified key biomarkers and gene modules provide valuable insights into the disease's pathogenesis and can inform personalized medicine strategies. Future research could focus on validating these findings in larger and more diverse patient cohorts, exploring the potential of these biomarkers as drug targets or diagnostic tools, and extending this approach to other viral infections.
Limitations
The study acknowledges limitations related to sparsity in the laboratory biomarker data; some biomarkers (Hemoglobin A1C and Interleukin 8) had over 90% missing values, which might have influenced the models' predictive ability. The limited diversity of comorbidities in the dataset could have affected the analysis of how comorbidities contribute to COVID-19 severity and survival. The findings are based on a specific dataset and may not generalize to all populations. Further research is needed to validate them in broader populations.
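A sparsity check of the kind described here can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not the study's code: the column names, the toy rows, and the 0.9 threshold parameter are all hypothetical.

```python
def missing_fraction(rows, columns):
    """Fraction of rows in which each column is absent or None."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def sparse_columns(rows, columns, threshold=0.9):
    """Columns whose missing fraction exceeds the threshold."""
    fractions = missing_fraction(rows, columns)
    return [col for col in columns if fractions[col] > threshold]


# Hypothetical toy records: 20 patients, HbA1c measured only once,
# LDH missing for two patients.
rows = []
for patient_id in range(20):
    rows.append({
        "ldh": None if patient_id < 2 else 250 + patient_id,
        "hba1c": 6.1 if patient_id == 0 else None,
    })

print(missing_fraction(rows, ["ldh", "hba1c"]))
print(sparse_columns(rows, ["ldh", "hba1c"]))
```

Flagging such columns before imputation makes it easier to judge how much of a feature's apparent importance rests on imputed rather than measured values.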