An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department



F. E. Shamout, Y. Shen, et al.

This research presents an AI system that predicts the deterioration of COVID-19 patients in the emergency department. The system combines a deep neural network that analyzes chest X-rays with a gradient boosting model that uses routine clinical variables, and the combination predicted deterioration better than either model alone. The study was conducted by Farah E. Shamout and colleagues and is aimed at supporting patient triage.

Introduction
The study addresses the urgent need for reliable triage tools for COVID-19 patients presenting to emergency departments, where resource constraints and incomplete disease understanding complicate decision-making. Because pulmonary involvement is a primary complication and chest X-ray is a first-line, low-cost, low-radiation modality, the authors hypothesize that machine learning models can learn prognostic signals from chest X-rays and routine clinical variables to predict short-term deterioration (intubation, ICU admission, or mortality). The purpose is to build and evaluate an AI system that estimates the risk and temporal evolution of deterioration to support ED triage and resource allocation.
Literature Review
Most imaging-based COVID-19 research has emphasized diagnosis rather than prognosis. Existing prognostic models typically rely on routinely collected clinical variables (vital signs and labs) known to predict deterioration. Some efforts have quantified chest X-ray severity via deep learning or manual scoring, but the prognostic role of deep learning on X-ray images remains underexplored, and prior work has not combined imaging and clinical variables in a unified prognostic system. The authors posit that imaging and clinical data contain complementary information for predicting deterioration, motivating a multi-modal approach.
Methodology
Study design and data: Retrospective development and evaluation using chest X-rays and associated clinical variables from NYU Langone Health during March 3–May 13, 2020. Initial dataset: 19,957 chest X-ray exams from 4,722 patients with PCR-confirmed COVID-19. After applying inclusion/exclusion criteria (exclude exams without reports or encounter info, exclude patients still hospitalized by cutoff date, ensure images precede adverse events, exclude already-intubated), the final dataset comprised 7,502 exams from 4,204 patients. Patient-level split: training included ED and inpatient exams; test set included ED exams only. Training set: 5,224 exams (5,617 images) from 2,943 patients. Test set: 770 exams (832 images) from 718 patients. Both frontal and a small number of lateral images (<50 total) were included.
Outcome definition: Deterioration is a composite of first occurrence of intubation, ICU admission, or in-hospital mortality; time to first event used if multiple occurred.
Models:
- Imaging model (COVID-GMIC): A deep convolutional neural network based on the Globally-Aware Multiple Instance Classifier (GMIC). Architecture includes: (1) global module producing saliency maps per time window (24, 48, 72, 96 h) via 1×1 conv + sigmoid; (2) local module extracting six ROI patches guided by saliency maps and processing each with a ResNet-18 to obtain features, with gated attention to weight ROI importance; (3) fusion module concatenating pooled global features and attention-weighted local representation to produce final prediction per time window via sigmoid. Input resolution H=W=1024; loss combines binary cross-entropy for global, local, and fusion outputs with L1 regularization to promote saliency sparsity. During inference, fusion output is used. Model produces saliency maps for interpretability.
- Clinical variables model (COVID-GBM): Gradient boosting decision tree (LightGBM) taking 58 features: demographics (age, weight, BMI), vital signs (heart rate, respiratory rate, temperature, systolic/diastolic BP, oxygen saturation, supplemental oxygen) and laboratory tests (24 types represented as max/min within 12h before vital signs, yielding 48 processed lab features). Missing values handled by carry-forward of most recent measurement; otherwise left missing for LightGBM. Predicts probabilities for each time window.
- Ensemble: Linear combination Y_ENSEMBLE = λ·ŷ_COVID-GMIC + (1−λ)·ŷ_COVID-GBM, with λ selected on validation by optimizing average of AUC and PR AUC for the four tasks.
Deterioration risk curves (COVID-GMIC-DRC): A discrete-time survival analysis variant using a GMIC-based network to estimate conditional probabilities p_i = P(T ≤ t_i | T > t_{i−1}) for t ∈ {3, 12, 24, 48, 72, 96, 144, 192 h}. DRC(t) = 1 − Π_{i: t_i ≤ t} (1 − p_i). Training minimizes negative log-likelihood accommodating uncensored and censored data. Uses ResNet-34 backbone with input H=W=512; outputs from global/local/fusion branches, with fusion used at inference. Calibration assessed via reliability plots; discrimination via concordance index among patients with events.
Image preprocessing: DICOM extraction, thresholding/normalization, cropping zero-valued borders, center-cropping/rescaling to uniform size; data augmentation with random horizontal flip (p=0.5), rotation (−45° to 45°), and translation.
Training and validation: GMIC models pretrained on ChestX-ray14; Adam optimizer, minibatch size 8; test-time augmentation with 10 transforms averaged. Hyperparameters tuned via random search with Monte Carlo cross-validation (80/20 splits, 3 random splits per configuration). Selected top three configurations, ensembled nine models by averaging predictions on the held-out test set. LightGBM hyperparameters tuned similarly.
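The ensemble rule and the deterioration risk curve described above both reduce to a few lines of arithmetic. The Python sketch below is illustrative only and is not the authors' code: the function names `ensemble_prediction` and `deterioration_risk_curve` are hypothetical, the example probabilities are made-up values rather than model outputs, and the conditional probabilities p_i are assumed to arrive in the same order as the time grid.

```python
from typing import Dict, Sequence

# Time grid in hours, taken from the text above.
TIME_GRID_H = (3, 12, 24, 48, 72, 96, 144, 192)


def ensemble_prediction(y_gmic: float, y_gbm: float, lam: float) -> float:
    """Linear combination: lam * imaging output + (1 - lam) * clinical output."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * y_gmic + (1.0 - lam) * y_gbm


def deterioration_risk_curve(
    cond_probs: Sequence[float],
    times: Sequence[int] = TIME_GRID_H,
) -> Dict[int, float]:
    """Cumulative risk DRC(t_k) = 1 - prod_{i <= k} (1 - p_i).

    cond_probs[i] is p_i = P(T <= t_i | T > t_{i-1}), one value per grid point.
    """
    if len(cond_probs) != len(times):
        raise ValueError("need one conditional probability per time point")
    survival = 1.0  # running probability of no event so far
    curve: Dict[int, float] = {}
    for t, p in zip(times, cond_probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("conditional probabilities must lie in [0, 1]")
        survival *= 1.0 - p
        curve[t] = 1.0 - survival
    return curve


# Hypothetical inputs, for illustration only.
example_curve = deterioration_risk_curve(
    [0.01, 0.02, 0.03, 0.05, 0.04, 0.03, 0.04, 0.03]
)
example_risk = ensemble_prediction(y_gmic=0.30, y_gbm=0.45, lam=0.4)
```

Because each factor (1 − p_i) is at most 1, the resulting curve is non-decreasing in t, which is what a cumulative risk should be. Selecting λ itself would require validation-set predictions and is not shown here.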
Evaluation: Metrics include ROC AUC, PR AUC with 95% CIs via 1000 bootstrap iterations; PPV/NPV reported after thresholding to reflect training set class distribution.
Reader study: 200 frontal ED X-rays (one per patient) sampled with stratification to enrich for adverse outcomes; two chest radiologists (3 and 17 years’ experience) provided probability estimates across time windows.
Prospective silent deployment: preliminary image-only ensemble (15 DenseNet-121 models) served via TensorFlow Serving in real-time on CPU; evaluated on 375 ED exams (May 22–June 24, 2020).
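A bootstrap confidence interval for ROC AUC, as used in the evaluation above, can be sketched as follows. This is a minimal standard-library illustration, not the authors' evaluation code: the percentile method, the function names `roc_auc` and `bootstrap_auc_ci`, the fixed seed, and the toy data are all assumptions. AUC is computed from its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half), which is quadratic in the sample size and therefore only suitable for small examples.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for ROC AUC (assumed method)."""
    roc_auc(labels, scores)  # fail early if a class is missing
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]


# Toy data, for illustration only.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.5, 0.6, 0.15]
toy_ci = bootstrap_auc_ci(toy_labels, toy_scores, n_boot=200)
```

The same resampling loop applies to PR AUC by swapping the metric function. Resampling by patient rather than by image would be another reasonable choice when one patient contributes several images; the text above does not say which unit was resampled.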
Key Findings
- Test-set performance (n=832 images): The ensemble (COVID-GMIC + COVID-GBM) achieved AUCs of 0.765 (24 h), 0.749 (48 h), 0.769 (72 h), and 0.786 (96 h) with 95% CIs: (0.712–0.817), (0.700–0.798), (0.724–0.818), (0.745–0.830). PR AUCs: 0.243, 0.332, 0.439, 0.517 with 95% CIs: (0.150–0.299), (0.237–0.410), (0.345–0.527), (0.429–0.600).
- Individual models: COVID-GBM outperformed logistic regression baselines across all windows and generally outperformed COVID-GMIC; imaging and clinical models captured complementary information, with ensembling consistently improving performance.
- Interpretability: COVID-GMIC saliency maps highlighted clinically relevant regions (e.g., airspace opacities, consolidation) and guided ROI extraction; examples demonstrated alignment with areas associated with deterioration.
- Reader study (n=200): COVID-GMIC achieved AUCs comparable to or exceeding two radiologists for ≥48 h windows; for the 96 h task, AUC was 0.741 for COVID-GMIC versus 0.740 and 0.713 for radiologists A and B, respectively; PR AUCs also competitive.
- Deterioration risk curves (COVID-GMIC-DRC): Demonstrated good discrimination with concordance index 0.713 (95% CI: 0.682–0.747) at 96 h; reliability plots indicated good calibration across time points.
- Prospective silent deployment (image-only, 15-model ensemble) on 375 exams with 10.1% positive outcomes within 96 h: AUC 0.717 (95% CI: 0.622–0.801), PR AUC 0.289 (95% CI: 0.181–0.465), comparable to retrospective performance; shifts in event prevalence and cohort characteristics likely contributed to modest performance decrease.
- Feature importance (clinical model): Temperature and age ranked among the most predictive features across time windows.
Discussion
The findings support the hypothesis that chest X-ray images contain prognostic information for short-term deterioration in COVID-19, and that combining imaging with routine clinical variables improves prediction over either modality alone. The interpretable GMIC-based imaging model provides saliency maps that highlight lung regions associated with risk, facilitating clinical understanding and trust. The multi-label formulation enables simultaneous estimation of risk across multiple short-term horizons, while the DRC model captures temporal evolution of risk with evidence of good calibration and discrimination. Comparison with radiologists suggests that data-driven methods can match or exceed human performance for prognosticating deterioration from ED chest radiographs, likely because subtle prognostic patterns may not correspond directly to visually assessed severity. Prospective silent deployment demonstrated technical feasibility and real-time operation within clinical workflows on commodity hardware, indicating translational potential. Together, these results demonstrate the value of multi-modal AI systems to support ED triage during pandemics and similar high-demand scenarios.
Conclusion
This study introduces and validates a multi-modal AI system that predicts short-term deterioration of COVID-19 patients in the ED using chest X-rays and routine clinical variables, and models the temporal evolution of risk via deterioration risk curves. On the held-out test set the ensemble reached AUCs of 0.749 to 0.786 across the four time windows, the imaging model performed comparably to two radiologists in a reader study, and a preliminary image-only ensemble was evaluated prospectively in a real-time silent deployment. Contributions include an interpretable GMIC-based architecture for prognostic imaging, an effective fusion with clinical data, and a discrete-time survival modeling approach for risk trajectories. Future work should explore more sophisticated multi-modal fusion strategies, incorporate longitudinal imaging and additional EHR data (e.g., symptoms, non-pulmonary complications), improve calibration for survival modeling with clinical variables, and conduct external, prospective validations across diverse sites and protocols to assess generalizability and clinical impact.
Limitations
- Single health system data (NYU Langone Health) limits generalizability; potential overfitting to local demographics and imaging protocols.
- Internal retrospective and prospective (silent) validation only; performance may vary with changes in outcome prevalence and clinical practices over time.
- Prospective deployment evaluated imaging-only models without integration of clinical variables or clinical interventions.
- COVID-GMIC processes single images without modeling longitudinal radiographic changes.
- DRC model did not include clinical variables due to calibration considerations; broader multi-modal survival calibration remains future work.
- Reader study radiologists did not have access to clinical variables; interpretability assessment was qualitative due to lack of segmentation ground truth.
- Limited number of lateral radiographs; potential impact on model performance for non-frontal views.