
Medicine and Health
Development and evaluation of an artificial intelligence system for COVID-19 diagnosis
C. Jin, W. Chen, et al.
Cheng Jin and colleagues developed a deep convolutional neural network system for rapid COVID-19 diagnosis on chest CT. Trained and validated on a large multicenter dataset, the system matched or exceeded experienced radiologists on difficult differential diagnoses while returning results in seconds, and its attention maps link predictions to known imaging phenotypes.
~3 min • Beginner • English
Introduction
The rapid global spread of SARS-CoV-2 necessitates fast, accurate diagnosis to control transmission and guide treatment. While RT-PCR is the reference standard, its sensitivity can be suboptimal due to low viral load or laboratory error, and test-kit availability varies, especially in developing regions. Chest imaging (CT and chest X-ray, CXR) has been widely used as a first-line tool; CT can reveal early lung lesions and achieves high sensitivity when interpreted by experienced radiologists. However, CT interpretation is time-consuming (hundreds of slices per scan), and distinguishing COVID-19 from other pneumonias (e.g., community-acquired pneumonia, CAP, and influenza) is challenging because imaging features overlap and vary within a disease across stages. During potential concurrent outbreaks (e.g., COVID-19 with influenza), radiology workloads could exceed capacity. AI, particularly deep learning, has achieved expert-level performance in many medical imaging tasks and offers efficiency, repeatability, and scalability. Given the high similarity among these diseases and their stage-dependent variability, an AI system tailored to differentiate COVID-19 from other pneumonias is needed. This study aims to develop and evaluate a multi-class, multicenter AI system for COVID-19 diagnosis on CT, compare CT- and CXR-based performance using paired data, benchmark the system against experienced radiologists, and provide interpretable analyses linking network attention to CT phenotypes.
Literature Review
Several CT-based AI systems for COVID-19 have been reported. Zhang et al. developed a system on 4,154 patients that differentiates COVID-19 from other pneumonias and normal controls with AUC 0.9797; however, their approach relies on lesion segmentation (Dice coefficient ~0.662), and manual segmentation annotation is costly. Li et al. trained a slice-level feature extractor with volume-level fusion on 3,322 subjects (COVID-19, CAP, healthy), achieving AUC 0.96; their slice-to-volume fusion increases memory demand without exploiting richer 3D features. Other slice-level methods are similar, while some 3D CNN approaches address only binary classification. CXR-based systems exist but typically include fewer COVID-19 subjects and lack quantitative paired CT–CXR comparisons. These limitations motivate a large-scale, multi-class, interpretable CT-based system with rigorous external validation and a paired modality comparison.
Methodology
Data: A multicenter retrospective dataset totaling 11,356 CT scans from 9,025 subjects across three Wuhan centers (Wuhan Union Hospital, Western Campus of Wuhan Union Hospital, Jianghan Mobile Cabin Hospital) and four public databases (LIDC-IDRI, Tianchi-Alibaba, MosMedData, CC-CCII). Classes: COVID-19, nonviral CAP, influenza-A/B, and non-pneumonia (including healthy subjects and nodule datasets). The Wuhan data comprised 4,260 scans from 3,177 subjects; COVID-19 cases (PCR-confirmed, including mild cases) were collected Feb 5–Mar 29, 2020, nonviral CAP from Jan–Nov 2019, influenza from Nov 2016–Nov 2019, and healthy controls from Jan–Feb 2020 (PCR-negative, no pneumonia on CT). For subjects with three or more scans, the last scan was excluded to avoid recovery-phase studies. Public datasets: LIDC-IDRI (1,009 scans) and Tianchi-Alibaba (1,200 scans) were used as non-pneumonia; CC-CCII and MosMedData were used for external testing only. Cohorts: Excluding MosMedData and CC-CCII, subjects were split ~1:1 into training (2,688 subjects; 3,263 scans) and test (2,688 subjects; 3,199 scans) cohorts with no subject overlap. External test cohorts: CC-CCII (2,539 subjects; 3,784 scans; differing category definitions, processed slices only) and MosMedData (1,110 scans: 254 non-pneumonia and 856 COVID-19, with four severity grades all mapped to COVID-19). Paired CT–CXR subset: localizer scans (CT scout views, which resemble CXR but are noisier) were available for 198 CAP and 468 COVID-19 subjects in training and 220 CAP and 469 COVID-19 subjects in test, enabling a CT-vs-CXR comparison and score-level fusion.
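The key constraint in the cohort construction is the subject-level split. A minimal sketch of such a split follows (illustrative only; `scans` and its tuple layout are hypothetical names, not the authors' code):

```python
# Subject-level train/test split sketch (illustrative; not the authors' code).
# Assumes `scans` is a list of (scan_path, subject_id, label) tuples.
import random

def split_by_subject(scans, train_frac=0.5, seed=0):
    """Split scans ~1:1 by subject so no subject appears in both cohorts."""
    subjects = sorted({subj for _, subj, _ in scans})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_train = int(len(subjects) * train_frac)
    train_subjects = set(subjects[:n_train])
    train = [s for s in scans if s[1] in train_subjects]
    test = [s for s in scans if s[1] not in train_subjects]
    return train, test

# Usage: train_scans, test_scans = split_by_subject(scan_list)
```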
AI system: Five components. (1) Lung segmentation: a 2D U-Net trained on manually annotated slices; segmentations served as masks and defined lung bounding boxes. (2) Slice diagnosis network: a 2D ResNet-152 (ImageNet-pretrained) classifying each slice as non-pneumonia, CAP, influenza-A/B, or COVID-19 (three classes for CC-CCII); inputs were lung-masked, cropped slices plus segmentation masks, which normalizes appearance across scanners. (3) COVID-infectious slice locating network: the same architecture, trained on COVID-19-positive subjects with manually marked lesion slices to detect lesion-containing slices. (4) Visualization and interpretation: Guided Grad-CAM derived attentional regions on slices; binarization and morphological operations extracted attention masks. (5) Image phenotype analysis: radiomics features (pyradiomics) were computed from attention regions across multiple image transforms (original, LoG, wavelet) and feature classes (first-order statistics, GLCM, GLSZM, GLRLM, NGTDM, GLDM); additional features included the distance from lesion centroid to the pleural edge, 2D contour fractal dimension, and 3D grayscale mesh fractal dimension. LASSO logistic regression selected the most discriminative features. Latent-feature analysis used 2048-D vectors (max-pooled pre-FC feature maps) for t-SNE visualization.
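A minimal PyTorch sketch of the slice-diagnosis component (2): an ImageNet-pretrained ResNet-152 with a four-way classification head. Channel handling and the toy input are assumptions, not the authors' released code:

```python
# Slice-diagnosis network sketch (PyTorch / torchvision; an approximation of
# component 2 above, not the authors' released code).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # non-pneumonia, CAP, influenza-A/B, COVID-19

def build_slice_classifier():
    model = models.resnet152(pretrained=True)  # ImageNet initialization
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # 4-way head
    return model

model = build_slice_classifier()
# Lung-masked, cropped CT slices replicated to 3 channels, 224x224 (assumed).
slice_batch = torch.randn(8, 3, 224, 224)
logits = model(slice_batch)            # shape (8, 4)
scores = torch.softmax(logits, dim=1)  # per-slice class probabilities
```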
Volume-level fusion: A task-specific fusion block aggregates slice scores: for each pneumonia class (CAP, influenza, COVID-19), the volume score is the average of the top-K (K=3) highest slice scores; the non-pneumonia score is the average across all slices. For pneumonia-vs-non-pneumonia, the pneumonia class scores are summed; for COVID-19-vs-other-pneumonia, scores of non-target classes are muted during fusion. A threshold of 0.5 was used for sensitivity/specificity reporting, and AUC was computed for ROC analyses.
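The fusion rule above is straightforward to express in NumPy; a minimal sketch (class ordering and function name are assumptions):

```python
# Volume-level fusion sketch (NumPy), following the rule described above:
# top-3 averaging per pneumonia class, all-slice averaging for non-pneumonia.
import numpy as np

def fuse_slice_scores(slice_scores, k=3, nonpneu_idx=0):
    """slice_scores: (num_slices, num_classes) softmax scores per slice.
    Returns one volume-level score per class."""
    num_classes = slice_scores.shape[1]
    volume = np.empty(num_classes)
    for c in range(num_classes):
        if c == nonpneu_idx:
            volume[c] = slice_scores[:, c].mean()        # all slices
        else:
            topk = np.sort(slice_scores[:, c])[-k:]      # K highest scores
            volume[c] = topk.mean()
    return volume

# Example: 35 slices, 4 classes (non-pneumonia, CAP, influenza, COVID-19).
scores = np.random.dirichlet(np.ones(4), size=35)
print(fuse_slice_scores(scores))
```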
Implementation: PyTorch 1.3.1; scikit-learn 0.22 (t-SNE); NumPy 1.15.3; SciPy 1.3.3.
Reader study: Three cohorts were drawn from the internal test set: (i) pneumonia vs non-pneumonia, 100 subjects (50 non-pneumonia, 25 CAP, 25 COVID-19); (ii) CAP vs COVID-19, 100 subjects (50/50); (iii) influenza-A/B vs COVID-19, 50 subjects (20/30). Five Wuhan Union Hospital radiologists (average 8 years of experience; each reading 3,000–5,000 CTs per year, with 500–700 COVID-19 reads in the prior 3 months) independently read cases in Slicer 4.10.2 with free window/zoom adjustment. Readers knew the task and the possible classes for each cohort. The AI used a fixed resampled volume size (224×224×35) and a fixed intensity window of (−1200, 700) HU.
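A minimal sketch of the fixed preprocessing the AI used (windowing to (−1200, 700) HU and resampling toward 224×224×35); the interpolation order and axis convention are assumptions:

```python
# CT preprocessing sketch: fixed intensity window plus volume resampling
# (illustrative; the authors' exact pipeline may differ).
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(hu_volume, window=(-1200, 700), shape=(35, 224, 224)):
    """hu_volume: (slices, H, W) array in Hounsfield units."""
    lo, hi = window
    clipped = np.clip(hu_volume, lo, hi)
    normed = (clipped - lo) / (hi - lo)           # scale to [0, 1]
    factors = [t / s for t, s in zip(shape, hu_volume.shape)]
    return zoom(normed, factors, order=1)         # trilinear resample

vol = np.random.randint(-1400, 800, size=(60, 256, 256)).astype(np.float32)
print(preprocess_volume(vol).shape)  # (35, 224, 224)
```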
CT vs CXR: A CNN-based classifier was trained on localizer (CXR-like) images to distinguish COVID-19 from CAP; its performance was compared with the CT-based system, and simple score-level fusion of the two was evaluated.
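Score-level fusion was described as "simple"; a weighted average of per-subject COVID-19 probabilities is one plausible reading (a sketch under that assumption, not the authors' exact rule):

```python
# Score-level fusion sketch: average CT and CXR (localizer) COVID-19
# probabilities per subject, then evaluate with ROC AUC. Averaging is an
# assumed instance of "simple score-level fusion".
import numpy as np
from sklearn.metrics import roc_auc_score

def fuse_and_evaluate(ct_probs, cxr_probs, labels, w=0.5):
    """Weighted average of per-subject probabilities from the two models."""
    fused = w * np.asarray(ct_probs) + (1 - w) * np.asarray(cxr_probs)
    return roc_auc_score(labels, fused)

# Toy example with three subjects (1 = COVID-19, 0 = CAP).
print(fuse_and_evaluate([0.9, 0.2, 0.7], [0.8, 0.4, 0.6], [1, 0, 1]))
```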
Key Findings
- Internal test cohort (3,199 scans from 2,688 subjects):
• Multi-way classification AUC 0.9781 (95% CI 0.9756–0.9804); accuracy 0.9151 (0.9115–0.9193).
• Class-wise AUC/Sens/Spec: Non-pneumonia 0.9752 / 0.9343 / 0.9801; CAP 0.9804 / 0.9687 / 0.9407; Influenza-A/B 0.9885 / 0.8307 / 0.9945; COVID-19 0.9745 / 0.8703 / 0.9660.
• COVID-infectious slice locating: AUC 0.9559; sensitivity 0.8009; specificity 0.9636.
- External CC-CCII test cohort (2,539 subjects; 3,784 scans; differing category definitions and processed slices only):
• Multi-way AUC 0.9299 (0.927–0.933); accuracy 0.8435 (0.8391–0.8483).
• Class-wise: Normal 0.9541 / 0.8561 / 0.9524; Common pneumonia 0.9098 / 0.8823 / 0.8685; COVID-19 0.9212 / 0.7799 / 0.9355.
- External MosMedData cohort (1,110 scans; 254 non-pneumonia, 856 COVID-19 with severities mapped to COVID-19):
• COVID-19 detection AUC 0.9325 (0.9257–0.9382); sensitivity 0.9446; specificity 0.6613.
- CT vs CXR (paired localizer subset of the internal test cohort):
• CT-based COVID-19 diagnosis: AUC 0.9847 (0.9822–0.9877); sensitivity 0.9762; specificity 0.9125.
• CXR-based (localizer) diagnosis: AUC 0.9527 (0.9474–0.9583); sensitivity 0.9623; specificity 0.7155.
• Simple score-level fusion: AUC 0.9894; sensitivity 0.9469; specificity 0.9503. CT significantly outperformed CXR (p<0.001); fusion provided slight gains.
- Reader study (five radiologists):
• Pneumonia vs non-pneumonia: AI AUC 0.9869; sensitivity 0.9404; specificity 1.0000. Human readers achieved very high accuracy, and the AI performed slightly below the readers on this easiest task.
• CAP vs COVID-19: AI AUC 0.9727; sensitivity 0.9591; specificity 0.9199. AI outperformed all readers; 37.5% (3/8) of AI errors were also reader errors; 88.5% (23/26) of reader errors were correctly classified by AI.
• Influenza-A/B vs COVID-19: AI AUC 0.9585; sensitivity 0.9496; specificity 0.8331. Readers averaged ~76% accuracy; 50% (3/6) of AI errors were also reader errors; 86.9% (20/23) of reader errors were correctly classified by AI.
• Speed: Mean human reading time 6.5 minutes per case vs AI 2.73 seconds.
- Interpretability and phenotype analysis:
• Guided Grad-CAM showed class-dependent attention: for CAP, emphasis on pleural-adjacent consolidation and effusion; for COVID-19, emphasis on ground-glass opacities (GGO); influenza and COVID-19 attention patterns overlapped (e.g., stripe-like opacities, GGO) yet remained distinguishable to the model (an attention-mask extraction sketch follows this list).
• t-SNE of 2048-D features separated classes; COVID-19 distributed into multiple clusters indicating phenotypic subtypes (e.g., small round GGOs; larger lesions with crazy paving; intermediate with fibrosis/consolidation).
• Radiomics on attention regions (665 features; 12 LASSO-selected, plus distance and fractal features) identified features that significantly distinguish CAP from COVID-19 (t-test significant; KS test significant for some features). Influenza-vs-COVID-19 features were not significantly different, reflecting the imaging similarity of the two diseases.
- Subset analyses:
• Diagnostic performance varied with age: fewer infectious slices and lower performance in younger patients; infectious slice ratio increased with age.
• Minimal performance differences by stage (I vs II); stage fusion slightly improved results.
• By gender, similar infectious slice counts but higher AUC in men than women.
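For the attention analysis referenced in the interpretability bullets above, the Methodology describes binarizing Guided Grad-CAM heatmaps and applying morphological operations; a minimal SciPy sketch (the threshold value and the specific operations are assumptions):

```python
# Attention-mask extraction sketch: binarize a Guided Grad-CAM heatmap and
# clean it with morphological operations (illustrative, not the authors' code).
import numpy as np
from scipy import ndimage

def attention_mask(cam, threshold=0.5):
    """cam: 2-D Guided Grad-CAM heatmap scaled to [0, 1].
    Returns a boolean mask of the attended region."""
    binary = cam >= threshold                               # binarization
    opened = ndimage.binary_opening(binary, iterations=2)   # drop specks
    return ndimage.binary_fill_holes(opened)                # solid regions

# Toy heatmap with one bright blob.
cam = np.zeros((64, 64))
cam[20:30, 25:40] = 0.9
print(attention_mask(cam).sum())  # area (in pixels) of the attention mask
```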
Discussion
The AI system addresses the need for rapid, accurate differentiation of COVID-19 from CAP, influenza, and non-pneumonia on chest CT. On a large, multicenter, multi-class dataset, it achieved high internal performance and generalized well to two external datasets with different data characteristics and populations, indicating robustness. Against experienced radiologists, the AI matched or exceeded performance on the more challenging differential tasks (COVID-19 vs CAP; COVID-19 vs influenza) and operated two orders of magnitude faster, suggesting utility as an independent reader for triage, decision support, and error checking. Subset analyses revealed that diagnostic performance correlates with disease burden (fewer lesion-containing slices and lower AUC in younger patients) and showed small gender and stage effects; combining stages provided modest gains. The paired CT–CXR analysis demonstrated that CT provides superior specificity and overall discrimination compared with CXR-like localizers, though CXR retains diagnostic value and simple fusion can improve performance in some cases. Interpretability via Guided Grad-CAM and radiomics linked model attention to known CT manifestations (e.g., subpleural GGOs, crazy paving), identified discriminative phenotypes for CAP vs COVID-19, and suggested multiple COVID-19 imaging subtypes in latent space, bridging model predictions with pathophysiological insights.
Conclusion
This work presents a large-scale, multi-class AI system for COVID-19 diagnosis on chest CT that achieves high accuracy, outperforms experienced radiologists on challenging differentials, generalizes to external datasets, and operates in seconds per case. It provides interpretable outputs via attention visualization and radiomics-based phenotype analysis, enabling linkage between AI decisions and known imaging features. A paired evaluation establishes CT’s superiority over CXR for COVID-19 diagnosis, with potential incremental benefits from simple fusion. Future research should expand datasets to include more pneumonia subtypes and other lung diseases, improve lesion-level interpretability by incorporating accurate segmentation, and integrate comprehensive clinical data (e.g., comorbidities, lab values) to enhance diagnostic capability and enable additional functions such as severity assessment and prognosis.
Limitations
- Data scope: While large and multicenter, additional subtypes of pneumonia and other lung diseases would further test generalizability and expand differential capability.
- Interpretability granularity: Guided Grad-CAM provides attention regions rather than precise lesion segmentation; phenotype analysis on exact lesion masks could yield more definitive feature associations.
- Modality comparison: CXR analysis used CT localizer images, which are noisier than standard radiographs; this may underestimate CXR performance. Lack of tightly time-paired standard CXR and CT limits definitive modality comparisons.
- External dataset differences: CC-CCII provides processed slices with differing category definitions, potentially lowering performance relative to native-volume data.
- Motion/artifact effects: Some CTs with motion artifacts affected performance; more targeted data are needed to quantify artifact impacts and optimize fusion methods.