Medicine and Health
Deep learning based automatic detection algorithm for acute intracranial haemorrhage: a pivotal randomized clinical trial
T. J. Yun, J. W. Choi, et al.
The study addresses the challenge of promptly and accurately diagnosing acute intracranial haemorrhage (AIH), a condition with high short-term mortality. Although MRI can detect AIH, CT remains the first-line modality due to availability, speed, cost, and patient tolerance. Interpreting brain CT for AIH is challenging, with risks of false negatives (delayed care) and false positives (unnecessary workup), compounded by high imaging volumes. Recent deep learning advances promise to improve diagnostic accuracy and efficiency. However, most prior AIH algorithms rely on expert-supervised labeling, which is susceptible to inter-observer variability due to heterogeneous AIH appearances. The authors hypothesize that combining supervised hemorrhage detection with unsupervised anomaly detection, alongside a joint CNN-RNN architecture to better leverage 3D contextual information, can improve AIH detection performance and assist clinicians across expertise levels. They aim to develop such an algorithm, validate it on a large external dataset, and evaluate its impact on reader performance in a randomized multi-reader study.
Prior work on AIH detection has focused on supervised learning with CNN-based models for detection, classification, and segmentation, showing good performance but limited by dependence on expert labels and inter-observer discrepancies due to variability in AIH size, morphology, and attenuation. Emerging approaches have explored joint CNN-RNN architectures to capture sequential context in CT stacks, reporting promising results for clinical workflow support. Unsupervised anomaly detection has been proposed to mitigate labeling limitations by learning normal appearances and flagging deviations. The literature highlights the need for robust performance on diverse external datasets and for evidence that AI assistance tangibly improves clinician diagnostic performance, particularly among non-experts.
Study design: Development and validation of a deep learning-based AI algorithm (Medical Insight™-Brain Hemorrhage, SK Inc. C&C) for automatic AIH detection on brain CT, with external validation and a retrospective, multi-reader, randomized, crossover study. IRB approvals were obtained at Seoul National University Hospital (IR-2007-061-1140) and Ajou University Medical Center (AJIRB-DE23-20-379); the requirement for informed consent was waived.
Development dataset: 104,466 slices from 3010 patients (2066 with AIH; 1000 without AIH) from two institutions (A: Apr 2009–Dec 2015; B: Apr 2004–Apr 2020). Slice thickness predominantly 5 mm (87.4%). AIH group included isolated AIH and AIH related to other pathologies (e.g., hemorrhagic transformation).
External validation dataset: 1,855,465 slices from 49,841 patients (6442 AIH; 44,399 normal) collected at six Korean institutions in 2020 via AI Hub (https://hub.kr/aidata/3410). Among AIH patients, recorded subtypes included subarachnoid, subdural, epidural, intraventricular, and intraparenchymal hemorrhage (subtypes could overlap within a patient); slice-level AIH counts were likewise recorded by subtype.
Reader study dataset: 12,663 slices from 296 complete CT studies (146 AIH; 150 normal) from Institutions A (Jan 2016–Dec 2019) and B (Apr 2004–Apr 2020). Power calculations targeted 148 CTs per arm allowing 15% dropout. Gold standard established by consensus of three neuroradiologists (≥11 years radiology, ≥7 years neuroradiology); two initial independent reads with adjudication; weighted kappa 0.9865 for patient-wise agreement. Final AIH distribution at patient level: 146 AIH with overlapping subtypes; slice-level: 2508 AIH slices with subtype breakdown.
AI algorithm: Combines (1) supervised hemorrhage detection and (2) unsupervised anomaly detection. Hemorrhage detection uses a CNN to extract features and initial AIH scores per slice, followed by a double-layer RNN sequence module to incorporate inter-slice context and produce refined slice-wise and aggregated patient-wise AIH scores, addressing 3D context limitations of 2D CNNs. Anomaly detection uses a generative model based on a variational autoencoder (VAE) and a generative adversarial network (GAN) trained on normal CT slices to reconstruct normal images; differences between input and reconstruction yield anomaly maps highlighting potential hemorrhage regions. Outputs include patient-wise and slice-wise probability scores and heatmaps, integrated into PACS; mean additional time to access AI-assisted images was 97.4 seconds.
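The two-branch pipeline described above can be sketched in miniature. This is an illustrative stand-in, not the authors' implementation: `refine_slice_scores` mimics the role of the double-layer RNN by carrying a hidden state across the slice sequence, `patient_score` assumes max-aggregation, and `anomaly_map` stands in for the VAE/GAN branch by differencing an input slice against its reconstruction; all weights and the squashing function are assumptions.

```python
import numpy as np

def refine_slice_scores(raw_scores, w_in=0.7, w_rec=0.3):
    """Toy sequence module: blend each slice's CNN score with a running
    hidden state so inter-slice (3D) context shapes the refined score.
    (Illustrative stand-in for the paper's double-layer RNN.)"""
    h, refined = 0.0, []
    for s in raw_scores:
        h = w_in * s + w_rec * h                              # carry context forward
        refined.append(1.0 / (1.0 + np.exp(-4.0 * (h - 0.5))))  # map to (0, 1)
    return np.array(refined)

def patient_score(refined_scores):
    """Aggregate slice-wise probabilities into one patient-wise score
    (max-pooling is an assumed aggregation rule, not from the paper)."""
    return float(np.max(refined_scores))

def anomaly_map(ct_slice, reconstruction):
    """Anomaly-detection branch: a generative model trained only on normal
    CTs reconstructs the input; large absolute differences flag candidate
    hemorrhage regions for the heatmap."""
    return np.abs(ct_slice - reconstruction)

# Hypothetical per-slice CNN outputs for one CT stack
raw = np.array([0.05, 0.10, 0.85, 0.90, 0.20])
refined = refine_slice_scores(raw)
print(patient_score(refined))
```

Carrying state across slices lets an equivocal slice borrow evidence from its neighbors, which is the intuition behind replacing purely 2D slice-wise scoring with a CNN-RNN sequence model.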
Reader study design: Nine reviewers (three non-radiologist physicians with 5–7 years’ experience; three board-certified radiologists with 5–7 years; three neuroradiologists with 7–11 years) participated from multiple sites. Randomized, crossover, pivotal design with two image sets (A and B): in session 1, A = original + AI-assisted images; B = original only. After 4–5 week washout, session 2 crossed over assistance. Reviewers were blinded to gold standard and AIH prevalence; assessments were patient-wise and slice-wise using a PACS viewer.
Performance evaluation: Standalone AI performance on the external validation set was computed patient- and slice-wise: accuracy, sensitivity, specificity, PPV, NPV, F1, and AUROC. Reader study outcomes compared AI-assisted vs unassisted sensitivity, specificity, and accuracy using chi-square tests; logistic regression with generalized estimating equations (GEE) estimated effect sizes and 95% CIs, and inter-observer agreement was analyzed via intra-class correlation. For external validation, correctness was judged against dataset labels at a 50% probability cutoff; AUROC itself is threshold-free, with the 50% cutoff applied only to the accompanying point estimates against the gold standard.
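As a sketch of the chi-square comparison used in the reader study, the snippet below tests assisted vs unassisted patient-wise accuracy. The 2x2 counts are hypothetical, back-computed from the reported overall accuracies (97.03% assisted vs 94.71% unassisted) over 296 patients x 9 readers = 2664 reads per arm.

```python
# Hedged sketch of the chi-square comparison; the counts below are
# hypothetical, reconstructed from the reported overall patient-wise
# accuracies (97.03% assisted vs 94.71% unassisted, 2664 reads per arm).
from scipy.stats import chi2_contingency

table = [
    [2585, 79],   # AI-assisted reads:  correct, incorrect (hypothetical)
    [2523, 141],  # unassisted reads:   correct, incorrect (hypothetical)
]
chi2, p = chi2_contingency(table)[:2]
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A plain chi-square test treats all 2664 reads as independent; the paper's GEE-based logistic regression additionally accounts for correlation among reads from the same reader and case.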
External validation (standalone AI):
- Patient-wise (N = 49,841): Accuracy 0.977; Sensitivity (Recall) 0.944; Precision (PPV) 0.894; F1 0.913; Specificity 0.982; AUROC 0.992; NPV 0.992 (Table 1).
- Slice-wise (N = 1,855,465): Accuracy 0.985; Sensitivity 0.832; Precision 0.810; Specificity 0.977; AUROC 0.991 (Table 1). The paper's narrative separately reports a slice-wise AUROC of 0.977 with sensitivity/specificity of 79.0%/99.3%, figures that differ from the Table 1 values.
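All of the Table 1 metrics derive from a 2x2 confusion matrix plus the ranking of probability scores. A minimal helper (mine, not the paper's code; counts below are hypothetical) makes the definitions explicit:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard detection metrics from confusion-matrix counts
    (TP = AIH correctly flagged, TN = normal correctly cleared)."""
    sens = tp / (tp + fn)            # sensitivity / recall
    spec = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)             # precision / positive predictive value
    npv = tn / (tn + fn)             # negative predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sens / (ppv + sens)
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "ppv": ppv, "npv": npv, "f1": f1}

def auroc(pos_scores, neg_scores):
    """Threshold-free AUROC via the Mann-Whitney statistic: the probability
    that a randomly chosen AIH case outscores a randomly chosen normal case
    (O(n*m) pairwise comparison; use a rank-based method at scale)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical counts and scores for illustration
print(confusion_metrics(tp=900, fp=100, tn=8800, fn=200))
print(auroc([0.9, 0.8, 0.6], [0.1, 0.2, 0.7]))
```

At a 50% probability cutoff the confusion counts determine the quoted point estimates, while AUROC depends only on the ordering of the scores, which is why it needs no threshold.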
Reader study (AI-assisted vs unassisted): Patient-wise (N = 296 patients)
- Overall accuracy improved from 94.71% (95% CI 93.79–95.53) to 97.03% (96.32–97.65); difference +2.33% (1.26–3.19), p < 0.0001.
- Overall sensitivity improved from 94.73% (93.12–95.61) to 97.18% (96.29–98.08); difference +2.82% (1.28–4.35), p = 0.0003.
- Overall specificity improved from 95.04% (93.74–96.13) to 96.89% (95.82–97.75); difference +1.85% (0.37–3.34), p = 0.0145.
- Subgroups (accuracy): Non-radiology physicians: 91.88% → 95.03% (+3.15%, p = 0.0072); Board-certified radiologists: 94.59% → 97.41% (+2.82%, p = 0.0025); Neuroradiologists: 97.64% → 98.65% (+1.01%, p = 0.1138).
Reader study: Slice-wise (N = 12,663 slices)
- Overall accuracy: 95.25% → 95.31%; reported difference +0.59 (0.42–0.76), p < 0.0001 (the reported difference does not match the quoted point estimates).
- Overall sensitivity: 82.99% → 84.69%; difference +1.70 (1.02–2.38), p < 0.0001.
- Overall specificity: 98.24% → 98.55%; difference +0.32 (0.20–0.43), p < 0.0001.
- Subgroup highlights: Accuracy improved significantly for non-radiology physicians (+0.87, p < 0.0001) and board-certified radiologists (+0.75, p < 0.0001); neuroradiologists showed a non-significant change (+0.15, p = 0.2345). Sensitivity gains significant for board-certified radiologists (+3.62, p < 0.0001) and neuroradiologists (+1.14, p = 0.0264); non-radiology physicians change not significant (+0.33, p = 0.6324). Specificity increased modestly overall; subgroup specificity changes were mixed and mostly non-significant.
GEE analyses: AI assistance significantly increased accuracy in both patient-wise (difference 0.66, p = 0.007) and slice-wise analyses (difference 0.18, p < 0.0001). Sensitivity increased significantly patient-wise (difference 1.35, p = 0.017) but not slice-wise (p = 0.3237). Specificity gains were significant slice-wise (difference 0.41, p < 0.001) but not patient-wise (p = 0.3706).
Overall, AI assistance yielded the greatest performance improvements among non-radiologist physicians, with significant benefits also observed for board-certified radiologists; neuroradiologists showed smaller, often non-significant gains.
The findings support the hypothesis that a hybrid AI approach—combining supervised hemorrhage detection with unsupervised anomaly detection and leveraging a CNN-RNN architecture for 3D contextual modeling—can achieve high standalone performance for AIH detection and meaningfully assist clinicians. Standalone AUROC values approached expert performance, and AI assistance improved reader accuracy and sensitivity, particularly among non-radiologist physicians, thereby potentially reducing false negatives and expediting care in time-critical AIH. Improvements in specificity were modest and not consistently significant across subgroups, suggesting the tool is best positioned as decision support rather than a replacement for expert judgment. The CNN-RNN sequence modeling likely enhanced patient-wise assessments by integrating slice context, while anomaly detection mitigated limitations of supervised labeling and inter-observer variability. These results align with and extend prior literature showing AI can match or approach expert performance and augment clinician detection of critical CT findings. However, variable slice-wise sensitivity gains and subgroup differences indicate areas where algorithm optimization and workflow integration could further enhance performance.
The study presents a deep learning-based AI algorithm for AIH detection on brain CT that integrates supervised hemorrhage detection (CNN-RNN) with unsupervised anomaly detection (VAE/GAN). The algorithm demonstrated strong standalone performance on a large, diverse external dataset and improved clinician diagnostic performance in a randomized multi-reader study, most notably among non-radiologist physicians. Future work should evaluate clinical utility in real-world settings, including effects on time-to-diagnosis, morbidity/mortality, and workflow, assess AIH evolution metrics (e.g., hemorrhage volume and expansion) via follow-up imaging, and validate performance using richer clinical context. Prospective studies and broader, multi-center deployments are warranted to confirm generalizability and impact.
- Reader study environment did not replicate routine clinical practice; limited or no use of patient clinical information may affect generalizability.
- Gold standard determination is challenging for subtle AIH; reliance on expert consensus may introduce bias. Although inter-rater agreement was very high, some cases were reclassified during consensus.
- Class imbalance and retrospective design introduce potential selection bias.
- Slice-wise standalone sensitivity was lower than patient-wise, contributing to smaller or inconsistent slice-wise sensitivity improvements in readers.
- Specificity improvements were modest and often non-significant across subgroups, indicating limited impact on reducing false positives.
- External validation leveraged a national dataset labeled by participating radiologists; potential heterogeneity in labeling and scanners persists despite large scale.