Screening and diagnosis of cardiovascular disease using artificial intelligence-enabled cardiac magnetic resonance imaging
Y.-R. (Joyce) Wang, K. Yang, et al.
Yan-Ran (Joyce) Wang and colleagues present an automated system for interpreting cardiac magnetic resonance (CMR) imaging that reports performance comparable to experienced physicians in diagnosing cardiovascular diseases. The study aims to make CMR interpretation more efficient and, in turn, to improve patient outcomes.
Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, with approximately 17.9 million deaths annually and a disproportionate burden in low- and middle-income countries. Cardiac magnetic resonance (CMR) imaging provides comprehensive assessment of cardiac morphology, function, perfusion, and tissue characterization and is considered the gold standard for cardiac function assessment and CVD diagnosis. Despite its clinical value, CMR adoption is limited by time-intensive interpretation, the substantial training required, and a shortage of qualified experts. Automated CMR interpretation could therefore support more timely and accurate diagnosis. Deep learning can learn spatiotemporal features from raw cardiac cine videos and images without hand-crafted features, analyze all frames consistently, and potentially outperform human readers in integrating dynamic information. Prior CMR AI work has mostly targeted single tasks (for example, segmentation or specific lesion detection) rather than comprehensive multi-disease diagnosis. The aim of this study was to develop and validate an end-to-end, two-stage deep learning pipeline that mimics clinical workflow: (1) noninvasive anomaly screening from cine MRI, followed by (2) multi-class diagnosis of 11 CVDs using combined cine and late gadolinium enhancement (LGE) MRI. The authors selected a video-based Swin Transformer backbone to better model temporal and spatial dependencies in CMR sequences compared with conventional CNNs.
Literature Review
Previous deep learning studies in CMR have primarily focused on narrow tasks such as ventricular segmentation, wall thickness measurement, or the detection of limited pathologies (for example, myocardial scarring, chronic myocardial infarction, or aortic valve malformations). Comprehensive evaluation of end-to-end models for broad CVD screening and diagnosis from multimodal CMR (cine and LGE) has been lacking. Transformers have recently shown strong performance in video recognition tasks, suggesting potential advantages over CNNs for modeling cardiac motion and 3D sequences. The study addresses this gap by building a unified pipeline that performs anomaly screening and multi-class diagnosis across 11 common CVD categories using cine and LGE modalities.
Methodology
Study design and data: A nationwide, retrospective multi-center CMR dataset of 9,719 individuals from eight medical centers in China was curated. The dataset included a disease cohort (8,066 patients with 11 CVDs) and a normal control cohort (1,653 subjects). The primary development dataset came from Beijing Fuwai Hospital; external test sets were from seven other centers. CMR was acquired on GE (n=4,569), Philips (n=3,683), and Siemens (n=1,467) scanners. Cine sequences were 25 frames per cardiac cycle in short-axis (SAX) and four-chamber (4CH) views; SAX LGE covered LV base-to-apex. Threefold cross-validation was performed within the primary dataset for both stages. Overall, screening was tested on 9,719 subjects (internal and external), and diagnostics on 8,066 patients.
Two-stage AI pipeline: Stage 1 screening model performs binary classification (anomaly vs normal) using cine MRI (SAX and 4CH). Stage 2 diagnostic model classifies among 11 CVDs using combined inputs: SAX cine, 4CH cine, and SAX LGE. The backbone is a Video Swin Transformer (VST) with four stages using 3D windowed and shifted-window self-attention without temporal downsampling.
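The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the authors' code: `screen_model` and `diagnose_model` are hypothetical callables standing in for the trained networks, the class order in `CVD_CLASSES` is arbitrary, and the sketch assumes that stage 2 runs only when stage 1 flags an anomaly at the 0.5 threshold reported in the evaluation.

```python
from typing import Callable, List, Optional, Tuple

# The 11 diagnostic classes named in the study (order here is arbitrary).
CVD_CLASSES = ["CAD", "HCM", "DCM", "LVNC", "RCM", "CAM", "HHD",
               "myocarditis", "ARVC", "PAH", "Ebstein"]


def two_stage_predict(
    cine,
    lge,
    screen_model: Callable[[object], float],               # hypothetical: returns P(anomaly)
    diagnose_model: Callable[[object, object], List[float]],  # hypothetical: 11 class probabilities
    threshold: float = 0.5,
) -> Tuple[float, Optional[str]]:
    """Return (anomaly probability, diagnosis or None if screened as normal)."""
    p_anomaly = screen_model(cine)            # stage 1: cine only
    if p_anomaly < threshold:                 # assumption: >= threshold counts as anomaly
        return p_anomaly, None
    probs = diagnose_model(cine, lge)         # stage 2: cine + LGE
    if len(probs) != len(CVD_CLASSES):
        raise ValueError("expected one probability per CVD class")
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return p_anomaly, CVD_CLASSES[best]
```

With stub models, a subject scored 0.1 by the screener is returned as normal without invoking the diagnostic model, while a subject scored 0.8 is passed on and assigned the class with the highest probability.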
Preprocessing: All images were resampled to a common spatial resolution (0.994×0.994 mm). For SAX cine, three representative slices (mid, +2, −2 from mid) were used; each cine clip was subsampled to 13 frames (stride 2) from 25 frames. 4CH cine used the mid slice. SAX LGE was resampled along z to nine slices (common number across subjects). Heart ROI extraction for SAX cine, 4CH cine, and SAX LGE used nnU-Net-based detectors trained with semi-automatic labels (standard deviation maps + Otsu thresholding; manual correction when necessary). ROIs were cropped, padded to preserve aspect ratio, resized to 224×224, intensity-clipped at top/bottom 0.1%, scaled to 1–255, and normalized (zero mean, unit variance).
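Two of the preprocessing steps, temporal subsampling and intensity normalization, can be illustrated with a short sketch. It operates on a flat list of pixel values rather than real image arrays, and the percentile interpolation rule and function names are assumptions made for illustration; the source does not specify them.

```python
from statistics import fmean, pstdev


def subsample_frames(n_frames: int = 25, stride: int = 2) -> list:
    """Indices kept when taking every `stride`-th frame of a cine clip."""
    return list(range(0, n_frames, stride))


def percentile(sorted_vals: list, q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def normalize_intensities(pixels: list, clip_pct: float = 0.1,
                          out_min: float = 1.0, out_max: float = 255.0) -> list:
    """Clip at the top/bottom `clip_pct` percent, rescale, then z-normalize."""
    s = sorted(pixels)
    lo, hi = percentile(s, clip_pct), percentile(s, 100 - clip_pct)
    clipped = [min(max(p, lo), hi) for p in pixels]
    span = (hi - lo) or 1.0                      # guard against constant images
    scaled = [out_min + (p - lo) / span * (out_max - out_min) for p in clipped]
    mean, std = fmean(scaled), pstdev(scaled) or 1.0
    return [(v - mean) / std for v in scaled]    # zero mean, unit variance
```

Calling `subsample_frames()` yields the 13 even-numbered indices 0 through 24, which matches the 25-to-13 frame reduction described above.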
Model architectures and training: Cine inputs were tokenized as 3D patches into 128-d features; VST (base) used head numbers 4/8/16/32 across stages. For multimodal fusion, each branch (SAX cine, 4CH cine, SAX LGE) produced a 1,024-d feature (global average pooling); features were concatenated and fed to a fully connected layer and softmax. Branches were pretrained (ImageNet and Kinetics-600) and then frozen while training fusion layers (transfer learning). Optimization used AdamW with cosine decay and linear warmup (2.5 epochs), batch size 32, stochastic depth 0.2, weight decay 0.05; initial learning rates 1e-4 for backbone and 1e-3 for head (backbone LR multiplied by 0.1 relative to head); class-balanced sampling; 150 epochs for single-modality branches, followed by 20 epochs for fusion finetuning. Data augmentation: rotations (SAX/4CH ±45°/±20°), color jitter, brightness perturbation (±0.1), and for LGE: rotations and z-flips. Training used four NVIDIA RTX 3090 GPUs (~77 h). Inference time per subject ~0.233 s.
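The learning-rate schedule described above (linear warmup for 2.5 epochs, then cosine decay) can be sketched in a few lines. The source does not state the starting or final learning rate, so this sketch assumes the rate ramps up from zero and decays to zero; `lr_at` is a hypothetical helper, not part of any library.

```python
import math


def lr_at(epoch: float, base_lr: float,
          warmup_epochs: float = 2.5, total_epochs: float = 150.0) -> float:
    """Learning rate at a (fractional) epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumption: warmup ramps linearly from 0 to base_lr.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    # Assumption: cosine decay ends at 0 at the final epoch.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))


HEAD_LR = 1e-3                 # initial rate for the classification head
BACKBONE_LR = 0.1 * HEAD_LR    # backbone uses a 0.1 multiplier, i.e. 1e-4
```

Under these assumptions both parameter groups follow the same curve shape, with the backbone rate staying at one tenth of the head rate throughout training.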
Comparators: A CNN-LSTM baseline with DenseNet-40-12 encoder and LSTM temporal fusion was trained with SGD (lr 0.001, momentum 0.9, weight decay 0.001), batch sizes 4/1 (train/test), and 64×64 inputs due to memory limits.
Evaluation: Metrics included AUROC with 95% CIs, frequency-weighted mean AUROC and F1, sensitivity, specificity, PPV, and NPV. The screening decision threshold was 0.5; the diagnostic prediction was the class with the highest probability (argmax). Models were evaluated by cross-validation on the primary dataset and by external validation across the seven other centers. A gold-standard test set (n=500) covering all 11 CVDs was used for head-to-head comparison with physicians grouped by experience (3–5, 5–10, and >10 years). Interpretability used guided Grad-CAM; modality contributions were assessed via Shapley values. An independent, consecutive real-world test set of 1,000 admissions in 2023 at Beijing Fuwai Hospital was used for additional validation (screening n=961; diagnostics n=532 with complete cine+LGE and in-scope classes).
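As an illustration of the frequency-weighted F1 metric, the sketch below computes a per-class F1 and averages it with weights proportional to each class's share of the true labels. The function names and the toy labels are illustrative assumptions; the study's own evaluation code is not shown in the source.

```python
from collections import Counter


def per_class_f1(y_true: list, y_pred: list, label: str) -> float:
    """One-vs-rest F1 for a single class: 2TP / (2TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def frequency_weighted_f1(y_true: list, y_pred: list) -> float:
    """Average of per-class F1, weighted by class frequency in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(support[c] / n * per_class_f1(y_true, y_pred, c) for c in support)


# Toy example (made-up labels, not study data).
truth = ["HCM", "HCM", "HCM", "DCM"]
preds = ["HCM", "HCM", "DCM", "DCM"]
score = frequency_weighted_f1(truth, preds)
```

In the toy example, HCM has F1 of 0.8 and DCM has F1 of 2/3, so the weighted score is 0.75 × 0.8 + 0.25 × 2/3, or about 0.767. Weighting by frequency means common classes dominate the average, which is worth keeping in mind when rarer classes score lower.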
Clinical labels: Diagnoses followed guideline-based criteria for CAD/ischemic cardiomyopathy, HCM, DCM, LVNC, RCM, CAM, HHD, myocarditis (Lake Louise criteria or biopsy), ARVC (revised Task Force Criteria), PAH (RHC: mPAP ≥25 mmHg, PCWP <15 mmHg, PVR >3 WU), and Ebstein’s anomaly. Normal controls were volunteers without CVD per history, exam, ECG, and echocardiography. IRB approvals were obtained at all centers; informed consent was waived; data were de-identified.
Key Findings
- Screening performance (cine SAX+4CH): Internal AUROC 0.986 (95% CI 0.984–0.988), F1 0.977 (95% CI 0.974–0.979); sensitivity 0.973 (95% CI 0.968–0.978) at 90% specificity. External AUROC 0.990 (95% CI 0.986–0.992), F1 0.970 (95% CI 0.964–0.977); sensitivity 0.959 (95% CI 0.936–0.974) at 90% specificity; specificity 0.970 (95% CI 0.950–0.990) at 90% sensitivity. Single-view screening: 4CH AUROC up to 0.980 external; SAX up to 0.971 internal.
- Diagnostic performance (cine+LGE): Internal class-weighted AUROC 0.991; F1 0.906. External class-weighted AUROC 0.991; F1 0.884. All classes AUROC >0.96 internally; most classes F1 >0.80 except LVNC, HHD, myocarditis. Exemplars: HCM AUROC 0.998 (95% CI 0.997–0.999), F1 0.975 (95% CI 0.971–0.980); DCM AUROC 0.988 (95% CI 0.986–0.990), F1 0.896 (95% CI 0.884–0.907); CAD AUROC 0.991 (95% CI 0.988–0.994), F1 0.921 (95% CI 0.908–0.935); PAH AUROC 0.998 (95% CI 0.995–1.000), F1 0.962 (95% CI 0.937–0.984).
- Modality contribution: Combining cine and LGE outperformed any single modality with +1.9 percentage points average AUROC and +6.8 percentage points average F1 versus SAX cine alone. All sensitivity/specificity trade-offs exceeded 90% (Extended Data Table 4). External single-modality diagnostic F1: cine 0.831; LGE 0.792.
- Interpretability: Grad-CAM localized salient regions consistent with known pathophysiology—LV-dominant classes (HCM, DCM, CAD, LVNC, RCM, CAM, HHD, myocarditis) highlighted LV; RV-dominant classes (ARVC, PAH, Ebstein’s anomaly) highlighted RV. LGE patterns (e.g., subendocardial/transmural in CAD, diffuse in CAM, subepicardial in myocarditis) were captured; distinctive features like LV apical trabeculation (LVNC) and tricuspid septal leaflet displacement (Ebstein’s) were identified.
- Human comparison (n=500): AI frequency-weighted F1 was 0.931 versus 0.927 for physicians with >10 years of experience (accuracy 0.932 vs 0.928). Total interpretation time was ~1.94 minutes for the AI (about 0.233 s per subject) versus 576 minutes (3–5 years), 329 minutes (5–10 years), and 418 minutes (>10 years) for physicians. The AI outperformed physicians in PAH (F1 0.983 vs 0.931 for >10 years), including CMR-negative cases.
- Model backbone comparison: VST outperformed CNN-LSTM on SAX cine with +3.5 percentage points AUROC and +4.6 percentage points F1 on the primary dataset.
- Independent consecutive 2023 test set: Screening (n=961) AUROC 0.984 (95% CI 0.977–0.990), F1 0.962 (95% CI 0.953–0.972); sensitivity 0.946 at 90% specificity. Diagnostics (n=532) class-weighted AUROC 0.986, F1 0.903; high per-class metrics for HCM, DCM, CAD; CAM F1 0.947 and AUROC 1.0. Myocarditis, LVNC, HHD, RCM had relatively lower F1 but AUROC >0.90.
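Several of the findings above report sensitivity at a fixed 90% specificity. One way to compute such an operating point is sketched below: scan candidate thresholds in ascending order and take the first one whose specificity reaches the target. The function name, the tie-handling convention (a score equal to the threshold counts as positive), and the toy scores are assumptions for illustration only.

```python
def sensitivity_at_specificity(scores: list, labels: list,
                               target_spec: float = 0.90) -> tuple:
    """Return (sensitivity, threshold) at the lowest threshold meeting target_spec.

    labels: 1 = anomaly, 0 = normal. A case is called positive if score >= threshold.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Specificity rises and sensitivity falls as the threshold increases, so the
    # first threshold that meets the target also gives the highest sensitivity.
    for t in sorted(set(scores)) + [float("inf")]:
        spec = sum(s < t for s in neg) / len(neg)
        if spec >= target_spec:
            sens = sum(s >= t for s in pos) / len(pos)
            return sens, t
    raise RuntimeError("unreachable: an infinite threshold gives specificity 1.0")


# Toy example (made-up scores, not study data): 10 normals, 4 anomalies.
toy_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.7,
              0.3, 0.8, 0.9, 0.95]
toy_labels = [0] * 10 + [1] * 4
result = sensitivity_at_specificity(toy_scores, toy_labels)
```

Here the lowest threshold that keeps 9 of 10 normals below it is 0.7, at which 3 of the 4 anomalies are detected, giving a sensitivity of 0.75.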
Discussion
The two-stage, video-based transformer pipeline accurately detects cardiac anomalies from noninvasive cine MRI and classifies 11 CVDs using combined cine and LGE inputs, addressing the bottleneck of expert-intensive CMR interpretation. The strong internal, external, and consecutive-test-set performance indicates robustness across vendors, institutions, and real-world prevalence. The finding that single-view cine (especially 4CH) can achieve high screening AUROC suggests potential simplification of CMR acquisition, improving efficiency, patient throughput, and tolerance. Diagnostic gains from multimodal fusion confirm the complementary clinical value of cine (function and motion) and LGE (fibrosis/viability). The AI matched or slightly exceeded experienced physicians, notably in PAH where it detected CMR-negative cases, indicating sensitivity to subtle or previously under-recognized features and suggesting opportunities for less invasive pathways than right heart catheterization when appropriate. Grad-CAM analyses aligned with known disease anatomy and tissue features, supporting face validity. The superiority of VST over the CNN-LSTM baseline highlights the benefit of spatiotemporal self-attention for medical video analysis, enabling end-to-end learning from raw sequences without manual feature engineering.
Conclusion
This study demonstrates an end-to-end, video-based deep learning framework for automated CMR interpretation that (1) noninvasively screens for cardiac anomalies from cine MRI and (2) accurately classifies 11 common CVDs using combined cine and LGE, achieving cardiologist-level performance and generalizing across multiple centers and vendors. The approach can substantially improve the efficiency and scalability of CMR workflows and may enable less invasive diagnostic strategies in selected conditions (for example, PAH). Future work should focus on prospective clinical trials, integration of additional CMR modalities (T1/T2 mapping, ECV), incorporation of clinical history and risk factors, expanded disease coverage including phenocopies and dual pathologies, improved interpretability, and adaptive human-AI collaboration (for example, deferral mechanisms) to maximize reliability and clinical impact.
Limitations
- Prospective validation and clinical trials are needed; retrospective performance may not fully translate to real-world deployment.
- All participating institutions were in eastern Asia; generalizability across diverse ethnicities and practice settings requires investigation.
- The number of healthy controls was relatively limited, warranting further assessment of screening under real-world prevalence.
- Diagnostic coverage was limited to 11 CVD classes; performance for phenocopies (for example, Fabry disease, metabolic cardiomyopathies) and dual pathologies was not fully assessed.
- Certain diagnoses (for example, myocarditis) had lower F1, likely due to timing of imaging, transient edema, and lack of additional modalities (T2-weighted, parametric mapping).
- Only cine and LGE were used; absence of quantitative T1/T2 mapping and ECV likely constrained diagnostic performance in some entities.
- Interpretability analyses (Grad-CAM) support face validity but are not comprehensive; deeper interpretability is needed.
- The diagnostic model still requires clinician oversight; a deferral strategy may improve reliability in ambiguous cases.