Enhancing the diagnosis of functionally relevant coronary artery disease with machine learning

C. Bock, J. E. Walter, et al.

This paper reveals how machine learning (ML) can enhance the diagnosis of functionally relevant coronary artery disease (fCAD), outperforming cardiologists and potentially reducing unnecessary imaging procedures. The innovative approaches presented by Christian Bock, Joan Elias Walter, Bastian Rieck, and colleagues could shape the future of cardiac healthcare.

Introduction

Coronary artery disease (CAD) is a leading cause of death worldwide, and early risk stratification to detect functionally relevant CAD (fCAD) is crucial to prevent adverse events such as premature death or nonfatal acute myocardial infarction. Current screening methods either have limited diagnostic accuracy (e.g., stress ECG alone) or are costly and expose patients to radiation (e.g., myocardial perfusion imaging, coronary CT angiography). Clinical guidelines discourage sole use of stress ECG due to high false positive/negative rates, yet it remains widely used. The study aims to develop and validate two machine learning models to predict stress-induced fCAD: (1) an ensemble model using basic clinical variables and (2) a deep learning model leveraging ECG signals from exercise stress testing plus clinical data. The hypothesis is that these ML models can outperform cardiologists’ post-test probability assessments and reduce unnecessary imaging while maintaining safety at clinically relevant rule-out thresholds.

Literature Review

Prior automated ECG-based approaches focused on morphological features (e.g., ST-segment changes) requiring ECG delineation, which may be inaccurate in abnormal beats. Deep learning has achieved cardiologist-level performance in arrhythmia detection and has been explored for stress testing, but prior work often used many variables impeding transferability, relied on summary statistics or less precise outcome definitions, lacked comprehensive subcohort analyses, and lacked external validation. Recent cardiology guidelines discourage stress ECG alone due to low accuracy. Conventional ML with static clinical variables can perform comparably to deep learning in some healthcare tasks. This study is the first to investigate collaborative ML combining cardiologists’ judgement with ML/DL for predicting abnormal myocardial perfusion, addressing gaps including external validation, subcohort performance, and interpretability.

Methodology

Study design and cohorts: Stress-test ECG data were collected from 3522 consecutive adults undergoing standard rest/stress myocardial perfusion SPECT (MPI-SPECT/CT) at a tertiary hospital (BASEL VIII, NCT01838148). Patients had symptoms suggestive of inducible myocardial ischemia; those unable to reach the target heart rate underwent pharmacologic stress (adenosine/dobutamine). Cardiologists recorded pre- and post-test visual analogue scale (VAS) probabilities (0–100%) for fCAD based on all available pre-imaging information. An expert team centrally adjudicated fCAD using MPI perfusion scans (summed stress/rest/difference scores; ischaemia thresholds per guideline), refined with coronary angiography and fractional flow reserve (FFR) when available; 701 patients (20%) had angiography within 3 months, of whom 30 were reclassified to fCAD and 74 to non-ischaemic. Adjudication reflected clinical practice and was not blinded to stress ECG or demographics.
Data splits: The dataset was split temporally into a development set (75%; Jan 2010–Dec 2014; n=2648) and a held-out test set (25%; Dec 2014–May 2016; n=874). The development set was further divided into five stratified splits, each with training, validation, and calibration subsets (approximately 36977, 9254, and 5129 2-6-2 sequences from 1882, 471, and 260 patients, respectively; the 2-6-2 format is defined under ECG acquisition and preprocessing). Bootstrapping (25 draws, each sampling 80% of the pooled predictions) provided uncertainty estimates and enabled statistical testing.
External validation: 916 consecutive treadmill stress-test patients from two Israeli medical centers (THEW SUI: E-OTH-12-0927-015) were used. This cohort differed from the internal one in stress modality (treadmill), age (younger; mean 55 vs 68 years), and ischaemia prevalence (lower, at 7.5%). High-resolution 12-lead ECG was recorded at 1000 Hz with 16-bit resolution. Exclusions included pacemaker, atrial fibrillation (AF) at testing, or QRS ≥120 ms. Stress phases were inferred via heart-rate maxima to construct 2-6-2 sequences. Patients missing required variables or labels were excluded.
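The bootstrap step described under Data splits (25 draws, each using 80% of the pooled predictions) can be illustrated with a minimal sketch. The function names `auroc` and `bootstrap_auroc` are hypothetical, and drawing each subsample without replacement is an assumption, since the summary does not say how the 80% is sampled.

```python
import random
from statistics import mean, stdev


def auroc(labels, scores):
    """Rank-based AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_draws=25, fraction=0.8, seed=0):
    """Return one AUROC per draw; each draw uses `fraction` of the predictions.

    Assumption: subsamples are drawn WITHOUT replacement.
    """
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, int(round(fraction * n)))
    values = []
    while len(values) < n_draws:
        idx = rng.sample(range(n), k)
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip a draw that happens to contain a single class
        values.append(auroc(ys, [scores[i] for i in idx]))
    return values


# Toy predictions, invented for illustration only.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9]
draws = bootstrap_auroc(toy_labels, toy_scores)
print(f"AUROC {mean(draws):.2f} ± {stdev(draws):.2f} over {len(draws)} draws")
```

The resulting list of per-draw values is the kind of distribution that a "mean ± spread" summary or a distributional test could then be computed from.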
ECG acquisition and preprocessing: Resting vitals and a 12-lead ECG were recorded before exercise. A standard upright bicycle exercise test was performed with medications paused, with pharmacologic stress if indicated. The ECG devices were the Schiller AT-110 and CS-200 Excellence, recording at 500 or 1000 Hz with a bandwidth of 0.05–150 Hz. All 1000 Hz signals were downsampled to 500 Hz. Three preprocessing schemes were explored: (1) none; (2) minimal (high-pass Butterworth filter at 0.5 Hz, order 5, plus moving average); (3) thorough (bandpass 0.05–150 Hz, moving-median subtraction, Savitzky–Golay smoothing, winsorizing). For deep learning, ECG length was reduced from ~500,000 to 5000 points by concatenating 2 s pre-stress + 6 s stress + 2 s recovery (2-6-2 sequences), sampled up to 20 times per patient.
Models: Two ML models were developed, along with a collaborative model that combines them with the cardiologist’s estimate.
Conventional ML (CARPEclin): Eight non-sequential clinical variables (age, weight, sex, height, resting heart rate, systolic and diastolic blood pressure, prior CAD) were used to train decision trees, random forests, logistic regression, and SVMs; the best model was a random forest selected via 5-fold cross-validation.
Deep learning (CARPEECG): A multi-task neural network with a ResNet backbone for the ECG and a 2-layer MLP for the clinical features. The concatenated embedding feeds four heads: the main task (fCAD) and three auxiliary tasks, namely MPSSR (summed rest perfusion score), MPSSS (summed stress score), and stress type (exercise vs pharmacologic). fCAD predictions are averaged across the 2-6-2 sequences of each patient. Model selection involved a grid search over lead sets, preprocessing schemes, and learning rates, followed by a grid search over auxiliary-task weights on the top-performing leads.
Collaborative model (CARPEColl): A logistic regression combining the cardiologist’s post-test VAS score, the CARPEclin score, and the CARPEECG score, trained on the development set to produce a combined risk score.
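The 2-6-2 construction (2 s pre-stress + 6 s stress + 2 s recovery at 500 Hz, i.e. 5000 samples, drawn up to 20 times per patient) and the per-patient averaging of predictions can be sketched as follows. This is an illustration for a single lead stored as a plain list; choosing window positions uniformly at random within each phase is an assumption, and all function names are hypothetical.

```python
import random

FS = 500                            # sampling rate in Hz after downsampling (from the text)
PRE_S, STRESS_S, REC_S = 2, 6, 2    # window lengths in seconds (the "2-6-2" layout)


def sample_262(pre, stress, recovery, rng, fs=FS):
    """Concatenate one window from each phase into a 10 s sequence.

    Assumption: the start of each window is drawn uniformly within its phase.
    """
    def window(phase, seconds):
        length = seconds * fs
        if len(phase) < length:
            raise ValueError("phase is shorter than the requested window")
        start = rng.randrange(len(phase) - length + 1)
        return phase[start:start + length]

    return window(pre, PRE_S) + window(stress, STRESS_S) + window(recovery, REC_S)


def sequences_for_patient(pre, stress, recovery, n_sequences=20, seed=0):
    """Draw several 2-6-2 sequences for one patient (20 is the stated upper limit)."""
    rng = random.Random(seed)
    return [sample_262(pre, stress, recovery, rng) for _ in range(n_sequences)]


def patient_score(sequence_scores):
    """Average the per-sequence fCAD predictions into one patient-level score."""
    if not sequence_scores:
        raise ValueError("no sequence scores given")
    return sum(sequence_scores) / len(sequence_scores)


# Synthetic single-lead signal: 60 s rest, 300 s stress, 120 s recovery.
pre = [0.0] * (60 * FS)
stress = [1.0] * (300 * FS)
recovery = [0.5] * (120 * FS)
seqs = sequences_for_patient(pre, stress, recovery)
print(len(seqs), len(seqs[0]))
```

Sampling several short excerpts per patient keeps the network input small while still exposing it to different parts of each phase; averaging then turns the excerpt-level outputs back into a single patient-level estimate.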
Baselines, evaluation, and statistics: The automated ST-segment depression baseline used QRS delineation (neurokit2), an isoelectric baseline taken before the Q-wave, and the ST amplitude 60 ms after the J-point; differences across phases were aggregated. Performance was assessed with ROC and PR curves, and decision curve analysis evaluated net benefit at clinically relevant rule-out thresholds (5%, 10%, 15%). Calibration was assessed with reliability curves and Brier scores. Statistical tests: one-sided Kolmogorov–Smirnov tests for AUROC/AUPRC distributions (Bonferroni-corrected), Welch’s t-tests for age comparisons and SHAP distributions, and Fisher’s exact test for odds by CAD history. SHAP was used for post-hoc interpretability at population and per-sample levels, including ECG-segment attribution analyses.
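Decision curve analysis compares strategies by their net benefit at a given threshold probability. The sketch below uses the standard formulation, net benefit = TP/N − (FP/N) · p_t/(1 − p_t), where patients with predicted risk at or above p_t are sent to imaging. Whether the study computed net benefit in exactly this form is an assumption, and the function names and toy data are hypothetical.

```python
def net_benefit(labels, risks, threshold):
    """Net benefit of imaging every patient whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(labels, risks) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_image_all(labels, threshold):
    """Reference strategy: send every patient to imaging."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy cohort, invented for illustration only.
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_risks = [0.9, 0.2, 0.3, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
for pt in (0.05, 0.10, 0.15):
    model_nb = net_benefit(toy_labels, toy_risks, pt)
    all_nb = net_benefit_image_all(toy_labels, pt)
    print(f"threshold {pt:.2f}: model {model_nb:.3f}, image all {all_nb:.3f}")
```

A strategy is only worth using at a given threshold if its net benefit exceeds that of imaging everyone, which is the comparison the rule-out thresholds of 5%, 10%, and 15% are meant to support.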

Key Findings

On the held-out test set (fCAD prevalence 28%), CARPEECG and CARPEclin achieved mean AUROCs of 0.71 and 0.70, respectively, both outperforming the cardiologist (0.64) and the automated ST-depression method (0.58). The difference in AUROC between ML and the cardiologist was statistically significant (the abstract reports p = 4.0E-13 for 0.71 vs 0.64).
Decision curve analysis showed higher net benefit for both ML models than the cardiologist’s assessment across rule-out thresholds (5–15%). At the 15% threshold, relying on the cardiologist was worse than imaging all patients, whereas ML provided a positive net benefit.
Imaging reduction potential at rule-out thresholds:
• At 15%: CARPEECG could reduce MPI by 15.3% (95% CI 5.4–25.3) with sensitivity 0.89 ± 0.02 and NPV 0.90 ± 0.01; CARPEColl could reduce by 17.3% (7.4–27.1) with sensitivity 0.89 ± 0.02 and NPV 0.91 ± 0.01.
• At 10% (all patients): CARPEECG sensitivity 0.94 ± 0.01, NPV 0.92 ± 0.02, avoided imaging 20.6% (5.4–35.7); CARPEColl avoided 24.6% (11.1–38.1) with sensitivity 0.96 ± 0.01, NPV 0.94 ± 0.02.
• At 5% (all patients): CARPEECG sensitivity 0.98 ± 0.01, NPV 0.96 ± 0.02, avoided imaging 12.8% (0.4–25.1); CARPEColl avoided 11.7% (0.1–23.4).
Subcohorts: Deep learning particularly strong in younger patients (e.g., AUROC 0.78 ± 0.04 for <65 years; 0.79 ± 0.04 in younger patients who completed exercise testing). Gains over cardiologist were largest in younger cohorts (up to +0.19 AUROC, +0.15 AUPRC). In males, CARPEclin underperformed relative to cardiologist, while CARPEECG outperformed. In ≥65 years, combining predictions (CARPEColl) significantly improved over individual models.
External validation (prevalence 7.5%, treadmill modality): CARPEECG outperformed CARPEclin (AUROC 0.80 ± 0.01 vs 0.75 ± 0.004; AUPRC 0.28 ± 0.02 vs 0.19 ± 0.01). Performance differences across age strata highlighted conventional model’s overreliance on age (threshold-like behavior around 70 years) with larger deficits in younger groups (e.g., 26–49 years).
Interpretability: SHAP analyses showed conventional ML relied heavily on CAD history, sex, and an age cutoff (~70 years), reducing generalizability. DL model exhibited more balanced use of features and ECG segments. ECG attributions highlighted ST-segment during stress as strongly contributing to higher fCAD risk, corroborating clinical knowledge.
Calibration: On the held-out test set, both ML methods had significantly lower Brier scores than the cardiologist (cardiologist 0.23 ± 0.009; random forest 0.22 ± 0.004, p = 3.59E-5; CARPEECG 0.18 ± 0.006, p = 7.97E-16).
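The rule-out figures reported above (sensitivity, NPV, share of imaging avoided) and the Brier score used for calibration follow directly from predicted risks and adjudicated labels. The sketch below shows how they might be computed, assuming that patients with predicted risk below the threshold are ruled out and skip imaging; the function names and toy numbers are hypothetical and do not reproduce the study's data.

```python
def rule_out_summary(labels, risks, threshold):
    """Sensitivity, NPV, and fraction of imaging avoided at a rule-out threshold.

    Assumption: risk < threshold means ruled out (no imaging); everyone else is imaged.
    """
    ruled_out = [y for y, r in zip(labels, risks) if r < threshold]
    imaged = [y for y, r in zip(labels, risks) if r >= threshold]
    tp = sum(imaged)             # fCAD patients correctly sent to imaging
    fn = sum(ruled_out)          # fCAD patients missed by the rule-out
    tn = len(ruled_out) - fn     # patients without fCAD who are spared imaging
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "npv": tn / len(ruled_out) if ruled_out else float("nan"),
        "avoided": len(ruled_out) / len(labels),
    }


def brier_score(labels, risks):
    """Mean squared difference between predicted risk and outcome (lower is better)."""
    return sum((r - y) ** 2 for y, r in zip(labels, risks)) / len(labels)


# Toy cohort, invented for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_risks = [0.8, 0.4, 0.1, 0.05, 0.08, 0.3, 0.2, 0.12, 0.6, 0.02]
print(rule_out_summary(toy_labels, toy_risks, 0.15))
print(round(brier_score(toy_labels, toy_risks), 3))
```

Lower thresholds trade avoided imaging for fewer missed cases, which is why sensitivity and NPV are reported alongside the imaging reduction at each threshold.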

Discussion

The study demonstrates that ML models, particularly a deep learning approach integrating raw stress-test ECG and basic clinical features, can surpass cardiologists’ post-test probability estimates in predicting stress-induced fCAD. This improved discrimination, together with favorable decision curve analyses, translates into potential reductions in unnecessary myocardial perfusion imaging at clinically endorsed rule-out thresholds without sacrificing safety (high sensitivity and NPV). The deep learning model’s consistent superiority across external data and younger subgroups likely stems from multi-task learning regularization and leveraging ECG time-series patterns beyond static clinical factors. SHAP-based interpretability aligns model behavior with established pathophysiology (e.g., stress-phase ST-segment depression), while revealing that conventional ML can overemphasize age and sex and learn threshold-like rules that undermine generalization. Importantly, combining ML predictions with cardiologists’ judgements via logistic regression further improved performance in several cohorts, suggesting complementary strengths: ML counters cognitive biases and inconsistency in calibration, whereas clinical expertise mitigates algorithmic biases and distribution shifts. These findings support ML-augmented decision-making to optimize downstream testing and resource use, but underscore the need to evaluate real-world clinical utility, recalibration across settings, and the impact of interpretability tools on clinician decisions.
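The collaborative combination discussed above is a logistic regression over three inputs: the cardiologist's post-test VAS score, the CARPEclin score, and the CARPEECG score. A minimal sketch, fitted by plain gradient descent on invented data, might look as follows. The optimiser, its hyperparameters, the rescaling of VAS to 0–1, and all names are assumptions made for illustration, not details from the study.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def combined_risk(w, b, vas, clin, ecg):
    """Combined risk from the three scores, each assumed to be scaled to [0, 1]."""
    return sigmoid(b + w[0] * vas + w[1] * clin + w[2] * ecg)


# Toy development set: (VAS / 100, CARPEclin score, CARPEECG score), invented.
toy_rows = [
    (0.1, 0.2, 0.1), (0.3, 0.1, 0.2), (0.2, 0.3, 0.3), (0.4, 0.2, 0.1),
    (0.6, 0.7, 0.8), (0.7, 0.6, 0.9), (0.5, 0.8, 0.7), (0.9, 0.7, 0.6),
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = fit_logistic(toy_rows, toy_labels)
print(combined_risk(weights, bias, 0.8, 0.8, 0.8))
print(combined_risk(weights, bias, 0.1, 0.1, 0.1))
```

Because the combiner has only three weights and a bias, its fitted coefficients can be read directly as how much each source of evidence contributes to the combined score.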

Conclusion

Two validated ML models—an ensemble model using eight basic clinical variables and a deep learning model using ECG time series plus clinical data—outperformed cardiologists for predicting functionally relevant CAD after stress testing. At a 15% rule-out threshold, the models could reduce unnecessary MPI by approximately 15–17% while maintaining high sensitivity and NPV. External validation across institutions and modalities confirmed generalization, with deep learning particularly robust in younger cohorts and under distributional shifts. Post-hoc interpretability corroborated the clinical relevance of ST-segment depression and exposed limitations of conventional models’ reliance on age and sex. A collaborative approach blending ML predictions with clinician judgement further improved performance, indicating a practical path toward clinical adoption. Future work should include prospective, multi-center studies to establish clinical utility, careful model recalibration for new environments, assessment of interpretability dashboards on clinician behavior, exploration of alternative architectures (e.g., attention-based networks) and ensemble methods (e.g., gradient-boosted trees), and continuous monitoring of outcomes post-deployment.

Limitations

Potential misclassification: Despite stringent adjudication, some fCAD labels may be incorrect; adjudication was not blinded to clinical and stress ECG data, possibly overestimating these features’ influence.
Cohort and generalizability: Single tertiary-center, symptomatic patient cohort; women and patients of African/Asian descent were underrepresented; relatively few very young patients (<50 years), limiting applicability to these groups.
Bias and calibration: Cardiologists’ VAS scores and conventional ML exhibited calibration issues and feature overreliance (e.g., age, sex). Distribution shifts affect utility; models require recalibration before use in new settings.
Integration method: Logistic regression combining clinician and ML scores increases accuracy in retrospective analysis, but real-world clinicians are unaware of their score’s influence on combined models; clinical benefit and workflow impact remain unproven.
Modality differences: Internal data used bicycle ergometry; external treadmill data differ in noise characteristics and demographics, necessitating robust preprocessing and validation.
Data availability: Not all raw data are publicly available due to patient privacy; external dataset access requires agreements; missing data led to some exclusions in external validation.
Methodological scope: Other architectures (e.g., attention-based) and ML methods (e.g., gradient-boosted trees) were not explored; prospective trials are needed to establish clinical utility and safety.
