
Opportunistic detection of type 2 diabetes using deep learning from frontal chest radiographs
A. Pyrros, S. M. Borstelmann, et al.
This summary covers a deep learning model, developed by Ayis Pyrros, Stephen M. Borstelmann, and colleagues, that screens for type 2 diabetes from frontal chest radiographs, with electronic health record data used in comparator and combined models. In a prospective test cohort the model reached an ROC AUC of 0.84.
Introduction
Type 2 diabetes mellitus (T2D) affects roughly 10% of the U.S. population and imposes a major economic burden, yet many cases remain undiagnosed. Prediabetes typically precedes T2D by 9–12 years, and current screening relies on fasting blood glucose and/or HbA1c. The U.S. Preventive Services Task Force (USPSTF) recommends opportunistic screening every 3 years for adults aged 35–70 years who are overweight or obese. Even so, undiagnosed diabetes persists, particularly among older adults, people with obesity, racial/ethnic minorities, and populations with limited healthcare access. BMI is an imperfect predictor of T2D: visceral adiposity and the thin-fat phenotype (e.g., in Asian Indians) complicate risk assessment, and clinically accessible measures of fat distribution are limited. Given the ubiquity of chest radiographs (CXRs) in ambulatory care and advances in deep learning (DL) applied to imaging and EHR data, the authors hypothesized that a DL model using frontal CXRs combined with demographic/administrative EHR features could detect prevalent T2D and identify individuals who might benefit from additional screening beyond BMI-based criteria.
Literature Review
Prior work has demonstrated DL’s ability to derive imaging biomarkers predictive of metabolic syndrome and other conditions from abdominal CT and to predict outcomes and comorbidities from CXRs, including future healthcare expenses and health disparities. Visceral fat has been identified as an important independent risk factor for T2D, while BMI is a limited proxy for adiposity distribution. The thin-fat phenotype has been described in certain populations, complicating clinical detection based on BMI alone. Alternative predictive models using extensive laboratory data or whole-body MRI exist but have limited opportunistic clinical utility due to cost and availability, highlighting the need for widely available imaging-based tools such as CXRs.
Methodology
Study design and datasets: The development cohort comprised 271,065 ambulatory frontal chest radiographs (CXRs) from 160,244 unique adult patients (2010–2021) at Duly Health and Care. Prospective evaluation used 9,943 unique patients imaged between January 1 and July 31, 2022. Internal retrospective validation used 5-fold partitioning of the development set, and external validation used 5,026 ambulatory frontal CXRs from Emory University (2019–2020). Portable CXRs and patients with type 1 or gestational diabetes were excluded. For test sets, a 1:1 mapping between CXR and patient was enforced.
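The partitioning described above can be illustrated with a small sketch. It assumes folds are assigned per patient, so that one patient's CXRs never fall in different folds, and that the 1:1 mapping keeps the earliest CXR per patient. Neither rule is stated in the summary, and the record fields `patient_id` and `study_date` are hypothetical.

```python
import random

def assign_patient_folds(patient_ids, k=5, seed=0):
    """Map each unique patient to one of k folds (assumed patient-level grouping)."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    return {pid: i % k for i, pid in enumerate(unique)}

def one_cxr_per_patient(records):
    """Enforce a 1:1 CXR-to-patient mapping by keeping the earliest study.

    Keeping the earliest study is an assumption; the source only says
    that a 1:1 mapping was enforced for test sets.
    """
    earliest = {}
    for rec in records:
        pid = rec["patient_id"]
        # ISO dates compare correctly as strings.
        if pid not in earliest or rec["study_date"] < earliest[pid]["study_date"]:
            earliest[pid] = rec
    return list(earliest.values())

# Hypothetical toy records: patient p1 has two CXRs.
records = [
    {"patient_id": "p1", "study_date": "2015-03-01"},
    {"patient_id": "p1", "study_date": "2018-07-15"},
    {"patient_id": "p2", "study_date": "2012-11-30"},
    {"patient_id": "p3", "study_date": "2020-01-09"},
]

folds = assign_patient_folds([r["patient_id"] for r in records])
kept = one_cxr_per_patient(records)
```

Grouping by patient rather than by image avoids leaking a patient's anatomy from a training fold into a validation fold.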
Labels and features: Imaging labels were derived from ICD-10 HCC codes (2021 model) for six comorbidity classes, including T2D, with binary encoding. Additional continuous targets included age, BMI, HbA1c, and risk adjustment factor (RAF). EHR features extracted for logistic regression (LR) included age, sex, self-reported race/ethnicity, language preference, BMI, and social deprivation index (SDI, computed from zip codes) to assess socioeconomic context.
Model development: A multitask CNN based on ResNet34 (random initialization) was trained in PyTorch on single-channel DICOM images resized to 384×384. Training used 218,758 images; validation 24,529; testing 60,317 (90/10 split within the development period). Optimization used AdamW (initial learning rate 1e-3, halved after each plateau down to a minimum of 1e-5), batch size 128, 23 epochs. Losses: binary cross-entropy for HCC classes; mean squared error for age, HbA1c, BMI, RAF. Data augmentation included random horizontal flips (20%), rotations (±10°), perspective distortion (0.2), and brightness/contrast jitter (0.8–1.2). Images were normalized using the training-set mean 0.55 and SD 0.1885.
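To make the training objective concrete, here is a minimal plain-Python sketch of the pieces named above: input normalization, a binary cross-entropy term for the six HCC classes, a mean squared error term for the four continuous targets, and the learning-rate halving rule. The equal weighting of the two loss terms and all function names are assumptions; the summary does not say how the losses were combined, and the authors' PyTorch code is not reproduced here.

```python
import math

def normalize(pixel, mean=0.55, sd=0.1885):
    """Standardize a pixel intensity with the reported training-set statistics."""
    return (pixel - mean) / sd

def bce_with_logit(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def multitask_loss(class_logits, class_targets, reg_preds, reg_targets):
    """Mean BCE over binary classes plus mean squared error over continuous targets.

    Summing the two terms with equal weight is an assumption.
    """
    cls = sum(bce_with_logit(z, y) for z, y in zip(class_logits, class_targets))
    cls /= len(class_logits)
    reg = sum((p - t) ** 2 for p, t in zip(reg_preds, reg_targets))
    reg /= len(reg_preds)
    return cls + reg

def halve_on_plateau(lr, floor=1e-5):
    """Halve the learning rate after a plateau, never going below the floor."""
    return max(lr * 0.5, floor)

# Toy example: six class logits (T2D first) and four continuous targets
# (age, HbA1c, BMI, RAF) that happen to be predicted exactly.
loss = multitask_loss(
    class_logits=[0.0] * 6,
    class_targets=[1, 0, 0, 0, 0, 0],
    reg_preds=[61.0, 6.2, 29.5, 1.1],
    reg_targets=[61.0, 6.2, 29.5, 1.1],
)
```

In practice the continuous targets would likely need rescaling so that age in years does not dominate the squared-error term; the summary does not describe how this was handled.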
Explainability: Positive-pixel occlusion maps (Captum; 15×15 window, stride 8, baseline 0) identified image regions important for T2D prediction. A latent-shift autoencoder (“gifsplanation”) generated counterfactual animations illustrating feature changes that modulated T2D prediction.
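The occlusion idea can be sketched without any library: slide a window over the image, replace the covered pixels with the baseline, and record how much the model's score drops. The version below is a simplified plain-Python illustration using the reported window, stride, and baseline. It is not Captum's implementation, and `toy_score` is a hypothetical stand-in for the trained model.

```python
def occlusion_map(image, score_fn, window=15, stride=8, baseline=0.0):
    """Average score drop per pixel over all windows covering it, keeping positives only."""
    h, w = len(image), len(image[0])
    base_score = score_fn(image)
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for top in range(0, max(h - window, 0) + 1, stride):
        for left in range(0, max(w - window, 0) + 1, stride):
            rows = range(top, min(top + window, h))
            cols = range(left, min(left + window, w))
            occluded = [row[:] for row in image]
            for y in rows:
                for x in cols:
                    occluded[y][x] = baseline
            # A positive drop means the hidden region supported the prediction.
            drop = base_score - score_fn(occluded)
            for y in rows:
                for x in cols:
                    total[y][x] += drop
                    count[y][x] += 1
    return [
        [max(total[y][x] / count[y][x], 0.0) if count[y][x] else 0.0 for x in range(w)]
        for y in range(h)
    ]

def toy_score(img):
    """Hypothetical model: the score depends only on the top-left 15x15 block."""
    return sum(img[y][x] for y in range(15) for x in range(15)) / 225.0

image = [[1.0] * 31 for _ in range(31)]
amap = occlusion_map(image, toy_score)
```

On this toy input the attribution is highest in the top-left corner, where the stand-in model looks, and zero in the bottom-right corner, which the model ignores.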
Comparator and combined models: A clinical LR model (no imaging) used age, sex, race/ethnicity, language, BMI, and SDI. A combined model (DL with LR) added the CXR DL prediction as a covariate. Subgroup analyses evaluated: (1) all T2D vs others; (2) poorly controlled T2D (any HbA1c ≥ 9%) vs others; (3) USPSTF screening cohort (BMI ≥ 25, age 35–70); and (4) BMI < 25, age 35–70.
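The subgroup definitions above translate into simple rules. The sketch below assumes the age range 35–70 is inclusive at both ends and that patients with unknown BMI fall outside both BMI subgroups; the function and label names are hypothetical.

```python
from typing import Iterable, Optional

def bmi_cohort(age: float, bmi: Optional[float]) -> str:
    """Assign a patient to the USPSTF-style cohort, the BMI < 25 cohort, or neither."""
    if bmi is None or not (35 <= age <= 70):  # inclusive bounds are an assumption
        return "outside_subgroups"
    return "uspstf_bmi_ge_25" if bmi >= 25 else "bmi_lt_25"

def is_poorly_controlled(has_t2d: bool, hba1c_values: Iterable[float]) -> bool:
    """Poorly controlled T2D: a T2D label plus any HbA1c of 9% or higher."""
    return has_t2d and any(v >= 9.0 for v in hba1c_values)

examples = [
    bmi_cohort(age=52, bmi=31.2),   # meets the BMI-based screening criteria
    bmi_cohort(age=52, bmi=22.4),   # same age range, BMI below 25
    bmi_cohort(age=29, bmi=31.2),   # too young for either subgroup
    bmi_cohort(age=60, bmi=None),   # BMI unknown
]
```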
Statistical analysis: Performance was assessed via ROC AUC with 95% CIs; DeLong’s method compared AUCs (Holm-Bonferroni for multiple comparisons). Thresholds were selected by Youden’s J (optimal thresholds: 0.04–0.16; 0.10 for main prospective/retrospective analyses, 0.20 external). Sensitivity, specificity, PPV, NPV, and F1 were reported at the optimal threshold. Nonparametric tests (Wilcoxon rank-sum, Kruskal–Wallis) compared DL scores across groups. Time-dependent AUCs assessed incidence prediction at 1, 3, 5, and 10 years in the retrospective cohort. A priori power analysis targeted ≥8,452 predictions assuming 15% prevalence to achieve ≥80% power; the prospective test set had 9,943 predictions at 16% prevalence.
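As a worked illustration of two of the metrics above, the sketch below computes ROC AUC as the probability that a randomly chosen positive case outscores a randomly chosen negative one, and picks the threshold that maximizes Youden's J (sensitivity + specificity − 1). The scores and labels are made-up toy values, and treating a score at or above the threshold as positive is an assumption.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy data, not from the study: six patients, two with T2D.
scores = [0.02, 0.04, 0.08, 0.12, 0.29, 0.45]
labels = [0, 0, 0, 1, 0, 1]

auc = roc_auc(scores, labels)                       # 7 of 8 pairs ranked correctly
threshold, j_stat = youden_threshold(scores, labels)
```

Here the pairwise AUC is 0.875, and the threshold 0.12 gives sensitivity 1.0 and specificity 0.75, so J = 0.75. The quadratic pairwise loop is fine for illustration but would be replaced by a rank-based computation at the cohort sizes reported in the study.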
Implementation: Inference used an NVIDIA Triton server integrated with Epic Radiant and a SQL database for batch processing of new CXRs and writing predictions back to the EHR environment.
Key Findings
- Prospective test cohort (n=9,943): CXR DL model ROC AUC = 0.84 (95% CI: 0.83–0.85), outperforming the clinical LR model AUC = 0.79 (95% CI: 0.77–0.80; P < 0.001). Combined model (DL with LR) AUC = 0.85 (95% CI: 0.84–0.85; P = 0.16 vs LR alone).
- Poorly controlled T2D subgroup (HbA1c ≥ 9%): DL AUC = 0.85 (95% CI: 0.83–0.86).
- USPSTF screening subgroup (BMI ≥ 25, age 35–70): DL AUC = 0.80 (95% CI: 0.79–0.82); combined model AUC = 0.81 (95% CI: 0.80–0.83; P = 0.03 vs LR baseline).
- BMI < 25, age 35–70 subgroup: DL AUC = 0.89 (95% CI: 0.85–0.92), outperforming LR (P < 0.001).
- Model scores were significantly higher for T2D vs no T2D (median 0.29 vs 0.04; P < 2.2×10^-16) and increased from no T2D to controlled to poorly controlled T2D (P < 2.2×10^-16).
- Screening opportunities: 1,381 patients (14% of prospective cohort) without prior T2D diagnosis or HbA1c had DL scores above the threshold (0.10), including 147 who would not meet USPSTF screening (BMI < 25) and 70 with unknown BMI.
- Retrospective k-fold cohort incidence: 7,409/29,420 (25%) developed T2D after index CXR; 71% had DL prediction > 0.10 at baseline. Time-dependent AUCs: 1-year 0.80 (95% CI: 0.79–0.81); 3-year 0.79 (0.78–0.80); 5-year 0.79 (0.78–0.80); 10-year 0.78 (0.77–0.79). Mean delay from index CXR to diagnosis: 1,057 days (SD 1,005); median 738 days (IQR 256, 1,590).
- External validation (Emory; n=5,026): AUC = 0.77; incidence rate 20.4 (95% CI: 18–23) per 1,000 person-years; 249 patients (≈5%) diagnosed with T2D after initial CXR, with 146 (59%) flagged by the model at baseline.
- Model equity: No significant AUC differences across race/ethnicity groups after correction; slight difference by sex (male AUC 0.83 vs female 0.85; P = 0.045).
- Explainability: Occlusion maps and autoencoder counterfactuals implicated central adiposity (mediastinal, upper abdomen, supraclavicular regions) and rib/clavicle attenuation as predictive features, aligning with known associations of upper-body adiposity with insulin resistance and T2D.
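The screening-opportunity finding in the list above amounts to a filter over the cohort: patients with neither a prior T2D diagnosis nor an HbA1c on file whose DL score exceeds 0.10. A minimal sketch follows, with hypothetical field names and made-up patients; reading "without prior T2D diagnosis or HbA1c" as "neither" is an interpretation.

```python
THRESHOLD = 0.10  # operating threshold reported for the prospective analysis

def screening_candidates(patients, threshold=THRESHOLD):
    """Undiagnosed, untested patients whose DL score is above the threshold."""
    return [
        p for p in patients
        if not p["has_t2d_dx"] and p["hba1c"] is None and p["dl_score"] > threshold
    ]

def missed_by_bmi_rule(p):
    """Known BMI below 25, so BMI-based screening criteria would not apply."""
    return p["bmi"] is not None and p["bmi"] < 25

# Hypothetical toy cohort.
patients = [
    {"id": "a", "has_t2d_dx": False, "hba1c": None, "dl_score": 0.31, "bmi": 23.0},
    {"id": "b", "has_t2d_dx": False, "hba1c": None, "dl_score": 0.05, "bmi": 31.0},
    {"id": "c", "has_t2d_dx": True,  "hba1c": 9.4,  "dl_score": 0.62, "bmi": 34.0},
    {"id": "d", "has_t2d_dx": False, "hba1c": None, "dl_score": 0.18, "bmi": None},
]

flagged = screening_candidates(patients)
lean_flagged = [p["id"] for p in flagged if missed_by_bmi_rule(p)]
unknown_bmi = [p["id"] for p in flagged if p["bmi"] is None]

# Share of the prospective cohort flagged, from the reported counts.
flagged_share = 1381 / 9943
```

The reported counts give a flagged share of about 13.9%, which matches the 14% stated in the findings.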
Discussion
The study demonstrates that a DL model using routine frontal CXRs, augmented by minimal EHR data, can detect prevalent T2D with high discriminative performance, outperforming a clinical LR model based on demographic and administrative variables alone. The model is particularly effective among patients with BMI < 25, a group less likely to be flagged by BMI-based screening, and achieves sensitivity and specificity in the BMI ≥ 25 subgroup that exceed previously reported USPSTF guideline performance. The ability to identify poorly controlled T2D supports targeted follow-up and intensive management. Explainability analyses indicate the model leverages central adiposity and osseous attenuation cues, providing biologically plausible mechanisms that extend beyond BMI and may capture visceral fat distribution. The model’s performance in retrospective incidence prediction and external validation suggests potential for opportunistic population-level screening using existing imaging workflows, potentially improving early detection in underserved populations with limited access to primary care.
Conclusion
A deep learning model applied to ambulatory frontal CXRs can accurately identify patients with T2D, including poorly controlled cases, and can augment conventional, BMI-based screening strategies. Given the ubiquity and low cost of CXRs, this approach offers a scalable, opportunistic screening tool to trigger confirmatory HbA1c testing and facilitate earlier intervention, with potential to reduce downstream complications of diabetes. Future work should include larger, calibrated multi-institutional validations, prospective clinical impact studies, fairness and drift monitoring, integration with automated ordering pathways for confirmatory testing, and exploration of predictive value for future T2D among individuals with normal HbA1c.
Limitations
Key limitations include: (1) lack of fasting blood glucose data and incomplete HbA1c availability for many patients, which may affect labeling and training accuracy; (2) use of only ambulatory, non-portable CXRs without support devices, limiting generalizability to inpatient/portable settings; (3) limited and uncalibrated external validation with a smaller dataset; (4) absence of multi-year prospective follow-up for the test cohort; (5) inability to directly compare against USPSTF criteria in all eligible patients due to missing HbA1c data; and (6) need for further assessment of fairness, robustness to model drift, and out-of-sample performance prior to widespread clinical deployment. There is also debate about the utility of very early T2D detection, which should be considered when designing implementation pathways.