Predicting 1, 2 and 3 year emergent referable diabetic retinopathy and maculopathy using deep learning


P. Nderitu, J. M. N. D. Rio, et al.

Discover how advanced deep learning systems developed by Paul Nderitu and colleagues can revolutionize diabetic retinopathy screening by predicting the progression of sight-threatening conditions up to three years in advance. This innovative research offers a promising solution for high-risk individuals while alleviating the screening burden on those at low risk.

Introduction
Diabetic retinopathy (DR) is common among people with diabetes and a leading cause of vision loss. The UK Diabetic Eye Screening Programme (DESP) screens millions annually using colour fundus photographs (CFPs), but most attendees are low-risk and current annual screening is resource intensive. Key clinical decisions hinge on identifying emergent referable disease (referable DR or referable maculopathy). Existing non-AI risk models often require invasive blood tests not routinely collected in screening, complex data linkage, and/or rely on prior retinopathy grades with limited specificity, potential delays, and sensitivity issues for some ethnic groups. Recent deep learning (DL) studies suggest CFPs can predict progression, but many had limited populations, excluded mild DR, or did not separately predict maculopathy. This study aims to develop and validate tabular, image, and multimodal DL systems that, from a single screening visit, predict emergent referable DR, emergent referable maculopathy, or either at 1-, 2-, and 3-year horizons, enabling individualized risk-based screening.
Literature Review
The authors review prior risk-based approaches to DR progression. Non-AI systemic risk-factor models and grade-based models often require HbA1c/lipid tests and complex data linkage, and have limited specificity for identifying low-risk individuals; some require two sequential visits and may delay detection, with disparities noted in certain ethnic groups. Early DL studies showed feasibility using CFPs, but often included small, non-representative cohorts or excluded patients with mild DR, limiting applicability. More recent DL work (e.g., Bora et al., Rom et al., Dai et al.) predicted progression using CFPs, with reported AUROC values of around 0.82–0.89 in external validation, depending on interval and outcome. However, prior studies generally did not independently model emergent referable DR and emergent referable maculopathy, despite their differing risks to vision. This study addresses these gaps by jointly developing tabular, image, and multimodal deep learning systems (DLS) to predict both outcomes independently across 1-, 2-, and 3-year intervals, with internal and external validation.
Methodology
Design: Retrospective longitudinal prognostic modeling study using UK DESP data, conducted in accordance with the Declaration of Helsinki and the TRIPOD reporting guideline. Ethics approvals were obtained; consent was waived for pseudonymised data.
Population and datasets: Individuals aged ≥12 years screened between Sept 2013 and Dec 2019. Data from SEL-DESP (27 sites) were used for development and internal testing, and data from BSBC-DESP (110 sites) for external testing. After curation, 1,181,858 gradable two-field CFPs from 102,446 individuals (SEL-DESP) and 66,286 CFPs from 4,772 individuals (BSBC-DESP) were included. The SEL-DESP dataset was split into 80% development (88% train, 12% tune) and 20% internal test; 100% of BSBC-DESP was used for external testing.
Longitudinal cohort formation: Of 162,339 development-set eyes, 110,837 had eligible longitudinal data; the 51,502 eyes without valid longitudinal pairs were used for pretraining. Valid baseline–predict visit pairs were 1, 2, or 3 years apart, with tolerances of ±2, ±4, or ±6 months respectively. Baseline visits with referable DR or maculopathy were excluded. Eyes were included in the 1-, 2-, or 3-year cohorts as applicable; individuals contributed to only one partition per interval.
Outcomes: Three binary outcomes were predicted per interval: (1) emergent referable DR (R2+; moderate non-proliferative or worse), (2) emergent referable maculopathy (M1), and (3) either outcome (R2+ | M1). An outcome was positive if it occurred at any time between the baseline and predict visits.
Inputs: Two-field CFPs (macula and nasal) and risk factor characteristics (age, sex, ethnicity, diabetes type, diabetes duration, best visual acuity, index of multiple deprivation rank).
Preprocessing: Images were center-cropped to a circular mask, resized to 512×512, and normalized with ImageNet constants; left-eye images were horizontally flipped. For tabular data, continuous variables were median-imputed and normalized, and categorical variables were embedded.
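The pairing rule for baseline and predict visits (1, 2, or 3 years apart, within ±2, ±4, or ±6 months) can be illustrated with a minimal Python sketch. The function names and the conversion of days to months using an average month length are assumptions made for illustration; the summary does not state how the authors measured the gap between visits.

```python
from datetime import date
from typing import Optional

# Target interval in years -> tolerance in months, as stated in the text.
TOLERANCE_MONTHS = {1: 2, 2: 4, 3: 6}

# Assumption: months are approximated with the average Gregorian month length.
DAYS_PER_MONTH = 30.4375


def months_between(baseline: date, later: date) -> float:
    """Approximate gap between two visits, in months."""
    return (later - baseline).days / DAYS_PER_MONTH


def matching_interval(baseline: date, later: date) -> Optional[int]:
    """Return 1, 2, or 3 if the visit pair falls inside that interval's
    tolerance window, otherwise None."""
    gap = months_between(baseline, later)
    for years, tolerance in TOLERANCE_MONTHS.items():
        if abs(gap - 12 * years) <= tolerance:
            return years
    return None


if __name__ == "__main__":
    baseline = date(2015, 1, 1)
    for later in (date(2016, 1, 15), date(2016, 7, 1), date(2018, 4, 1)):
        print(later, matching_interval(baseline, later))
```

With these tolerances the three windows (10–14, 20–28, and 30–42 months) do not overlap, so a visit pair maps to at most one interval.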
Models: The image DLS used two EfficientNet-V2-s backbones (one per field, no weight sharing) with concatenated pre-classification feature maps, initialized with ImageNet weights. The tabular DLS used TabNet with random initialization. The multimodal DLS was formed by averaging the predictions of the trained image and tabular DLS at test time (early fusion via feature concatenation underperformed).
Training: Models were pretrained on cross-sectional classification of DR severity (R0, R1, R2, R3a) and maculopathy (M0/M1). Longitudinal training optimized a multi-label binary objective to predict emergent referable DR, emergent referable maculopathy, or either, plus an auxiliary objective to detect baseline mild–moderate DR (R1). Primary and auxiliary losses were equally weighted. Data augmentations, hyperparameters, and further details are in the Supplementary materials.
Statistical analysis: Performance was summarized by AUROC with 95% CIs computed via the DeLong method and stratified by age, sex, and ethnicity. For individual-level analysis, the maximum predicted risk across an individual's eyes was evaluated against the worst outcome across those eyes. A Bonferroni-corrected significance threshold of p<0.017 (0.05/3) was applied across the three predefined outcomes. Specificity at 80% sensitivity was reported for comparability with a prior UK systemic risk-factor model. PPV/NPV curves were reported with 95% CIs by the Clopper–Pearson method. False negatives were analyzed by grade at the predict visit. Image ablation and attribution analyses localized salient regions; tabular attributions assessed the importance of each characteristic.
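The two aggregation steps described above, averaging the image and tabular predictions at test time and collapsing eye-level results to the individual level, can be sketched in plain Python. This is an illustrative sketch only: the function names, the data layout, and the example values are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Sequence, Tuple


def late_fusion(image_probs: Sequence[float],
                tabular_probs: Sequence[float]) -> List[float]:
    """Average per-eye probabilities from two separately trained models."""
    if len(image_probs) != len(tabular_probs):
        raise ValueError("both models must score the same eyes")
    return [(i + t) / 2.0 for i, t in zip(image_probs, tabular_probs)]


def aggregate_to_individual(person_ids: Sequence[str],
                            eye_risk: Sequence[float],
                            eye_outcome: Sequence[int]
                            ) -> Dict[str, Tuple[float, int]]:
    """For each person, keep the maximum predicted risk and the worst
    (maximum) binary outcome across their eyes."""
    result: Dict[str, Tuple[float, int]] = {}
    for pid, risk, outcome in zip(person_ids, eye_risk, eye_outcome):
        prev_risk, prev_outcome = result.get(pid, (0.0, 0))
        result[pid] = (max(prev_risk, risk), max(prev_outcome, outcome))
    return result


if __name__ == "__main__":
    # Hypothetical scores for three eyes belonging to two people.
    fused = late_fusion([0.10, 0.70, 0.20], [0.30, 0.50, 0.40])
    people = aggregate_to_individual(["a", "a", "b"], fused, [0, 1, 0])
    print(fused)
    print(people)
```

Averaging outputs keeps the two models independent, so each can be trained and tuned separately, which is consistent with the text's note that feature-level (early) fusion underperformed.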
Key Findings
- Performance (multimodal DLS AUROC, 95% CI):
  - Internal test:
    - 1-year: R2+ 0.95 (0.92–0.98), M1 0.84 (0.82–0.86), either 0.85 (0.83–0.87)
    - 2-year: R2+ 0.92 (0.87–0.96), M1 0.84 (0.82–0.87), either 0.85 (0.82–0.87)
    - 3-year: R2+ 0.85 (0.80–0.90), M1 0.79 (0.76–0.82), either 0.79 (0.76–0.82)
  - External test:
    - 1-year: R2+ 0.93 (0.88–0.97), M1 0.85 (0.80–0.89), either 0.85 (0.76–0.85)
    - 2-year: R2+ 0.93 (0.89–0.97), M1 0.79 (0.74–0.84), either 0.80 (0.76–0.85)
    - 3-year: R2+ 0.91 (0.84–0.98), M1 0.79 (0.74–0.83), either 0.79 (0.74–0.84)
- Image and multimodal DLS significantly outperformed tabular DLS across all intervals and outcomes (Bonferroni-adjusted p<0.017). Multimodal showed modest gains over image-only, particularly for maculopathy (+0.01 to +0.06 AUROC).
- Generalisation: Minimal AUROC drop externally (−0.02 to −0.06). Pretraining and auxiliary baseline R1 detection improved internal AUROC by +0.2 to +0.5.
- Individual-level AUROC closely matched eye-level results.
- Specificity at 80% sensitivity (multimodal):
  - Internal: 1y R2+ 0.96, M1 0.69, either 0.74; 2y R2+ 0.92, M1 0.69, either 0.70; 3y R2+ 0.79, M1 0.58, either 0.58.
  - External: 1y R2+ 0.92, M1 0.76, either 0.80; 2y R2+ 0.93, M1 0.58, either 0.60; 3y R2+ 0.87, M1 0.59, either 0.59.
- Safety signals: Among progressing eyes, undetected progression to proliferative DR was rare (internal/external: 1y 0.2%/0.0%; 2y 0.3%/0.0%; 3y 0.6%/0.0%). Most false negatives were missed referable maculopathy (~14–20% depending on interval and dataset).
- Predictive values: Low incidence of referable outcomes led to high NPVs and constrained PPVs, with better PPV/NPV for image/multimodal vs tabular.
- Subgroups: AUROCs broadly consistent across age, sex, ethnicity; occasional underperformance (≥0.10 below mean) appeared in only one interval per outcome and only in one of the test sets.
- Ablation/attribution: Masking macula field reduced performance substantially, especially for maculopathy and combined outcome; masking nasal field had little effect. Attribution maps highlighted central macula as salient for predictions. Tabular attributions associated higher risk with longer diabetes duration, younger age, male sex, Black/mixed/unspecified ethnicity, and worse visual acuity; socioeconomic deprivation had limited effect.
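Two of the quantities reported above, specificity at a fixed 80% sensitivity and the dependence of PPV/NPV on outcome incidence, can be made concrete with a short Python sketch. The function names, the toy scores, and the 1% prevalence are hypothetical values chosen for illustration; the summary does not report the incidence of each outcome, and the threshold search shown here is one simple way to find an operating point, not necessarily the authors' procedure.

```python
from typing import Sequence, Tuple


def specificity_at_sensitivity(scores: Sequence[float],
                               labels: Sequence[int],
                               target_sensitivity: float = 0.80) -> float:
    """Highest specificity among thresholds whose sensitivity meets the target.
    An eye is called positive when its score is >= the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            best = max(best, specificity)
    return best


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """PPV and NPV from Bayes' rule."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tn = specificity * (1.0 - prevalence)
    fn = (1.0 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


if __name__ == "__main__":
    # Toy data: five positive and five negative eyes (hypothetical scores).
    scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.3, 0.1, 0.05, 0.01]
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(specificity_at_sensitivity(scores, labels))

    # Reported operating point (internal 1-year R2+: sensitivity 0.80,
    # specificity 0.96) combined with a HYPOTHETICAL prevalence of 1%.
    ppv, npv = predictive_values(0.80, 0.96, prevalence=0.01)
    print(round(ppv, 3), round(npv, 3))
```

Under the assumed 1% prevalence, even a specificity of 0.96 leaves false positives outnumbering true positives, which is why PPV stays constrained while NPV remains high when the outcome is rare.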
Discussion
The study demonstrates that deep learning using single-visit screening data can accurately predict emergent referable DR and referable maculopathy across 1–3 year horizons. Image-based and multimodal models substantially outperform risk-factor-only models, indicating that CFPs encode rich prognostic information and that adding routine demographic/clinical variables provides incremental benefit, notably for maculopathy. Performance generalized well to an external DESP with different demographics and devices, and safety analyses showed very low rates of missed emergent proliferative DR at an 80% sensitivity operating point. Comparisons to prior literature show competitive or superior AUROCs and address a gap by independently modeling DR and maculopathy progression. Attribution and ablation analyses localize prognostically important features to the central macula, suggesting subclinical foveal changes may precede clinically apparent disease. These findings support using DLS as a risk-stratification engine to individualize screening intervals—reducing burden for low-risk individuals while enabling timely referral and potential early treatment for high-risk patients.
Conclusion
The authors developed and externally validated tabular, image, and multimodal deep learning systems that accurately predict emergent referable DR and emergent referable maculopathy at 1-, 2-, and 3-year intervals using non-invasive, single-visit screening data. Image and multimodal approaches outperform tabular risk-factor models, with modest multimodal gains. These systems could enable individualized, risk-based diabetic eye screening, guide earlier preventive interventions, and inform service planning. Future work should assess prospective safety and effectiveness, explore broader architectures and training strategies, validate across additional devices and populations, refine operating thresholds, and examine implementation aspects such as fairness, cost-effectiveness, and human–AI workflows.
Limitations
- Potential biases in self-reported/tabular data (e.g., ethnicity) and possible data entry errors, though mitigated by electronic data transfer (GP2DRS).
- Grading variability may affect ground-truth labels; however, UK DESP quality assurance and the clear definition of referable disease reduce impact.
- Residual biases may persist due to acquisition differences, grading, screening provision, or societal inequalities; subgroup incidence was low, limiting robust subgroup conclusions.
- Imaging device variation between datasets; generalization to other devices requires further validation.
- Prediction horizons limited to 1–3 years; shorter or longer intervals were not assessed.
- Not exhaustive in architectures/techniques; alternatives (e.g., foundation models like RETFound, transformers, unsupervised pretraining, distillation) and additional auxiliary tasks or survival-based modeling warrant exploration.
- Need for larger validation cohorts and prospective studies to evaluate safety, fairness, acceptability, cost, and integration; determination of clinically appropriate operating thresholds remains future work.