
Medicine and Health

Deep learning from "passive feeding" to "selective eating" of real-world data

Z. Li, C. Guo, et al.

Discover how a groundbreaking deep learning-based image filtering system (DLIFS) enhances AI diagnostic performance for ocular fundus diseases. This innovative approach, developed by Zhongwen Li and colleagues, filters out poor-quality images, ensuring more accurate diagnostics in real-world applications. Find out how this research can transform AI in healthcare!

~3 min • Beginner • English
Introduction
The study addresses a key challenge in deploying deep learning-based AI diagnostic systems for ocular diseases in real-world settings: image quality variability. While ultra-widefield fundus (UWF) imaging is widely used for screening ocular fundus diseases and numerous AI systems have demonstrated high performance using only good-quality images, their effectiveness on real-world, mixed-quality images is uncertain. Factors such as patient noncompliance, operator error, hardware imperfections, and obscured optical media often degrade image quality, risking loss of diagnostic information and compromising downstream analysis. Manual quality grading is labor-intensive and impractical for high-throughput screening. The research objective is to develop and validate an automated deep learning-based image filtering system (DLIFS) that detects and filters poor-quality UWF images, ensuring that only good-quality images are forwarded to subsequent AI diagnostic systems, and to evaluate whether this approach improves diagnostic performance for multiple retinal conditions in real-world datasets.
Literature Review
Prior AI studies in ophthalmology and dermatology have shown high diagnostic accuracy with high-quality images (e.g., AUC ~0.99 for diabetic retinopathy and >0.9 for skin cancer). Multiple UWF image-based AI systems exist for retinal diseases, but they were trained and evaluated on good-quality images only, limiting direct applicability to real-world mixed-quality data. Earlier automated fundus image quality assessment methods (primarily on conventional, not UWF, images) used hand-crafted features (illumination, naturalness, structure) or vessel clarity-based metrics, achieving sensitivities around 94–100% and specificities around 92–99% on small datasets of 80–216 images. Deep learning-based quality assessment has also been explored for non-UWF images. However, prior to this work, no automated filtering system existed specifically for UWF cameras or UWF-based AI pipelines. This study fills that gap with a large, multi-institutional dataset and demonstrates integration benefits with established diagnostic systems.
Methodology
Datasets: 36,070 UWF images from 19,684 individuals were collected through the Chinese Medical Alliance for Artificial Intelligence (CMAAI) between June 2016 and September 2019 using an OPTOS Daytona nonmydriatic camera (200-degree field of view). Sources comprised Shenzhen Eye Hospital (15,322 images), the Huazhong screening program (7,387), the Eastern Guangdong Eye Study (4,929), and the Southern China Guangming Screening program (8,432). Two external validation datasets were used: ZOC (Zhongshan Ophthalmic Centre), with 1,532 images from 828 individuals, and XOH (Xudong Ophthalmic Hospital), with 2,960 images from 1,177 individuals. All images were deidentified, IRB approval was obtained (ZOC 2019KYPJ107), and images were acquired without pupil dilation.

Quality criteria: An image was labeled poor quality if any of the following applied: (1) more than one-third of the fundus was obscured; (2) macular vessels were not identifiable or more than 50% of the macula was obscured; (3) vessels within one disc diameter of the optic disc margin were not identifiable. Otherwise, the image was labeled good quality. The criteria were applied irrespective of whether lesions lay in the peripheral or posterior retina.

Labeling and reference standard: Three board-certified retina specialists (each with ≥5 years of UWF experience) independently labeled images as good or poor quality, masked to model outputs. Agreement among all three defined the reference standard; disputed images were arbitrated by a senior retina specialist (>20 years of experience). Of 679 disputed images, 223 were judged poor quality and 456 good quality. In total, 40,562 images were used, comprising 32,661 good-quality and 7,901 poor-quality images.

Preprocessing and augmentation: Images were resized to 512×512 pixels and pixel values were normalized to [0,1]. Data augmentation (training set only) included random horizontal/vertical flips, rotations of up to 90°, and brightness shifts (factor 0.8–1.6), expanding the training set fivefold (from 25,241 to 126,205 images).

Model and training: The DLIFS used an InceptionResNetV2 backbone initialized with ImageNet-pretrained weights. Training used a binary cross-entropy loss and the Adam optimizer (adaptive moment estimation) for up to 180 epochs, with early stopping if the validation loss failed to improve for 60 consecutive epochs; the model with the lowest validation loss was retained. CMAAI data were split 7:1.5:1.5 into training, validation, and test sets with no overlap of individuals.

Evaluation: Performance metrics included AUC (empirical bootstrap, 1,000 replicates) and sensitivity and specificity with 95% CIs (Wilson score method); ROC curves were plotted. External validation on the ZOC and XOH datasets used the same reference standards and metrics. Gradient-based saliency maps were generated on ZOC data to highlight the regions driving poor-quality classifications.

Impact on established diagnostic systems: Previously developed AI systems (trained on good-quality images only) for lattice degeneration/retinal breaks (LDRB), glaucomatous optic neuropathy (GON), and retinal exudation/drusen (RED) were evaluated on the external datasets with and without DLIFS filtering. Performance was compared across three scenarios: mixed-quality images (no DLIFS), good-quality images only (with DLIFS), and poor-quality images only. Where applicable, the training and validation datasets for these systems did not overlap with the external evaluation datasets. Disease proportions in good- versus poor-quality images were compared with two-sided two-proportion Z-tests (α=0.05).
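As a concrete illustration of the preprocessing and augmentation described above, a minimal sketch using tf.keras utilities is shown below; the directory layout, batch size, and generator-based workflow are assumptions for illustration, not the authors' actual code.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (512, 512)  # images resized to 512x512, as described above

# Training generator: random horizontal/vertical flips, rotations up to 90 degrees,
# brightness shifts in [0.8, 1.6], and pixel values rescaled to [0, 1].
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=90,
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.8, 1.6),
)

# Validation/test generator: rescaling only, no augmentation.
eval_datagen = ImageDataGenerator(rescale=1.0 / 255)

# Hypothetical directory with good_quality/ and poor_quality/ subfolders.
train_flow = train_datagen.flow_from_directory(
    "uwf/train",
    target_size=IMG_SIZE,
    class_mode="binary",
    batch_size=16,
)
```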
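The classifier can be sketched in the same spirit: an InceptionResNetV2 backbone with ImageNet weights, binary cross-entropy, the Adam optimizer, early stopping on validation loss with a patience of 60 epochs, and retention of the best checkpoint. The classification head and any hyperparameters not stated above are assumptions.

```python
import tensorflow as tf
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras import layers, models, callbacks

# InceptionResNetV2 backbone with ImageNet-pretrained weights, global average pooling,
# and a single sigmoid unit for the poor- vs good-quality decision.
base = InceptionResNetV2(weights="imagenet", include_top=False,
                         input_shape=(512, 512, 3), pooling="avg")
output = layers.Dense(1, activation="sigmoid")(base.output)
model = models.Model(inputs=base.input, outputs=output)

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])

cbs = [
    # Stop if validation loss does not improve for 60 consecutive epochs,
    # and keep the weights from the best epoch.
    callbacks.EarlyStopping(monitor="val_loss", patience=60, restore_best_weights=True),
    callbacks.ModelCheckpoint("dlifs_best.h5", monitor="val_loss", save_best_only=True),
]

# train_flow / val_flow are the generators from the preprocessing sketch above.
# model.fit(train_flow, validation_data=val_flow, epochs=180, callbacks=cbs)
```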
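The reported statistics can likewise be computed with standard libraries; the sketch below assumes a percentile bootstrap for the AUC interval and uses placeholder confusion-matrix counts, since the exact resampling details are not given in this summary.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """AUC with an empirical (percentile) bootstrap CI over n_boot replicates."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:      # skip single-class resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Wilson score 95% CI for sensitivity = TP / (TP + FN); counts here are placeholders.
tp, fn = 950, 30
sens, sens_ci = tp / (tp + fn), proportion_confint(tp, tp + fn, method="wilson")

# Two-sided two-proportion Z-test, e.g. GON prevalence in poor- vs good-quality
# ZOC images (67/242 vs 258/1290, as reported in Key Findings).
stat, p_value = proportions_ztest(count=[67, 258], nobs=[242, 1290])
```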
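Finally, a minimal gradient-based saliency sketch is given below; the study's heatmaps are gradient-based, but the exact visualization method and normalization used here are assumptions.

```python
import tensorflow as tf

def saliency_map(model, image):
    """Gradient of the poor-quality score w.r.t. pixels; image is (512, 512, 3) in [0, 1]."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[:, 0]          # predicted probability of poor quality
    grads = tape.gradient(score, x)                     # d(score) / d(pixels)
    sal = tf.reduce_max(tf.abs(grads), axis=-1)[0]      # collapse colour channels
    sal = (sal - tf.reduce_min(sal)) / (tf.reduce_max(sal) - tf.reduce_min(sal) + 1e-8)
    return sal.numpy()                                  # overlay on the UWF image as a heatmap
```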
Key Findings
- Dataset composition: 40,562 UWF images from 21,689 individuals (age 3–86 years; mean 48.3 years; 44.3% female). After adjudication, 32,661 good-quality and 7,901 poor-quality images.
- DLIFS performance:
  - CMAAI test set: AUC 0.996 (95% CI 0.995–0.997); sensitivity 96.9% (96.3–98.3); specificity 96.6% (96.1–97.1).
  - ZOC external set: AUC 0.994 (95% CI 0.989–0.997); sensitivity 95.6% (92.9–98.3); specificity 97.9% (97.1–98.7).
  - XOH external set: AUC 0.997 (95% CI 0.995–0.998); sensitivity 96.6% (94.5–98.7); specificity 98.8% (98.4–99.2).
- Visual interpretability: Saliency heatmaps consistently highlighted blurred/obscured regions in poor-quality images, aiding interpretation and potential workflow integration.
- Impact on established AI diagnostic systems (AUCs):
  - ZOC dataset:
    - GON: good-quality 0.988 (95% CI 0.980–0.994); mixed-quality 0.964 (0.952–0.975); poor-quality 0.810 (0.739–0.879).
    - RED: good-quality 0.967 (0.954–0.979); mixed-quality 0.941 (0.924–0.957); poor-quality 0.808 (0.731–0.879).
  - XOH dataset:
    - LDRB: good-quality 0.990 (0.983–0.995); mixed-quality 0.947 (0.927–0.967); poor-quality 0.635 (0.543–0.729).
    - GON: good-quality 0.995 (0.993–0.997); mixed-quality 0.982 (0.976–0.987); poor-quality 0.853 (0.791–0.907).
    - RED: good-quality 0.982 (0.969–0.993); mixed-quality 0.947 (0.928–0.965); poor-quality 0.779 (0.710–0.848).
- Applying DLIFS (good-quality inputs only) increased the sensitivities of the LDRB, GON, and RED systems relative to mixed-quality inputs, with comparable specificities.
- Disease distribution differences (poor- vs good-quality images):
  - ZOC: GON 27.6% (67/242) vs 20.0% (258/1290), p=0.009; RED 25.2% (61/242) vs 18.1% (233/1290), p=0.01.
  - XOH: LDRB 14.2% (45/317) vs 5.9% (156/2643), p<0.001; GON 23.3% (74/317) vs 8.4% (306/2643), p<0.001; RED 24.6% (78/317) vs 7.2% (190/2643), p<0.001.
- Overall, 27.7% (67/242, ZOC) and 30.3% (96/317, XOH) of poor-quality images required further clinical investigation according to the established AI systems.
Discussion
The DLIFS successfully addresses the challenge of variable image quality in real-world UWF imaging by accurately detecting and filtering poor-quality images, thereby ensuring downstream diagnostic AI systems operate on reliable inputs. High AUCs across internal and external datasets demonstrate strong generalizability. Heatmap visualizations enhance interpretability and can guide photographers to reimage targeted regions. Integrating DLIFS prior to disease-specific AI systems significantly improves diagnostic performance in clinical settings; the notable drop in AUCs on poor-quality images underscores that models trained on good-quality images do not generalize to poor-quality inputs. Moreover, the higher prevalence of referable disease in poor-quality images suggests that flagged cases may warrant clinical attention, aligning DLIFS outputs with triage priorities. Operationally, DLIFS enables real-time feedback during acquisition (reacquisition or referral if persistent poor quality) and seamless routing of good-quality images to diagnostic pipelines, supporting scalable screening and telemedicine workflows.
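The acquisition-time workflow described above can be summarized in a short routing sketch; the operating threshold, retry limit, and function names are illustrative assumptions, not values from the study.

```python
POOR_QUALITY_THRESHOLD = 0.5   # hypothetical DLIFS operating point
MAX_ATTEMPTS = 3               # hypothetical limit on reacquisition attempts

def screen_and_route(capture_image, dlifs_model, diagnostic_models):
    """Filter each capture with DLIFS; route good images to diagnosis, else reimage or refer."""
    for _ in range(MAX_ATTEMPTS):
        image = capture_image()                      # acquire (or reacquire) a UWF image
        p_poor = float(dlifs_model(image))           # DLIFS probability that quality is poor
        if p_poor < POOR_QUALITY_THRESHOLD:
            # Good-quality image: forward to the disease-specific AI systems (e.g. LDRB, GON, RED).
            return {name: m(image) for name, m in diagnostic_models.items()}
    # Persistently poor quality: flag for clinical review / referral.
    return {"status": "refer", "reason": "persistently poor image quality"}
```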
Conclusion
A deep learning-based image filtering system (DLIFS) was developed and validated to automatically distinguish poor- from good-quality UWF fundus images with high sensitivity and specificity across multiple centers. Pre-filtering with DLIFS enhances the performance of established AI diagnostic systems in real-world settings by ensuring input quality and providing interpretable heatmaps to guide acquisition. The work advocates a shift from passive acceptance of all real-world inputs to selective preprocessing for image-based AI. Future research should integrate DLIFS with diverse disease-detection models, refine feedback to identify specific causes of poor quality to guide corrective actions, and develop strategies to reduce human-factor-related poor-quality acquisitions while balancing referral burden.
Limitations
- Referring all cases with poor-quality images could increase healthcare burden through false positives; strategies to reduce unnecessary referrals are needed.
- The DLIFS does not identify the specific cause of poor quality (e.g., motion blur, media opacity, eyelid obstruction), limiting tailored corrective feedback during acquisition.
- Although the study was multi-center and large-scale, generalizability to other devices, populations, and imaging protocols warrants further study.