A generalised computer vision model for improved glaucoma screening using fundus images

A. K. Chaurasia, G. Liu, et al.

A team of researchers, including Abadh K. Chaurasia and Puya Gharahkhani, developed a deep-learning algorithm for glaucoma screening from diverse fundus images. The model was trained and evaluated on images pooled from many public datasets, with the aim of supporting population-level screening.

Introduction

Glaucoma is a multifactorial optic neuropathy and a leading cause of irreversible vision loss. Early disease is often asymptomatic, making proactive screening essential. Conventional diagnostic methods (intraocular pressure measurement, OCT, visual fields) can be time-consuming, costly, require specialist interpretation, and may miss normal-tension glaucoma. Access to comprehensive eye care is limited in many regions, and existing screening approaches can yield high false-positive rates. Fundus photography is a non-invasive, widely available modality that captures optic nerve head features pertinent to glaucoma. Deep-learning methods applied to fundus images show promise, but many existing models are trained on limited, population-specific datasets and lack generalisability. This study aims to develop and validate a generalised deep-learning model for glaucoma screening using fundus images aggregated from 20 publicly accessible datasets spanning diverse populations and imaging devices.

Literature Review

Prior work has shown deep learning can detect glaucomatous optic neuropathy from fundus photographs, but many studies used single-institution or limited datasets, often leading to models that do not generalise well across populations or imaging conditions. Meta-analyses indicate deep-learning performance can approach or exceed ophthalmologists’ screening accuracy in some settings, though performance drops have been observed with multiethnic data and variable image quality. The literature also highlights challenges with inconsistent ground-truth definitions and dataset labelling quality, which can impede reliable model training and external validity. These gaps motivated assembling a large, diverse dataset and systematically comparing multiple CNN architectures to identify a robust, generalisable screening model.

Methodology

Datasets: Fundus images were aggregated from 20 publicly accessible datasets across 12 countries (plus two unknown), totaling 117,152 images. After exclusions (e.g., non-referral glaucomatous images from EyePACS to avoid bias; 512 ungradable images removed), a working disc-image dataset of 18,468 images (10,900 healthy; 7,568 glaucoma) from 19 datasets was used for model development; Drishti-GS was reserved for external validation. Ground truth labels in public datasets varied in source and rigor; Drishti-GS labels were derived from four experts (3–20 years’ experience).

Preprocessing: Source images varied between full fundus photographs and disc crops. OpenCV was used to detect and isolate the optic nerve head (ONH) via grayscale conversion, Gaussian blur, and the Hough Circle Transform. ONH regions were cropped and resized to 512×512, then uniformly downscaled to 224×224×3 for model input. Images in which automated ONH detection failed, or in which the detected ONH region was off-center, were cropped manually to preserve clinically relevant content.

Data augmentation: To enhance robustness and mitigate overfitting, random rotations, crops, flips, scaling, lighting adjustments, affine transforms, and zoom were applied, with normalization using ImageNet statistics.

Model selection and training: Twenty CNN architectures were compared using Fastai’s cnn_learner, including ResNets (18–152), VGG (16, 19 with batch norm), AlexNet, DenseNets (121–201), SqueezeNets (1.0, 1.1), GoogLeNet, ShuffleNet, ResNeXt (50_32x4d, 101_32x8d), and Wide ResNets (50_2, 101_2). All models were initially fine-tuned for 3 epochs and evaluated on a validation split (20% of 18,366 images), repeated three times with random splits. The best-performing model (VGG19 with batch normalization, pre-trained on ImageNet) was selected by balancing performance against complexity.

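The cropping step described under Preprocessing can be illustrated with a small sketch. It assumes a circle (centre and radius, in pixels) has already been found, for example by a Hough Circle Transform, and only shows how a square crop box might be derived from it. The function name onh_crop_box and the margin factor of 1.5 are hypothetical choices for illustration; the source does not state how much context around the disc was kept.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def onh_crop_box(cx: int, cy: int, radius: int,
                 width: int, height: int, margin: float = 1.5) -> Box:
    """Square crop box around a detected ONH circle, kept inside the image.

    `margin` is a hypothetical factor that scales the radius so the crop
    keeps some peripapillary context around the disc.
    """
    half = int(round(radius * margin))
    # The square cannot be larger than the image itself.
    half = min(half, width // 2, height // 2)
    # Shift the centre inward if the box would cross an image border.
    cx = min(max(cx, half), width - half)
    cy = min(max(cy, half), height - half)
    return (cx - half, cy - half, cx + half, cy + half)


if __name__ == "__main__":
    print(onh_crop_box(100, 100, 40, 1000, 800))  # (40, 40, 160, 160)
    print(onh_crop_box(10, 10, 40, 1000, 800))    # (0, 0, 120, 120)
```

Because the box is always square, the later resizing to 512×512 and 224×224 would not distort the aspect ratio of the disc region.
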
Fine-tuning: The vgg19_bn was further fine-tuned for 15 epochs until validation loss plateaued, then trained using the one-cycle policy, with early stopping when validation loss failed to improve for two consecutive epochs.

Data cleaning and class imbalance: Fastai’s ImageClassifierCleaner was used iteratively to review and remove likely mislabeled or problematic images (~1% initially; ~7% cumulatively across seven shuffles/training cycles). The final cleaned dataset exhibited class imbalance (59.7% healthy; 40.3% glaucoma); a weighted cross-entropy loss was used during training.

Interpretability: Grad-CAM was applied to visualise salient regions influencing predictions, expected to focus on ONH and RNFL regions for glaucomatous cases.

Evaluation and statistics: Performance metrics included AUROC, sensitivity (recall), specificity, accuracy, precision, and F1-score. Bootstrap resampling (4,000 iterations) with replacement was used to estimate 95% confidence intervals. External validation was performed on Drishti-GS, which was excluded from training/validation.

Computing environment: Training was conducted on an Ubuntu 22.04 VM with an NVIDIA A100 (40 GB).

Software: Python 3.10.6; PyTorch 2.0.0+cu117; Fastai 2.7.12; TorchVision 0.15.1+cu117; Matplotlib 3.5.1; scikit-learn 1.2.2.
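
The bootstrap procedure mentioned under Evaluation and statistics can be sketched as follows. This is a minimal illustration that assumes the percentile method and uses accuracy as the metric; the source states only that 4,000 resamples with replacement were used, not which interval construction was applied. The labels and predictions are toy values, not study data.

```python
import random
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true: Sequence[int], y_pred: Sequence[int],
                 n_iter: int = 4000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_iter):
        # Resample case indices with replacement, keeping label and
        # prediction paired for each sampled case.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, round(alpha / 2 * n_iter))
    hi_idx = min(n_iter - 1, round((1 - alpha / 2) * n_iter) - 1)
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Toy data: 1 = glaucoma, 0 = healthy (illustrative only).
    y_true = [1] * 50 + [0] * 50
    y_pred = [1] * 45 + [0] * 5 + [0] * 47 + [1] * 3
    print(accuracy(y_true, y_pred), bootstrap_ci(y_true, y_pred))
```

The same resampling loop would apply to AUROC or any other metric by swapping out the accuracy function.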

Key Findings
  • The best-performing architecture was vgg19_bn (pre-trained on ImageNet).
  • On internal validation, the model achieved AUROC 0.9920 (95% CI: 0.9920–0.9921) for both glaucoma and healthy classes. Accuracy 0.9671; glaucoma-class sensitivity 0.9530 and specificity 0.9768; healthy-class sensitivity 0.9768 and specificity 0.9530; precision 0.9654 (glaucoma) and 0.9683 (healthy); F1-score 0.9592 (glaucoma) and 0.9725 (healthy).
  • Misclassifications on validation included 47 false positives (healthy predicted as glaucoma) and 65 false negatives (glaucoma predicted as healthy).
  • External validation on Drishti-GS achieved AUROC 0.8751 and accuracy 0.8713, indicating some performance drop on unseen data but demonstrating notable generalisability.
  • Grad-CAM visualisations highlighted clinically relevant regions (optic nerve head, neuroretinal rim), supporting model interpretability.

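
The per-class figures above follow from a single 2×2 confusion matrix, which is why glaucoma-class sensitivity equals healthy-class specificity and vice versa. The sketch below shows the arithmetic with toy counts chosen for illustration; they are not the study's counts, and the function name binary_metrics is hypothetical.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Screening metrics for the class treated as 'positive'."""
    sensitivity = tp / (tp + fn)   # recall of the positive class
    specificity = tn / (tn + fp)   # recall of the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Toy counts with glaucoma as the positive class (illustrative only).
    tp, fp, fn, tn = 90, 5, 10, 95
    glaucoma = binary_metrics(tp, fp, fn, tn)
    # Treating healthy as positive swaps the roles of the four cells.
    healthy = binary_metrics(tp=tn, fp=fn, fn=fp, tn=tp)
    print(glaucoma["sensitivity"], healthy["specificity"])  # 0.9 0.9
    print(glaucoma["specificity"], healthy["sensitivity"])  # 0.95 0.95
```

Accuracy is the same from either class's point of view, while precision and F1 differ because they depend on which class counts as positive.
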
Discussion

The model effectively addressed the need for a generalisable glaucoma screening tool by training on a heterogeneous, multi-dataset, multi-ethnic set of fundus images and systematically selecting a high-performing architecture. High AUROC and balanced sensitivity/specificity indicate suitability for screening where both missed cases and unnecessary referrals are concerning. Interpretability analyses (Grad-CAM) showed emphasis on clinically meaningful structures, aligning with expert assessment and supporting clinical trust. Compared to prior models often trained on single datasets, this approach improved generalisability; however, performance declined on external data (Drishti-GS), likely due to heterogeneity in labeling protocols, image quality, and device differences across public datasets. Error analysis suggested challenges with borderline/early cases and possible mislabeling in some datasets, emphasizing the importance of high-quality, clinically verified ground truth. Overall, findings support the feasibility of deploying a robust, transparent, and scalable fundus-based AI screening tool, while underscoring the need for broader validation and standardised labeling.

Conclusion

A generalised deep-learning model (vgg19_bn) for glaucoma screening using fundus images achieved high discriminative performance and balanced metrics on internal validation, with acceptable performance on external data. The approach demonstrates potential for population-level screening to support clinical decision-making, reduce unnecessary referrals, and improve early detection. Future work should: (1) validate across multiple external datasets, diverse populations (including underrepresented regions such as Africa) and imaging devices; (2) standardise and enhance ground-truth labeling using comprehensive clinical examinations; (3) integrate clinical variables (e.g., IOP, visual fields, history) to improve predictive power; and (4) develop strategies to better handle borderline cases and reduce sensitivity to artifacts/noise.

Limitations
  • Potential mislabeling within public datasets; although iterative data cleaning removed ~1,306 images (~7%), residual noise may remain.
  • Class imbalance (59.7% healthy, 40.3% glaucoma) was addressed with a weighted loss but may still influence performance.
  • Data augmentation may not perfectly reflect real-world clinical image distributions.
  • Grad-CAM occasionally highlighted non-relevant regions, indicating possible attention to artifacts.
  • Demographic imbalance: Non-Caucasians comprised most healthy cases, while Caucasians represented a substantial fraction of glaucoma cases; Hispanic individuals were underrepresented.
  • Lack of publicly available datasets from Africa limited geographic diversity.
  • External validation was performed only on Drishti-GS; broader external testing is needed.
  • No integration of clinical data (e.g., IOP, visual fields), which could enhance performance.
  • Heterogeneous and inconsistent glaucoma classification criteria across datasets likely contributed to performance variability and the external performance drop.
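
The weighted loss mentioned in the class-imbalance item can be illustrated with a small sketch. It assumes inverse-frequency class weights derived from the reported 59.7%/40.3% split; the source does not say how the weights were actually chosen, so the weighting scheme and both function names are assumptions. The loss is normalised by the sum of the weights, which is also how PyTorch's CrossEntropyLoss averages when class weights are supplied.

```python
import math
from typing import Dict, List


def inverse_frequency_weights(fractions: Dict[str, float]) -> Dict[str, float]:
    """Assumed scheme: weight each class by 1 / (n_classes * class fraction)."""
    k = len(fractions)
    return {name: 1.0 / (k * frac) for name, frac in fractions.items()}


def weighted_cross_entropy(probs: List[Dict[str, float]], labels: List[str],
                           weights: Dict[str, float]) -> float:
    """Weighted mean of -log p(true class), normalised by the weight sum."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    weights = inverse_frequency_weights({"healthy": 0.597, "glaucoma": 0.403})
    print(weights)  # the minority (glaucoma) class gets the larger weight
    # Toy predicted probabilities for three images (illustrative only).
    probs = [{"healthy": 0.9, "glaucoma": 0.1},
             {"healthy": 0.3, "glaucoma": 0.7},
             {"healthy": 0.6, "glaucoma": 0.4}]
    labels = ["healthy", "glaucoma", "glaucoma"]
    print(weighted_cross_entropy(probs, labels, weights))
```

With these weights, an error on a glaucoma image costs roughly 1.5 times as much as the same error on a healthy image, which counteracts the majority class dominating the gradient.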