Introduction
Glaucoma, a major cause of irreversible blindness worldwide, affects tens of millions of people. Early detection is crucial, but current methods, such as intraocular pressure (IOP) measurement, optical coherence tomography (OCT), and visual field testing, are often limited by accessibility, cost, and subjective interpretation, particularly in resource-constrained settings. These limitations lead to delayed diagnosis and greater vision loss, while the high false-positive rates of current screening impose unnecessary anxiety and burden on patients. Innovative approaches are therefore needed to improve glaucoma screening, especially in low- and middle-income countries.

Computer vision, powered by deep learning, offers a promising avenue for improving the accuracy, efficiency, and accessibility of glaucoma detection. Fundus imaging, a non-invasive and widely available technique, provides rich data for analyzing the optic nerve head (ONH), a key indicator of glaucoma, and deep learning algorithms can automatically extract the relevant features from these images. However, most existing deep learning models for glaucoma detection are trained on limited datasets from specific populations or regions, hindering their generalizability across diverse settings. This study addresses that limitation by developing and validating a generalized deep-learning algorithm for glaucoma screening using a large, diverse dataset of fundus images from multiple global sources.
Literature Review
Numerous studies have explored the application of deep learning to glaucoma detection using fundus images. These studies have shown varying degrees of success, often relying on datasets from single institutions or specific populations, thus limiting the generalizability of the models. Some studies have reported high accuracy in detecting glaucoma using Convolutional Neural Networks (CNNs) such as ResNet, DenseNet, and VGG, but these models frequently lack robustness when tested on datasets different from those used during training. The need for a model that can perform well across diverse populations and imaging modalities remains a significant challenge in the field. This study aims to build upon previous research by leveraging a significantly larger and more diverse dataset to develop a more generalizable model.
Methodology
This study pooled 20 publicly accessible glaucoma datasets from diverse global cohorts, totaling 117,152 fundus images. After excluding ungradable and non-referral images, the final dataset comprised 18,468 images (10,900 healthy, 7,568 glaucomatous). Images were pre-processed with OpenCV to isolate the ONH and resized to 224x224 pixels, and data augmentation (random rotation, cropping, flipping, and lighting adjustments) was applied to enrich the training data and reduce overfitting. Twenty pre-trained deep-learning architectures were compared, including several ResNet, VGG, and DenseNet variants; model selection was based on validation-set performance after three epochs of fine-tuning. The best-performing model, vgg19_bn, was then fine-tuned for 15 epochs using a one-cycle learning-rate policy. The dataset was cleaned with fastai's ImageClassifierCleaner, removing misclassified images (~7%), and a weighted cross-entropy loss function was used to counteract class imbalance. Model performance was evaluated using AUROC, sensitivity, specificity, accuracy, precision, and F1-score, with 95% confidence intervals obtained by bootstrap resampling. Grad-CAM was used to visualize the model's decision-making process. The Drishti-GS dataset served as an independent external validation set.
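The class-weighting step described above can be sketched in isolation. The following is a minimal NumPy illustration, not the study's fastai implementation; the example probabilities are hypothetical, and the counts mirror the 10,900 healthy vs. 7,568 glaucomatous split:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted cross-entropy: each sample's loss is scaled by the weight
    of its true class, so errors on the rarer class count for more."""
    eps = 1e-12
    picked = probs[np.arange(len(labels)), labels]  # p(true class) per sample
    weights = class_weights[labels]                 # weight of each true class
    return float(np.mean(-weights * np.log(picked + eps)))

# Class counts matching the study's final dataset (healthy, glaucoma).
counts = np.array([10_900, 7_568])
# Inverse-frequency weights, normalised so they average to 1 across classes;
# the rarer glaucoma class receives the larger weight.
class_weights = counts.sum() / (len(counts) * counts)

# Hypothetical softmax outputs for two samples.
probs = np.array([[0.9, 0.1],   # confident, correct "healthy" prediction
                  [0.4, 0.6]])  # borderline, correct "glaucoma" prediction
labels = np.array([0, 1])
loss = weighted_cross_entropy(probs, labels, class_weights)
```

Because the glaucoma class is smaller, its weight is larger, so the uncertain glaucoma prediction dominates the averaged loss; this is the mechanism by which weighting discourages the model from simply favouring the majority class.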
Key Findings
The vgg19_bn model exhibited exceptional performance on the training dataset, achieving an AUROC of 0.9920 (95% CI: 0.9920–0.9921) for both glaucoma and healthy classes. Sensitivity, specificity, accuracy, precision, recall, and F1-scores were consistently above 0.9530. External validation on the Drishti-GS dataset, which was not included in the training process, resulted in an AUROC of 0.8751 and an accuracy of 0.8713. Analysis of the model's errors revealed that misclassifications often involved borderline cases or images with features that could be misinterpreted as glaucoma. Grad-CAM visualizations demonstrated that the model focused on clinically relevant regions such as the optic disc and cup when making classification decisions. The confusion matrix (Figure 1) and performance metrics (Table 2) provided a comprehensive assessment of the model's performance, revealing its high accuracy in distinguishing glaucomatous from healthy fundus images across diverse datasets. Although performance on the unseen Drishti-GS dataset was slightly lower than on the training dataset, the overall results indicate that the model can identify features associated with glaucoma with high accuracy.
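The AUROC point estimates and bootstrap confidence intervals reported above can be computed as follows. This is a generic sketch with synthetic scores, not the study's evaluation code; the score distributions and sample sizes are hypothetical:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case scores higher than a random negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()  # ties count half
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute the
    metric, and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    n, stats = len(labels), []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # a resample needs both classes for AUROC to be defined
        stats.append(auroc(scores[idx], labels[idx]))
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))

# Synthetic scores: glaucomatous eyes tend to receive higher scores.
rng = np.random.default_rng(1)
labels = np.array([0] * 100 + [1] * 100)
scores = np.concatenate([rng.normal(0.3, 0.15, 100),
                         rng.normal(0.7, 0.15, 100)])
point = auroc(scores, labels)
lo, hi = bootstrap_ci(scores, labels)
```

The very narrow interval reported for the training data (0.9920–0.9921) is consistent with this procedure applied to a large sample: bootstrap interval width shrinks as the number of resampled cases grows.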
Discussion
This study demonstrates the potential of a generalized deep-learning model for improving glaucoma screening. The high AUROC and other performance metrics achieved by the vgg19_bn model on both the training and validation datasets indicate its potential as a valuable tool for assisting ophthalmologists and optometrists in diagnosis. The model's ability to generalize across diverse datasets, albeit with a slight drop in performance on the external validation set, underscores the importance of training medical deep learning models on large, diverse datasets. The observed inconsistencies between datasets may stem from variations in image quality, acquisition protocols, and diagnostic criteria for glaucoma, suggesting that further gains are possible with larger, more standardized datasets. Grad-CAM visualizations provided valuable insight into the model's decision-making, adding transparency and helping to build clinician trust. This work represents a significant step towards more efficient, accessible, and accurate glaucoma screening, especially in settings with limited resources.
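The Grad-CAM visualizations discussed above reduce to a simple computation: weight each convolutional feature map by the spatially averaged gradient of the target class score, sum over channels, and rectify. The sketch below shows that core step with random stand-in arrays rather than a real network's tensors:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM core computation.

    activations, gradients: arrays of shape (channels, height, width) holding
    a conv layer's outputs and the gradient of the class score w.r.t. them.
    Returns a (height, width) heat map scaled to [0, 1]."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k: GAP of gradients
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalise for display
    return cam

# Stand-in tensors for an 8-channel, 7x7 feature map.
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.normal(size=(8, 7, 7))
heatmap = grad_cam(acts, grads)
```

In practice the low-resolution map is upsampled to the input image size and overlaid on the fundus photograph, which is how the study's visualizations show attention concentrating on the optic disc and cup.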
Conclusion
This study successfully developed and validated a generalized deep-learning model for glaucoma screening using fundus images. The model demonstrated high accuracy and generalizability, although performance varied slightly across datasets. Further research should focus on addressing dataset inconsistencies, incorporating additional clinical data (IOP, visual field data), and validating the model on larger, more diverse datasets, including underrepresented populations, to ensure robustness and reliability across various clinical settings. The findings highlight the promising potential of computer vision for improving global glaucoma screening.
Limitations
The study acknowledges several limitations. Inconsistencies in glaucoma classification criteria across the source datasets may have affected model performance. Class imbalance was addressed with a weighted loss, but could still influence the model's effectiveness. External validation was limited to a single dataset, preventing a complete assessment of generalizability, and the data augmentation used may not fully capture the variability of real-world clinical images. The dataset's ethnic representation was also imbalanced, with certain groups underrepresented, which could have affected performance and generalizability. Finally, the model did not integrate clinical data such as visual field information. Further refinement and testing on larger, more diverse datasets are essential to fully realize the model's potential for broader clinical application.