Melanoma, a deadly form of skin cancer, is increasing globally. Early detection significantly improves patient outcomes. Traditional diagnosis relies on visual inspection and dermoscopy, and its accuracy varies with the dermatologist's expertise. The rising demand for dermatologists and the difficulty of diagnosing atypical cases call for innovative diagnostic approaches. Retrospective studies have shown promise for AI in melanoma detection, but prospective studies with large, diverse datasets that could confirm these findings and assess generalizability are lacking. This study addresses this gap by prospectively evaluating an established AI algorithm in a real-world clinical setting.
Literature Review
Numerous retrospective studies have reported that, in artificial settings, AI algorithms can match or even surpass the diagnostic accuracy of experienced dermatologists for melanoma detection. However, these studies often suffer from small sample sizes, homogeneous datasets, and the exclusion of rare melanoma subtypes, which prevents a comprehensive evaluation of how well AI generalizes to clinical application. Previous research has therefore highlighted the need for prospective studies that evaluate AI performance in diverse real-world settings, with larger datasets and representation of various melanoma subtypes.
Methodology
This prospective, multicenter study involved eight German university hospitals. After informed consent, six dermoscopic images of each melanoma-suspicious lesion were captured, introducing deliberate variations in orientation, position, and dermatoscope settings. Lesions were histopathologically diagnosed by experienced pathologists. The study used the ADAE algorithm, a pre-trained binary melanoma classifier that had performed well in previous challenges. Real test-time augmentation (R-TTA) was implemented by averaging predictions from multiple images per lesion. The study compared ADAE's performance to that of dermatologists using balanced accuracy, sensitivity, and specificity, utilizing a pairwise two-sided Wilcoxon signed-rank test for statistical analysis. The dataset included images from 1716 patients with 1910 skin lesions, encompassing various melanoma subtypes, anatomical locations, and camera setups. Patient characteristics such as age, sex, Fitzpatrick skin type, and lesion size were recorded. Subgroup analyses explored performance differences based on factors like patient age, lesion location, lesion subtype, and dermatologist confidence in their diagnosis.
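The averaging step behind R-TTA and the three reported metrics can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's implementation: the function names, the toy probabilities, and the 0.5 decision threshold are all assumptions made for the example, and the actual ADAE model and its operating threshold are not reproduced here.

```python
from statistics import mean

def rtta_score(image_probs):
    """Average the per-image melanoma probabilities of a single lesion."""
    return mean(image_probs)

def classify(lesion_probs, threshold=0.5):
    """Label a lesion as melanoma when its averaged score reaches the threshold.

    The threshold of 0.5 is a hypothetical value chosen for this example.
    """
    return [rtta_score(probs) >= threshold for probs in lesion_probs]

def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for boolean labels and predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred)
    return (sens + spec) / 2

# Toy data (invented): six images per lesion, as in the study protocol.
lesion_probs = [
    [0.90, 0.80, 0.70, 0.95, 0.85, 0.90],  # melanoma, clearly detected
    [0.45, 0.60, 0.55, 0.70, 0.40, 0.60],  # melanoma, borderline on single images
    [0.10, 0.20, 0.15, 0.05, 0.10, 0.20],  # nevus, correctly rejected
    [0.60, 0.70, 0.40, 0.55, 0.65, 0.50],  # nevus, false positive
]
y_true = [True, True, False, False]  # histopathological ground truth

y_pred = classify(lesion_probs)
sens, spec = sensitivity_specificity(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(y_pred, sens, spec, bal_acc)
```

In the second toy lesion, two of the six individual images fall below the threshold, but the averaged score does not, which is the kind of per-image variation that averaging over several real photographs is meant to smooth out.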
Key Findings
ADAE achieved a higher overall balanced accuracy (0.798) than dermatologists (0.781; p = 4.02e-145). ADAE exhibited significantly higher sensitivity (0.922 vs. 0.734; p = 3.38e-165) but lower specificity (0.673 vs. 0.828; p = 3.3e-165) at an 85% sensitivity threshold. ADAE significantly outperformed dermatologists in detecting nodular melanoma and showed higher specificity for certain nevi types. Performance varied across hospitals: ADAE performed significantly worse in one hospital (hospital 3) due to factors such as lesion size, pigmentation, patient age, and skin type distribution. ADAE performed better than dermatologists on lesions where dermatologists expressed lower confidence, suggesting a synergistic potential. Lesion diameter and pigmentation influenced ADAE's performance more than dermatologists' performance. Apart from hospital 3, the algorithm's performance was generally consistent across technical domains and data sources.
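As a quick consistency check, balanced accuracy for a binary classifier is the mean of sensitivity and specificity, so the reported figures can be recombined directly. The short sketch below does only that arithmetic on the numbers quoted above. It assumes the overall values were computed on the pooled data; averages taken per reader or per hospital could differ slightly.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity for a binary classifier."""
    return (sensitivity + specificity) / 2

# Sensitivity and specificity as reported in the findings above.
adae = balanced_accuracy(0.922, 0.673)
dermatologists = balanced_accuracy(0.734, 0.828)

print(f"ADAE: {adae:.4f}, dermatologists: {dermatologists:.4f}")
```

The results, about 0.7975 and 0.781, agree with the reported balanced accuracies of 0.798 and 0.781 up to rounding.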
Discussion
This study demonstrates that the ADAE algorithm can outperform dermatologists in terms of balanced accuracy and sensitivity in a prospective, multicenter setting using a large, heterogeneous dataset. The higher sensitivity indicates a reduced risk of missing melanomas, whereas the lower specificity indicates an increased risk of false positives. The performance differences across subgroups, especially the underperformance in hospital 3, underscore the need for further algorithm refinement to address factors such as lesion size, pigmentation, patient age, and skin type. The algorithm's superior performance in cases where dermatologists were less confident suggests a potential for synergistic use: AI could assist dermatologists in difficult cases, potentially improving diagnostic accuracy. Future studies should focus on improving specificity and addressing the algorithm's limitations.
Conclusion
The ADAE algorithm shows promise for supporting dermatologists in melanoma diagnosis, especially in challenging cases. Its strengths in sensitivity and in performance on difficult cases, when combined with dermatologist expertise, could lead to improved early detection of melanoma. However, future research should focus on improving specificity, addressing performance variations across different patient populations and hospitals, and enhancing the explainability of the algorithm to ensure trust and acceptance in clinical practice.
Limitations
The study's limitations include the geographic restriction to German hospitals, which may limit generalizability to populations with different ethnicities and skin types. The binary classification (melanoma vs. non-melanoma) simplifies the clinical reality. Further investigation is needed into the algorithm's explainability and its impact on clinical workflow and decision-making. The computing costs associated with the ensemble method were not evaluated.