Medicine and Health

Prospective multicenter study using artificial intelligence to improve dermoscopic melanoma diagnosis in patient care

This prospective multicenter study reports that the ADAE algorithm outperformed dermatologists in balanced accuracy and sensitivity for melanoma detection. The research, spanning eight hospitals and several camera setups, points to a role for AI in supporting dermatological practice, particularly in complex cases.

Introduction

Melanoma is the leading cause of skin cancer mortality and has increased in incidence. Early detection reduces surgical extent and the need for systemic therapies. Diagnosis traditionally relies on clinical examination and dermoscopy, with accuracy dependent on clinician expertise. Given rising demand for dermatologic expertise and limited availability, new diagnostic tools are needed, especially for atypical or uncommon cases. AI systems have shown dermatologist-level performance in retrospective settings, but prospective validations remain scarce and often limited by small, homogeneous datasets and lack of rare subtypes. This study prospectively evaluates an established AI algorithm (ADAE) for melanoma detection in a multicenter, heterogeneous clinical setting to assess diagnostic accuracy and generalizability compared with dermatologists.

Methodology

Study design: Prospective multicenter study across eight German university hospitals with IRB approvals (e.g., TU Dresden BO-EK-53012021; FAU Erlangen-Nürnberg 29_21 B; Univ. Duisburg-Essen 202-9784-BO; Univ. Regensburg 2010-318N-M and 20-210-19; LMU Munich 21-0128; Univ. Würzburg 293/20-2). Conduct adhered to the Declaration of Helsinki. Charité Berlin did not require a specific IRB approval because data collection followed established clinical codes.
Participants: Adults (≥18 years) with clinically melanoma-suspicious lesions. Exclusions: prior biopsy of the target lesion, lesions near the eye or beneath the granulation/areola, or presence of person-identifying features (e.g., tattoos) near the lesion.
Data collection and imaging: After informed consent and dermatoscopic examination, melanoma-suspicious lesions were excised and histopathologically diagnosed by at least one experienced (dermato)pathologist at each hospital (reference standard). For each lesion, dermatologists captured six dermoscopic images with deliberate random variation in orientation/angle, position, and dermatoscope mode (polarized and nonpolarized), while avoiding known artifacts (e.g., skin markings). Four distinct hardware configurations were used consistently across centers.
Algorithm and evaluation: ADAE, a ready-to-use open-source ensemble binary classifier for melanoma vs. non-melanoma (ranked first in SIIM-ISIC Challenge 2020), was used without additional training. Real test-time augmentation (R-TTA) was applied by classifying each image of a lesion individually and averaging the predictions to produce a single per-lesion prediction. Dermatologists’ clinical diagnoses (with access to metadata) were recorded alongside self-rated diagnostic confidence (1–5). Primary outcome: balanced accuracy; secondary outcomes: sensitivity and specificity. A predefined operating point targeting 85% sensitivity was used for the AI.
Statistical analysis: Pairwise two-sided Wilcoxon signed-rank tests were used, with p<0.05 considered significant. Subgroup analyses assessed performance across hospitals, technical setups, patient age, sex, lesion location, diameter, Fitzpatrick skin type, melanoma subtype, and confidence ratings of both dermatologists and AI.
Dataset: Images were collected prospectively from April 2021 to March 2023 and comprised 1910 lesions (11460 images; six per lesion) from 1716 patients. The test/evaluation set summarized in the results included 1571 lesions (653 melanomas; 918 non-melanomas), encompassing rare melanoma subtypes (e.g., Spitz, desmoplastic) and special anatomical sites.
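
The R-TTA step described in the methodology (classify each image of a lesion individually, then average) can be illustrated with a minimal Python sketch. This is not the ADAE implementation: the function names, the six example probabilities, and the threshold value are hypothetical placeholders chosen for illustration, and the study's actual operating point is not stated in this summary.

```python
from statistics import mean


def lesion_score(image_probs):
    """Average per-image melanoma probabilities into one per-lesion score."""
    if not image_probs:
        raise ValueError("a lesion needs at least one image prediction")
    return mean(image_probs)


def classify_lesion(image_probs, threshold):
    """Return True (melanoma) when the averaged score reaches the threshold."""
    return lesion_score(image_probs) >= threshold


# Hypothetical classifier outputs for the six images of one lesion
# (illustrative values only, not study data).
example_probs = [0.62, 0.48, 0.71, 0.55, 0.40, 0.66]

# Placeholder threshold (assumption); in the study the operating point was
# predefined to target 85% sensitivity, but its numeric value is not given here.
HYPOTHETICAL_THRESHOLD = 0.5

score = lesion_score(example_probs)
is_melanoma = classify_lesion(example_probs, HYPOTHETICAL_THRESHOLD)
print(f"per-lesion score: {score:.2f}, flagged as melanoma: {is_melanoma}")
```

Averaging over several real captures of the same lesion means that a single unfavorable angle or dermatoscope mode has less influence on the final per-lesion decision than it would with one image.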

Key Findings

Overall performance (1571 lesions; 653 melanomas; 918 non-melanomas):

Balanced accuracy: ADAE 0.798 (95% CI 0.779–0.814) vs. dermatologists 0.781 (95% CI 0.760–0.802); the difference was reported as highly significant.
Sensitivity: ADAE 0.922 (95% CI 0.900–0.942) vs. dermatologists 0.734 (95% CI 0.701–0.770).
Specificity: ADAE 0.673 (95% CI 0.641–0.702) vs. dermatologists 0.828 (95% CI 0.804–0.852).
Detections: ADAE detected 602/653 melanomas (0.922); dermatologists detected 479/653 (0.734). At least one of the two (AI or dermatologist) detected 623/653 melanomas (0.954). ADAE correctly classified 618/918 non-melanomas (0.673); at least one of the two correctly identified 833/918 non-melanomas (0.907).

Subgroup highlights:

Age <35 years: Balanced accuracy ADAE 0.890 vs. dermatologists 0.767.
Head/neck lesions: ADAE 0.775 vs. dermatologists 0.660.
Palms/soles: ADAE 0.649 vs. dermatologists 0.798 (dermatologists better).
Hospitals: ADAE outperformed dermatologists in balanced accuracy at 5/7 hospitals. It performed worse at Hospital 1 and had notably lower specificity at Hospital 3 (0.333 vs. 0.593 for dermatologists), although its sensitivity at Hospital 3 remained high (0.897 vs. 0.923 for dermatologists).
Dermatologists’ confidence: ADAE surpassed dermatologists on cases with low to moderate dermatologist confidence (ratings 1–4), but lagged when dermatologist confidence was 5.
AI confidence: When AI confidence ≥2, ADAE’s balanced accuracy exceeded dermatologists’; when AI confidence =1, ADAE underperformed.
Diameter/pigmentation: ADAE performed worse for lesions <6 mm and for non-pigmented lesions, whereas dermatologist performance was more consistent across sizes and pigmentation.
Melanoma subtypes: ADAE showed significantly higher sensitivity for nodular melanoma than dermatologists; higher specificity noted for certain benign subtypes (e.g., dysplastic/Clark nevi, acral nevus) in specific analyses.
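
The overall performance figures at the start of this section can be cross-checked from the reported counts, since balanced accuracy is the mean of sensitivity and specificity. The short Python sketch below does this for the evaluation set of 653 melanomas and 918 non-melanomas. The helper names are illustrative, and the dermatologists' specificity is entered as the reported rate because its raw count is not given in this summary.

```python
def rate(hits, total):
    """Fraction of correctly classified lesions in one class."""
    return hits / total


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the unweighted mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


MELANOMAS = 653
NON_MELANOMAS = 918

adae_sens = rate(602, MELANOMAS)      # melanomas detected by ADAE
adae_spec = rate(618, NON_MELANOMAS)  # non-melanomas correctly classified by ADAE
derm_sens = rate(479, MELANOMAS)      # melanomas detected by dermatologists
derm_spec = 0.828                     # reported rate; raw count not available here
either_sens = rate(623, MELANOMAS)    # melanomas detected by AI or dermatologist

adae_bacc = balanced_accuracy(adae_sens, adae_spec)
derm_bacc = balanced_accuracy(derm_sens, derm_spec)

print(f"ADAE balanced accuracy:          {adae_bacc:.3f}")
print(f"Dermatologist balanced accuracy: {derm_bacc:.3f}")
print(f"Melanomas caught by either:      {either_sens:.3f}")
```

The recomputed values round to 0.798, 0.781, and 0.954, matching the reported figures. Note that balanced accuracy weights both classes equally, so ADAE's gain in sensitivity is largely offset by its lower specificity.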

Discussion

ADAE demonstrated higher balanced accuracy and markedly higher sensitivity than dermatologists in a challenging, prospectively collected multicenter dataset of melanoma-suspicious lesions, at the expense of lower specificity. These results suggest AI can reduce missed melanomas, a clinically critical outcome, though with more false positives. Complementary strengths emerged: ADAE performed better in younger patients and head/neck lesions and excelled when dermatologists reported low confidence, indicating potential for AI to support clinicians in difficult cases. Conversely, dermatologists outperformed ADAE on palms/soles and in certain domains (e.g., Hospital 3), underscoring domain shifts related to lesion size, pigmentation, and demographic/technical factors that can degrade AI specificity. The complementary error patterns imply that a synergistic workflow, combining AI triage or decision support with dermatologists’ expertise, may improve overall detection while balancing false positives. The findings align with prior work showing benefits from AI–clinician collaboration and highlight the importance of evaluating generalization across clinical and technical domains.

Conclusion

In a prospective multicenter clinical setting, ADAE achieved higher balanced accuracy and substantially higher sensitivity than dermatologists but lower specificity. These complementary strengths suggest AI could serve as effective decision support, particularly in difficult or low-confidence cases, to enhance melanoma detection. Future work should improve specificity, address generalization across diverse populations and technical settings, incorporate explainability to meet regulatory and user needs, assess computational costs, and evaluate real-world impact on clinical management and outcomes.

Limitations

Generalizability: All centers were in Germany; skin types V–VI were underrepresented, limiting applicability to more diverse populations.
Task scope: Binary classification (melanoma vs. non-melanoma) does not reflect the full spectrum of clinical differentials.
Metrics and operational aspects: The study focused on diagnostic accuracy; it did not assess downstream management decisions, workflow impact, computing costs, or explainability.
Domain shift sensitivity: Performance varied by hospital, lesion size (<6 mm), pigmentation (non-pigmented), and certain anatomical sites (palms/soles), with notably reduced specificity in one center, indicating susceptibility to technical and population differences.
Potential dataset imbalances and rare subtype distribution may influence subgroup estimates.
