
Medicine and Health

Dermatologist-like explainable AI enhances trust and confidence in diagnosing melanoma

T. Chanda, K. Hauser, et al.

Explore how an explainable AI (XAI) system can transform melanoma diagnosis. Conducted by a team including Tirtha Chanda and Katja Hauser, this study shows how precise, domain-specific explanations enhance dermatologists' confidence and trust in AI-powered diagnostic tools.
Introduction

Melanoma accounts for most skin cancer-related mortality, and early detection is crucial but challenging due to overlap with benign lesions. While deep neural network (DNN)-based diagnostic support improves accuracy in dermoscopy, a lack of transparency impedes clinical adoption and conflicts with regulatory requirements (e.g., the GDPR). Clinicians need case-level, domain-specific explanations to assess AI outputs and to mitigate risks from spurious correlations. The study aims to develop a dermatology-aligned explainable AI that closes this interpretability gap by localizing and naming clinically established dermoscopic features, and to evaluate its effect on clinicians' diagnostic accuracy, confidence, and trust in a controlled reader study.

Literature Review

Two main XAI branches are discussed: (1) post hoc explanations (e.g., CAM, Grad-CAM, LRP) that are broadly applicable but risk unfaithful or ambiguous explanations requiring user interpretation; and (2) inherently interpretable models (e.g., concept-based methods like TCAV, concept whitening) that can be more faithful but impose architectural/training constraints and often trade off with performance. Prior dermatology XAI work predominantly used post hoc methods; a 2022 review found only two studies with inherently interpretable methods, both providing dataset-level concept analyses rather than lesion-level clinical explanations. Recent attempts to bridge the interpretability gap include concept-vector systems trained on small, expert-annotated datasets and architectures with built-in localization and ontology-based textual explanations, but these lacked large-scale evaluations of clinician impact. The paper positions a multimodal, ontology-driven XAI to provide lesion-level visual and textual explanations tailored to dermatologists, addressing prior gaps in scale, faithfulness, and clinician-centered evaluation.

Methodology

Data and annotations: The HAM10000 dataset was used, restricted to biopsy-verified melanoma and nevus images (n=3611 images from 1981 lesions). Fourteen international board-certified dermatologists annotated each image with (a) the presence of ontology-defined characteristics for melanoma or nevus and (b) polygonal regions of interest (ROIs) localizing those characteristics. One dermatologist annotated all images; the others annotated subsets so that each lesion had at least two annotators.

Ontology: A dermoscopy feature ontology (based on pattern analysis), compiled and validated by experts, included melanoma criteria (e.g., thick reticular/branched lines, gray patterns, pseudopods/radial lines not covering the entire circumference, peripheral black dots/globules, white lines/structureless white zones, etc.) and nevus criteria (e.g., a single pattern/color, symmetric patterns/colors, monomorphic vessels, pseudopods/radial lines covering the entire circumference, etc.).

Train/val/test split: From the base set, a held-out test set of 200 unique lesions (100 melanoma, 100 nevus) with one image per lesion was created; the remaining unique lesions were split 82:18 into training (2646 images, 1460 lesions) and validation (599 images, 321 lesions) sets, ensuring no patient or lesion leakage across splits.

Model: A multimodal XAI classifier predicts ontology characteristics from dermoscopic images using an ImageNet-pretrained ResNet-50 backbone within a guided-attention architecture. The diagnosis is inferred as melanoma if at least two melanoma characteristics are detected, approximating clinical 7-point-checklist practice. The architecture comprises (i) Comp_c, a characteristic classifier, and (ii) Comp_a, a guided-attention component that aligns Grad-CAM attention with dermatologist ROIs. The loss combines cross-entropy on the characteristics (L_c) with a Dice-based attention/ROI alignment loss (L_A) between Grad-CAM maps and annotated ROIs, with weights λ_c=1 and λ_A=10. Training used class balancing and augmentations (flips, color jitter, CLAHE, shifts/rotations, resize/normalize).

Confidence calibration: Temperature scaling provided calibrated probabilities.

Explanations: For phase 3, textual explanations list the detected characteristics with certainty levels ("strong evidence" if the calibrated output exceeds 0.7, otherwise "some evidence"), plus localized polygon ROIs drawn over the top-20% attribution regions for those characteristics. If no feature exceeded 0.7, only the most certain feature was shown.

Reader study: A three-phase web-based study with 116 international clinicians (82 board-certified dermatologists, 33 residents, 1 nurse consultant) on the test images. Phase 1 (no AI): diagnose each image as nevus (leave in or excise) or melanoma, select ontology characteristics, draw ROIs, and rate confidence (1–10). Phase 2 (AI support): the same images shown with the AI diagnosis only; clinicians provided their diagnosis, confidence (1–10), and trust in the AI (1–10). Phase 3 (XAI support): the same images with the AI diagnosis plus textual and localized explanations and per-characteristic confidence; clinicians again provided diagnosis, confidence, and trust. At least two weeks separated the phases; image order was preserved, and one image was repeated to assess intra-rater variability.
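To make the training objective concrete, here is a minimal PyTorch-style sketch of the combined loss described above, assuming per-characteristic binary labels, a normalized Grad-CAM attention map per image, and binary ROI masks; the function names and the multi-label binary cross-entropy formulation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dice_loss(attn, roi, eps=1e-6):
    """Soft Dice loss between a [B, H, W] attention map (values in [0, 1])
    and a binary ROI mask of the same shape."""
    inter = (attn * roi).sum(dim=(1, 2))
    total = attn.sum(dim=(1, 2)) + roi.sum(dim=(1, 2))
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def combined_loss(char_logits, char_labels, attn, roi,
                  lambda_c=1.0, lambda_a=10.0):
    """L = lambda_c * L_c + lambda_a * L_A, with the paper's weights
    (lambda_c = 1, lambda_A = 10). char_labels must be float tensors."""
    # L_c: cross-entropy on the ontology characteristics; treated here as
    # multi-label, hence binary cross-entropy per characteristic (assumption).
    l_c = F.binary_cross_entropy_with_logits(char_logits, char_labels)
    # L_A: Dice-based alignment of Grad-CAM attention with dermatologist ROIs.
    l_a = dice_loss(attn, roi).mean()
    return lambda_c * l_c + lambda_a * l_a
```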
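Temperature scaling, used here for confidence calibration, is a standard post hoc method: a single scalar T is fit on validation logits to minimize the calibration loss, leaving the ranking of predictions unchanged. A minimal sketch follows; the class and function names are hypothetical.

```python
import torch

class TemperatureScaler(torch.nn.Module):
    """Divide logits by a single learned temperature T > 0 (post hoc calibration)."""
    def __init__(self):
        super().__init__()
        self.log_t = torch.nn.Parameter(torch.zeros(1))  # T = exp(log_t) stays positive

    def forward(self, logits):
        return logits / self.log_t.exp()

def fit_temperature(val_logits, val_labels):
    """Fit T on held-out validation logits/labels; model weights stay frozen."""
    scaler = TemperatureScaler()
    opt = torch.optim.LBFGS([scaler.log_t], lr=0.05, max_iter=100)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            scaler(val_logits), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return scaler
```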
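The decision rule (melanoma if at least two melanoma characteristics are detected) and the certainty wording ("strong evidence" above a calibrated probability of 0.7, otherwise "some evidence", falling back to the single most certain feature) can be sketched as below; the characteristic names, the 0.5 detection threshold, and the function name are illustrative assumptions.

```python
MELANOMA_CHARS = {"thick reticular lines", "grey patterns", "white lines"}  # illustrative subset

def diagnose_and_explain(probs, detect_thr=0.5, strong_thr=0.7):
    """probs: dict mapping characteristic name -> calibrated probability."""
    detected = {c: p for c, p in probs.items() if p >= detect_thr}
    # Melanoma if at least two melanoma characteristics are detected.
    diagnosis = "melanoma" if sum(c in MELANOMA_CHARS for c in detected) >= 2 else "nevus"
    shown = {c: p for c, p in detected.items() if p > strong_thr}
    if not shown and detected:
        # No feature above 0.7: show only the single most certain feature.
        top = max(detected, key=detected.get)
        shown = {top: detected[top]}
    text = [f"{'strong' if p > strong_thr else 'some'} evidence of {c}"
            for c, p in shown.items()]
    return diagnosis, text
```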
Statistical analysis: Balanced accuracy per participant; agreement rates with the AI; Sørensen–Dice coefficients (DSC) for the overlap between clinician- and XAI-selected ontology characteristics and for ROI overlap (Grad-CAM vs. human ROIs); explanation faithfulness via contrastive masking of important pixels; the Grad-CAM inside/outside-lesion attribution ratio; and comparisons with a baseline ResNet-50 classifier trained without annotations and two state-of-the-art methods (attention-based and ensemble). Tests: paired t-tests for phase comparisons; Wilcoxon signed-rank tests for non-normal ratios; Mann–Whitney U tests for high- vs. low-confidence effects; Spearman correlations for experience vs. benefit and for trust vs. explanation overlap; bootstrap CIs with 10,000 resamples; α=0.05 with Bonferroni correction.
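For reference, the two recurring quantities can be sketched as follows: the Sørensen–Dice coefficient between two sets of selected characteristics, and a percentile bootstrap confidence interval with 10,000 resamples as used throughout. Both implementations are illustrative sketches, not the authors' code.

```python
import numpy as np

def dice_sets(a, b):
    """Sørensen–Dice coefficient between two sets (e.g., selected characteristics)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, mirroring the paper's 10,000 resamples."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```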

Key Findings
  • Diagnostic performance: XAI balanced accuracy 81% (95% CI: 75.6–86.3) vs baseline ResNet-50 80% (95% CI: 74.4–85.4); comparable to attention-based approach 79% (95% CI: 73.2–84.5) and ensemble 81.5% (95% CI: 76–86.7). Across eight backbones, XAI outperformed baseline in six, with ResNet-50 highest.
  • Focus on lesion vs background: The mean Grad-CAM attribution ratio inside vs. outside the lesion was 35.9 (95% CI: 30.7–42.0) for XAI vs. 4.1 (95% CI: 3.4–4.7) for the baseline (P<0.0001), indicating fewer spurious correlations (see the ratio sketch after this list).
  • Explanation faithfulness: Contrastively masking the pixels marked as important reduced the model's output scores, supporting the faithfulness of the explanations (reported graphically).
  • Alignment with clinicians: Ontological explanation overlap (DSC) between clinicians and XAI: 0.46 (95% CI: 0.44–0.48) when both predicted melanoma; 0.23 (95% CI: 0.20–0.26) when both predicted nevus; overall 0.27 (95% CI: 0.25–0.29; n=1089 images), comparable to between-clinician overlap 0.28 (95% CI: 0.27–0.29; n=5165 pairs). ROI overlap (DSC) clinician vs XAI: 0.48 (95% CI: 0.46–0.50) vs baseline 0.39 (95% CI: 0.38–0.41), P<0.0001 (n=1120 images).
  • Reader study accuracy: AI support vs no AI increased mean balanced accuracy from 66.2% (95% CI: 63.8–68.7) to 72.3% (95% CI: 70.2–74.3), P<0.0001 (n=109). Adding XAI explanations yielded 73.2% (95% CI: 71.0–75.3), a nonsignificant change vs AI-only (P=0.34; n=116). More experienced clinicians (defined by regular scientific discussion of dermoscopic images) benefited more from XAI than from AI-only support (Spearman ρ=0.2, 95% CI: 0.02–0.37, P=0.03).
  • Agreement with AI: Human-AI diagnosis agreement increased from 77.1% (95% CI: 75–79.2) with AI-only to 79.5% (95% CI: 77.1–81.2) with XAI, mean +2.4 percentage points (95% CI: 0.65–4.2), P=0.009; on AI errors, agreement rose from 63.0% to 67.9% (mean +4.8 pp, 95% CI: −1.2 to 10.9, P=0.126).
  • Confidence: Clinician diagnostic confidence increased by 12.25% (95% CI: 9.06–15.74) with XAI vs AI-only, P<0.0001 (n=1714 image-level comparisons). In phase 3, clinician confidence was slightly higher for high- vs low-confidence AI predictions (means 7.82 vs 7.69; P=0.039); no significant difference in phase 2.
  • Trust: Trust in AI decisions increased by 17.52% (95% CI: 13.74–21.6) with XAI vs AI-only, P<0.0001 (n=1714). Trust depended strongly on clinician agreement with AI (means 7.55 vs 4.8; P<0.0001). AI confidence did not affect trust.
  • Trust–overlap correlation: When clinician and AI diagnoses agreed, trust correlated with ontological explanation overlap overall (ρ=0.087, 95% CI: 0.02–0.15, P=0.01; n=871); stronger for melanoma (ρ=0.23, 95% CI: 0.19–0.34, P<0.0001; n=567); negative for nevus (ρ=−0.10, 95% CI: −0.19 to −0.02, P=0.01; n=505), potentially due to frequent “melanoma simulator” explanations on nevi.
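As referenced in the focus-on-lesion finding above, the inside/outside attribution ratio compares Grad-CAM attribution within the lesion to attribution outside it. Below is a minimal sketch under one plausible reading (mean attribution inside divided by mean attribution outside); the exact definition and all names here are assumptions.

```python
import numpy as np

def attribution_ratio(attn, lesion_mask, eps=1e-8):
    """Mean Grad-CAM attribution inside the lesion divided by the mean outside.

    attn: 2D array of non-negative attributions; lesion_mask: binary 2D array.
    """
    mask = lesion_mask.astype(bool)
    inside = attn[mask].mean()
    outside = attn[~mask].mean()
    return float(inside / (outside + eps))
```
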
Discussion

The study shows that a dermatology-aligned XAI can preserve state-of-the-art diagnostic performance while providing faithful, localized, ontology-based explanations that align with clinicians’ reasoning and attention. Although XAI did not significantly improve diagnostic accuracy beyond AI support, it significantly increased clinicians’ diagnostic confidence and trust, key factors for adoption. Stronger lesion-focused attention and higher ROI overlap suggest learning of human-relevant features and reduced reliance on spurious context, challenging the presumed performance–interpretability trade-off. Experienced clinicians derived more benefit from XAI, while less experienced clinicians benefited primarily from AI’s raw predictions. Trust was positively associated with overlap of human and machine explanations (notably for melanoma), indicating that convergence in reasoning enhances perceived reliability. These findings support the clinical value of transparent, domain-specific explanations to meet regulatory expectations (GDPR) and to foster responsible human–AI collaboration.

Conclusion

This work introduces and evaluates a multimodal, ontology-driven XAI for dermoscopic differentiation of melanoma vs nevus that delivers dermatologist-like textual and localized visual explanations alongside calibrated confidence. The system achieves competitive accuracy, focuses attention within lesions, and aligns with clinician-selected features and ROIs. In a large reader study, XAI significantly increased clinicians’ diagnostic confidence and trust compared to AI-only support, though it did not further improve accuracy. The authors release an expert-annotated explanations dataset and open-source code to catalyze further research. Future directions include evaluating individual explanation components, addressing domain shift with multi-center data and adaptation methods, refining explanation design for clinical workflows, and exploring impacts across varying clinician expertise levels and error scenarios.

Limitations
  • Evaluation occurred under artificial reader-study conditions rather than routine clinical practice.
  • The model is explicitly guided to produce human-like, ontology-based explanations, potentially foregoing non-human-observable predictive patterns.
  • Tight coupling to a domain-specific ontology limits applicability where no standardized descriptive framework exists or where multiple terminologies coexist, which may alienate some users.
  • Effects are measured for a combined multimodal explanation; contributions of individual components (text vs localization vs confidence) were not isolated.
  • The polygon ROI visualization, threshold choices, and non-interactive display may be suboptimal for clinical use and degenerate cases.
  • Domain shift was not addressed; generalizability to images from other institutions and acquisition settings requires validation and possibly domain adaptation.