Medicine and Health
Deep learning-aided decision support for diagnosis of skin disease across skin tones
M. Groh, O. Badri, et al.
This study examines the diagnostic accuracy of dermatologists and primary care physicians as they navigate skin disease across a range of skin tones. Conducted by a team of researchers including Matthew Groh, Omar Badri, and Roxana Daneshjou, it demonstrates both the potential of deep learning decision support and the persistent accuracy gaps on darker skin that demand attention.
~3 min • Beginner • English
Introduction
The study investigates how physician–machine partnerships affect diagnostic accuracy in dermatology, with a focus on potential disparities across skin tones. Motivated by evidence that both clinicians and algorithms can make systematic errors and that human–AI collaboration is not guaranteed to outperform either alone, the authors ask: (1) What is the baseline diagnostic accuracy of specialists and generalists in a store-and-forward teledermatology setting across diverse skin tones? (2) How does deep learning system (DLS) decision support change physicians’ diagnostic performance overall and across skin tones? (3) Can DLS assistance reduce or exacerbate accuracy disparities between light and dark skin presentations? The work is important because dermatology datasets, training materials, and clinical experience often underrepresent darker skin tones, potentially leading to biased diagnostic performance and inequities in care. The study provides an empirically grounded evaluation using a large-scale, randomized digital experiment and algorithmic auditing framework.
Literature Review
Prior research shows that human–AI collaboration in medical diagnosis can underperform when clinicians have difficulty calibrating trust in opaque algorithms or when models harbor hidden biases. Dermatology has documented racial and skin tone representation gaps in textbooks, residency programs, and research datasets, with evidence that non-specialists’ diagnostic performance is lower for darker skin. Algorithmic audits of dermatology DLSs reveal systematic errors, especially on dark skin, prompting efforts to improve transparency (e.g., Fitzpatrick skin type annotations), curate more diverse datasets, and generate synthetic images to fill gaps. While early teledermatology studies suggest DLS-based support can aid generalists, open questions remain regarding performance across expertise levels and skin tones in realistic, image-only, store-and-forward contexts. This study builds on algorithmic auditing and physician error literature to systematically measure accuracy and bias in physician–machine partnerships.
Methodology
Design: A large-scale, randomized digital experiment simulating store-and-forward teledermatology. Participants (board-certified dermatologists, dermatology residents, primary-care physicians, and other physicians) viewed images and provided up to three free-text differential diagnoses, then received DLS-based decision support and could choose to (a) keep their differential, (b) add the suggested diagnosis, or (c) replace their top diagnosis with the suggestion. Outcomes were measured as top-1 (leading diagnosis), top-3 (any of the three initial differentials), and top-4 (any of the three differentials or the DLS suggestion) accuracies. Disparities across skin tones were assessed using estimated Fitzpatrick skin types (FST 1–4 light; FST 5–6 dark).
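To make these metrics concrete, here is a minimal sketch of how top-1, top-3, and top-4 correctness could be scored for a single case; the helper and its names are hypothetical, not the authors' code:

```python
def topk_correct(reference, differential, dls_suggestion=None):
    """Score one case: `differential` holds up to three free-text diagnoses, most likely first."""
    top1 = len(differential) > 0 and differential[0] == reference
    top3 = reference in differential[:3]
    # Top-4 additionally counts the case as correct when the DLS suggestion matches.
    top4 = top3 or (dls_suggestion == reference)
    return top1, top3, top4

# Example: reference label "lichen planus"; the physician lists it second,
# and the DLS suggests "secondary syphilis".
print(topk_correct("lichen planus",
                   ["psoriasis", "lichen planus", "atopic dermatitis"],
                   "secondary syphilis"))  # -> (False, True, True)
```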
Stimuli: 364 images covering 46 skin diseases, with emphasis on eight main inflammatory-appearing conditions (atopic dermatitis, CTCL, dermatomyositis, lichen planus, Lyme disease/erythema migrans, pityriasis rosea, pityriasis rubra pilaris, secondary syphilis). At least 29 images per main disease; 78% of images were drawn from these eight. Images were approximately balanced across FSTs. Extensive quality control included selection and review by three board-certified dermatologists (BCDs), removal of low-quality or questionable-label images, and post hoc review of the most and least accurately diagnosed images. An additional 80 images came from the Fitzpatrick 17k dataset for ecological validity.
Participants and recruitment: 1,118 physicians contributed 14,261 differentials overall; core accuracy analyses focus on the first ten differentials from participants who passed an attention check (an acne image) and completed ≥10 images: 296 BCDs (2,660 differentials), 83 residents (747), 350 PCPs (3,150), and 113 other physicians (1,015). Physicians resided in 39 countries; recruitment was via Sermo and direct outreach, and modest honoraria were offered.
Randomization: Two orthogonal randomizations: (1) DLS accuracy condition—control DLS (VGG-16 fine-tuned on 31,219 clinical images; 47% top-1 overall on nine labels; fair across FST) versus a Wizard-of-Oz enhanced treatment DLS synthetically correcting 65% of control errors (84% top-1), also fair across FST. (2) Interface order—control UI with buttons ordered [Update top prediction, Update differential, Keep], versus treatment UI reversing the order.
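The Wizard-of-Oz enhancement can be pictured as a post hoc relabeling of control-model errors. The helper below is a hypothetical illustration of the described 65% error-correction rate, not the study's implementation:

```python
import random

def enhanced_prediction(control_pred, reference, p_fix=0.65, rng=random):
    """If the control DLS prediction is wrong, replace it with the reference label
    with probability p_fix; otherwise keep the control prediction unchanged."""
    if control_pred != reference and rng.random() < p_fix:
        return reference
    return control_pred
```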
DLS details: Control model classified nine labels (the eight main diseases + “other”); architecture VGG-16 pretrained on ImageNet, fine-tuned on diverse clinical dermatology images with standard augmentations; evaluated only on study images. Treatment DLS preserved the same predictions with synthetic accuracy boost. Both showed similar performance across light and dark skin on main diseases (control: 56% light vs 58% dark; treatment: 84% light vs 82% dark).
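For orientation, a minimal PyTorch/torchvision sketch of the kind of transfer-learning setup described (ImageNet-pretrained VGG-16 with a nine-label head and standard augmentations); the hyperparameters and specific transforms are assumptions, not the authors' exact configuration:

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms

NUM_CLASSES = 9  # eight main inflammatory-appearing diseases + "other"

# Standard augmentations; the specific transforms and parameters are illustrative.
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load VGG-16 pretrained on ImageNet and replace the final classifier layer.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# A standard fine-tuning loop over the clinical dermatology images would follow.
```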
Procedure: After a pre-survey and instructions noting that the AI is imperfect (without disclosing exact accuracy), participants diagnosed images via free-text fields with predictive text over 445 terms to standardize synonyms and spelling variants. Confidence could be set via a slider. Non-dermatologists indicated biopsy and dermatology referral preferences; BCDs indicated biopsy referral. After initial submission, physicians saw the DLS suggestion and chose one of three actions. Feedback on the reference label and common incorrect diagnosis was given after each trial; aggregate performance feedback appeared after ten images to encourage engagement.
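The predictive-text field can be thought of as mapping entered text onto canonical diagnosis terms. The synonym entries below are hypothetical stand-ins; the study's actual 445-term vocabulary is not reproduced here:

```python
# Hypothetical synonym map standing in for the 445-term predictive-text vocabulary.
SYNONYMS = {
    "eczema": "atopic dermatitis",
    "atopic eczema": "atopic dermatitis",
    "erythema migrans": "lyme disease",
    "mycosis fungoides": "cutaneous t-cell lymphoma (ctcl)",
}

def normalize(entry: str) -> str:
    """Map a free-text entry to its canonical term (identity if no synonym is known)."""
    term = entry.strip().lower()
    return SYNONYMS.get(term, term)

print(normalize("Atopic Eczema"))  # -> "atopic dermatitis"
```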
Metrics and analysis: Accuracy was measured as top-1, top-3, and top-4. Skin tone disparities were assessed by comparing performance on FST 1–4 vs 5–6. Ordinary least-squares regressions with robust standard errors clustered at the participant level examined effects of expertise, skin tone, DLS assistance, and their interactions; additional models included interface-order interactions. Sensitivity and specificity were computed for the eight main diseases. Robustness checks included subsets (US-only, participants with fewer than ten responses, all participants passing the attention check). Ethical approval was obtained (MIT, Exempt Category 3). Data and code were made available.
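A sketch of the kind of OLS specification with participant-clustered robust standard errors, using statsmodels; the data file and column names are illustrative assumptions, not the authors' variable names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per submitted differential.
# Assumed columns: top1_correct, specialist, dark_skin, dls_assist, participant_id.
df = pd.read_csv("responses.csv")

model = smf.ols(
    "top1_correct ~ specialist + dark_skin + dls_assist + dls_assist:dark_skin",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["participant_id"]})
print(model.summary())
```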
Key Findings
- Baseline diagnostic accuracy:
- Top-3 accuracy (all images excluding attention check): BCDs 38%, residents 36%, PCPs 19%, other physicians 18%. For the eight main diseases: BCDs 37%, residents 35%, PCPs 17%, other 16%.
- Top-1 accuracy: BCDs 27%, residents 24%, PCPs 14%, other 13% (similar rates on the eight main diseases).
- Confidence positively correlates with accuracy for BCDs and PCPs (Pearson r ≈ 0.14–0.17 among non-default confidences).
- Disparities across skin tones (FST):
- Across all participants, accuracy on dark skin (FST 5–6) was four percentage points lower than on light skin (FST 1–4) for both top-1 (P<0.001) and top-3 (P=0.001).
- By group, top-1 accuracy on dark vs light was lower by: BCDs −5 pp (P=0.011), residents −5 pp (P=0.114), PCPs −3 pp (P=0.006), other −5 pp (P=0.012). Top-3 showed similar direction with varying significance.
- Disease-specific differences: For BCDs, significant top-1 drops on dark skin for atopic dermatitis (−18 pp, P=0.007), Lyme (−20 pp, P<0.001), pityriasis rosea (−19 pp, P=0.001), and CTCL (−10 pp, P=0.009). For PCPs, significant disparities in top-3/top-1 for atopic dermatitis, Lyme, and pityriasis rosea.
- Moderators for PCPs: Those seeing mostly/all white patients had a −7 pp top-3 gap on dark vs light (P=0.009). PCPs reporting sufficient training were +5 pp more accurate on dark vs light images than those reporting insufficient training (P=0.079). No significant moderation by experience years.
- Referral behaviors:
- BCD biopsy referral: 28% overall. PCP biopsy referral: 7%; PCP referral to dermatology: 28%.
- BCDs referred for biopsy more often on dark than light skin for atopic dermatitis (P=0.008) and pityriasis rosea (P=0.015), but less often for pityriasis rubra pilaris (P=0.033) and CTCL (P=0.001). PCP biopsy referral differences were generally non-significant except for CTCL (P=0.011). PCPs were 4.4 pp more likely to refer dark skin cases to dermatologists (P=0.012).
- Effect of DLS assistance:
- Top-1 accuracy on the eight main diseases increased with control DLS from 27%→36% for BCDs (P<0.001; +33% relative) and from 13%→22% for PCPs (P<0.001; +69% relative).
- Top-3→Top-4 gains: BCDs 37%→60%; PCPs 17%→47%. If DLS suggestion replaced the third differential, top-3 would be 59% (BCD) and 46% (PCP).
- Physicians updated their differential more often with the enhanced DLS than control (BCD 47% vs 40%, P=0.009; PCP 61% vs 54%, P=0.001) on images with identical DLS predictions across conditions.
- When DLS was incorrect, top-1 accuracy decreases were minimal (−1.2 pp for both BCDs and PCPs; not significant). Overriding a correct leading diagnosis with an incorrect suggestion occurred in <2% of observations. If all three physician differentials were incorrect, physicians adopted the incorrect suggestion as top diagnosis in 10% (BCD) and 14% (PCP) of cases.
- Sensitivity improved with DLS while specificity remained high (e.g., BCD CTCL sensitivity 18%→26% with control DLS, P=0.039, and to 31% with enhanced DLS, P=0.001; specificity ~99% throughout). Similar trends were observed for PCPs; a per-disease computation sketch follows this list.
- BCDs without DLS were 5 pp more accurate (top-1) than PCPs with control DLS (P<0.001), but 4 pp less accurate than PCPs with enhanced DLS (P=0.022).
- Regression results: Specialist status +13 pp (top-1, P<0.001); dark skin −3 pp (top-1, P=0.006); DLS assistance +8 pp (top-1, P<0.001), with enhanced DLS adding +8 pp more (P=0.002). DLS assistance increased PCP accuracy disparities across skin tones by ~5 pp (top-1 P=0.008; top-4 P=0.048); no significant disparity increase for BCDs.
- User interaction effects:
- Reversing button order (putting “Keep my differential” on top) shifted choices: +9 pp to “Update my differential” and −9 pp to “Update my top prediction” (both P<0.001). Partnerships in the treatment interface had lower top-1 accuracy (BCD −12 pp, P<0.001; PCP −7 pp, P=0.011) due to reduced willingness to replace the top diagnosis.
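As referenced above, sensitivity and specificity can be computed per disease from top-1 calls in a one-vs-rest fashion; the helper below is a minimal hypothetical sketch, not the authors' analysis code:

```python
def sensitivity_specificity(rows, disease):
    """rows: iterable of (reference_label, top1_prediction) pairs."""
    tp = fn = fp = tn = 0
    for ref, pred in rows:
        if ref == disease:
            tp += pred == disease   # true positive
            fn += pred != disease   # false negative (missed case)
        else:
            fp += pred == disease   # false positive
            tn += pred != disease   # true negative
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```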
Discussion
The study establishes baseline diagnostic performance of specialists and generalists on a realistic, image-only, store-and-forward task and shows consistent, statistically significant accuracy gaps for dark versus light skin presentations. DLS-based decision support substantially improves physician accuracy and sensitivity with minimal harm to specificity and limited adoption of incorrect suggestions. However, assistance can exacerbate existing disparities, especially among generalists, highlighting that improving overall performance does not automatically correct bias. User interface design also meaningfully affects how clinicians incorporate advice, with button order influencing willingness to change the top diagnosis and thus overall accuracy. These findings underscore the need to evaluate physician–AI systems not only by aggregate accuracy but also by subgroup performance and interaction design. Clinically, enhanced triage and earlier recognition of serious conditions (e.g., CTCL) may be achievable with well-calibrated AI support, yet targeted training and fairness-aware design are essential to avoid worsening inequities across skin tones.
Conclusion
This work contributes a large-scale, randomized evaluation of physician diagnostic accuracy across skin tones and the impact of DLS decision support in a store-and-forward teledermatology setting. Key contributions include: (1) quantifying baseline specialist and generalist accuracy and documenting significant performance gaps on dark skin; (2) demonstrating that DLS assistance can substantially improve accuracy and sensitivity with stable specificity; (3) showing that assistance can widen disparities for generalists; and (4) revealing that simple UI design choices affect how clinicians integrate AI suggestions. Future research should test these findings in clinical environments with richer context, develop fairness-aware assistance that avoids exacerbating disparities, compare AI assistance with collective human intelligence supports, and explore training interventions that improve diagnostic accuracy on darker skin presentations.
Limitations
- Ecological constraints: The image-only, store-and-forward simulation with free-text responses lacks the additional history, physical exam context, and image quality control of in-person care; results may not generalize to full clinical encounters.
- Dataset scope: Images were curated from varied sources; although vetted by BCDs, non-neoplastic dermatologic diagnoses are not necessarily confirmed by histopathology. Some diseases and presentations (classic vs atypical) may be unevenly represented.
- Skin tone labeling: FST is an imperfect proxy for skin tone; although expert-adjusted, it may not capture phenotypic diversity fully.
- Participant sampling: Convenience sampling via Sermo and outreach may introduce self-selection and limit representativeness across regions and practice types.
- DLS constraints: The control model classified only nine labels; the enhanced model used a Wizard-of-Oz synthetic accuracy boost. Reported effects reflect decision support dynamics rather than a fully deployable diagnostic tool.
- Limited demographics: Physician sex/gender and other characteristics were not collected; patient demographics cannot be reliably inferred from images. Potential unmeasured confounders remain.
- Interface effects: UI manipulation changed adoption of AI suggestions; other design factors (e.g., explanation content) were not explored.