Introduction
The use of deep learning systems (DLS) in medical image diagnosis shows promise for augmenting clinical decision-making. However, the effectiveness of physician-machine partnerships remains an open question, particularly for the diagnosis of underrepresented populations, where both physicians and algorithms are prone to systematic errors. Existing research indicates that DLS-based clinical decision support can improve diagnostic accuracy for generalists in teledermatology, but questions remain about its performance across physician expertise levels and skin tones. Racial bias in medicine is well documented and affects dermatology through underrepresentation in textbooks, residency programs, research, and the training data for machine learning algorithms. While deep learning models offer potential for enhancing clinical decision-making, algorithmic audits reveal systematic errors, especially in diagnosing skin diseases on darker skin tones. This study aimed to evaluate the diagnostic accuracy of specialist and generalist physicians across skin tones in a simulated store-and-forward teledermatology setting, and to assess how a physician-machine partnership affects accuracy disparities across skin tones. The study draws on methods from digital experiments in the social sciences, recruiting a large number of physicians through online platforms and applying insights from gamified behavioral experiments to improve engagement. The methodology follows algorithmic auditing principles to systematically evaluate errors and bias in both the DLS and physician performance.
Literature Review
Previous research has demonstrated the potential of deep learning systems to improve diagnostic accuracy in dermatology, particularly for generalist physicians. However, concerns remain about bias in these systems, especially for the diagnosis of skin diseases in individuals with darker skin tones. Studies have shown that existing dermatology datasets and algorithms often lack representation of diverse skin tones, leading to performance disparities. Recent efforts have focused on increasing transparency and addressing these disparities by annotating images with Fitzpatrick skin types, developing more diverse datasets, and creating synthetic images. However, the impact of these efforts on physician performance, and the interaction between physicians and AI in diagnosing skin conditions across skin tones, remains largely unexplored. This study builds on existing research in store-and-forward teledermatology and algorithmic auditing to address these critical gaps.
Methodology
A custom digital experiment mimicking store-and-forward teledermatology was designed to evaluate physician diagnostic accuracy on images of inflammatory skin diseases. A set of 364 images covering 46 skin diseases was curated, with a near-uniform distribution across skin tones (estimated Fitzpatrick Skin Type, FST). The images primarily depicted eight common diseases: atopic dermatitis, cutaneous T-cell lymphoma (CTCL), dermatomyositis, lichen planus, Lyme disease, pityriasis rosea, pityriasis rubra pilaris, and secondary syphilis. Participants were 1118 physicians from 39 countries, including 389 board-certified dermatologists (BCDs) and 459 primary care physicians (PCPs). Participants were randomly assigned to conditions crossing two versions of a DLS (a control DLS and a synthetically enhanced ‘treatment’ DLS with higher accuracy) with two clinical decision support interfaces. The experiment comprised a pre-survey, an image-based differential diagnosis task (up to three diagnoses per image), and DLS-based decision support in which physicians could choose whether to integrate the DLS suggestion. A gamified design with feedback after each trial was used to increase engagement. Diagnostic accuracy was measured as top-1, top-3, and top-4 accuracy (the latter including the DLS suggestion if accepted). Analyses examined accuracy differences across skin tones, physician experience, and DLS assistance, using ordinary least squares regression with robust standard errors clustered at the participant level.
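As an illustration, the top-k accuracy metric described above can be sketched in a few lines of Python. This is a minimal sketch with hypothetical disease labels, not the study's actual scoring code: a case counts as a hit if the true diagnosis appears among the first k entries of the physician's ranked differential.

```python
def top_k_accuracy(differentials, truths, k):
    """Fraction of cases where the true diagnosis appears among
    the first k entries of the ranked differential diagnosis."""
    hits = sum(truth in diff[:k] for diff, truth in zip(differentials, truths))
    return hits / len(truths)

# Hypothetical example: three cases, each with a ranked differential.
diffs = [
    ["psoriasis", "lichen planus", "atopic dermatitis"],
    ["secondary syphilis", "pityriasis rosea"],
    ["dermatomyositis"],
]
truth = ["lichen planus", "tinea corporis", "dermatomyositis"]

print(top_k_accuracy(diffs, truth, 1))  # top-1: only the third case hits -> 1/3
print(top_k_accuracy(diffs, truth, 3))  # top-3: cases one and three hit -> 2/3
```

Top-4 accuracy with DLS assistance would follow the same pattern, with the accepted DLS suggestion appended as a fourth entry in the differential.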
Key Findings
A total of 14,261 differential diagnoses were collected. Without DLS assistance, BCDs achieved a top-3 accuracy of 38% and PCPs 19%, and both groups were 4 percentage points less accurate on darker skin (FST 5-6) than on lighter skin (FST 1-4). The DLS significantly improved diagnostic accuracy: with the control DLS, BCDs' top-1 accuracy increased from 27% to 36% and PCPs' from 13% to 22%, while top-4 accuracy increased from 37% to 60% for BCDs and from 17% to 47% for PCPs. However, the DLS exacerbated the accuracy gap for PCPs on darker skin, and PCPs who reported seeing mostly white patients were 7 percentage points less accurate on darker skin images. The study also found differences in referral rates across skin tones: BCDs referred the common diseases for biopsy at higher rates on darker skin, yet referred CTCL for biopsy less often on darker skin, while PCPs were more likely to refer darker-skinned patients for a dermatologist’s opinion. Finally, the user interface design (the order of buttons) significantly affected whether physicians updated their top prediction with the DLS suggestion.
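The percentage-point gaps reported above are differences in per-group accuracy, computed from trial-level records. The following sketch shows one way to compute such a gap; the record format and group labels here are illustrative, not the study's actual data schema.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, correct) pairs.
    Returns a dict mapping each group to its accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical trial records: (skin-tone group, was the diagnosis correct?)
trials = [
    ("FST 1-4", True), ("FST 1-4", True), ("FST 1-4", False),
    ("FST 5-6", True), ("FST 5-6", False), ("FST 5-6", False),
]
acc = accuracy_by_group(trials)
gap = acc["FST 1-4"] - acc["FST 5-6"]  # accuracy gap in this toy sample: 1/3
```

In the study itself, such gaps were estimated with OLS regression rather than raw differences, with standard errors clustered at the participant level to account for repeated trials per physician.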
Discussion
The findings reveal significant disparities in diagnostic accuracy for skin diseases across skin tones, highlighting the need for more training and awareness among physicians. They also show that while a DLS can significantly improve overall diagnostic accuracy, it does not automatically correct existing biases: the widened accuracy gap for DLS-assisted PCPs illustrates how AI can worsen existing inequalities if not carefully designed and implemented. The lower-than-expected accuracy among specialists suggests that diagnosing from a single image, without additional clinical context, substantially limits accuracy even for experts. Overall, the study underscores that physician-machine partnerships need careful design to avoid exacerbating biases and to support physicians, especially in challenging diagnostic scenarios.
Conclusion
This study provides crucial evidence on the limitations of current dermatology diagnostic practices and the potential for AI to both improve and exacerbate existing biases. The results emphasize the need for continued research to develop fair and equitable AI tools, along with improved training for physicians on diagnosing skin diseases in individuals with diverse skin tones. Future research should focus on addressing bias in AI algorithms, exploring different methods for integrating AI decision support into clinical workflows, and investigating how collective human intelligence can improve diagnostic accuracy and mitigate bias.
Limitations
The study used a simulated teledermatology setting, which lacks the richness of information available in an in-person clinical visit, and the reliance on a single image per case may not fully capture the complexity of clinical decision-making. The use of estimated FST labels introduces potential misclassification of skin tone. Finally, the convenience sample of physicians, while large, may not fully represent the global population of physicians.