Skin Tone Analysis for Representation in Educational Materials (STAR-ED) using machine learning

Medicine and Health


G. A. Tadesse, C. Cintas, et al.

The STAR-ED framework reveals a stark underrepresentation of dark skin tones in medical education materials, an imbalance that may contribute to diagnostic disparities in skin diseases across racial groups. The research, led by Girmaw Abebe Tadesse and Celia Cintas of IBM Research - Africa, uses machine learning to quantify these imbalances at scale.
Introduction

Medical education materials (textbooks, lecture notes, journal articles) lack adequate representation of darker skin tones, especially Fitzpatrick skin types (FST) V–VI. Because skin disease can present differently across skin tones, insufficient representation may contribute to delayed diagnoses and worse outcomes for patients of color. Prior analyses documenting this bias relied on labor-intensive manual image review, which does not scale and is prone to labeling variability. This work proposes STAR-ED, an automated machine learning framework to ingest academic materials, identify skin images, segment skin pixels, and estimate skin tone categories (FST I–IV vs. FST V–VI) to quantify representation at scale.

Literature Review

Previous studies have manually evaluated dermatology-related educational materials and found underrepresentation of FST V–VI. During the COVID-19 pandemic, published images of cutaneous manifestations also underrepresented dark skin. Automated skin tone analysis has largely focused on curated datasets (e.g., ISIC, SD-19) rather than real-world educational materials. ITA (individual typology angle) computed from pixel intensities has been used to infer FST, but ML models trained directly to classify FST outperform ITA-based mappings and are less sensitive to lighting. Prior work also shows curated dermatology datasets underrepresent dark skin tones, underscoring the need for automated, scalable bias assessment tools applicable to heterogeneous educational content.
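The ITA-to-FST mapping mentioned above can be sketched in a few lines. ITA is computed from the CIE LAB lightness L* and the yellow-blue component b*; the binary cutoff below (ITA ≤ 10° mapped to FST V–VI) follows commonly cited Chardon-style bands and is an illustrative assumption, not necessarily the exact mapping the paper evaluates.

```python
import math

def ita_degrees(L, b):
    """Individual typology angle from CIE LAB L* and b*.

    atan2 is used instead of atan((L - 50) / b) so that b = 0
    does not divide by zero; for b > 0 the two are identical.
    """
    return math.degrees(math.atan2(L - 50.0, b))

def ita_to_group(ita):
    """Binary grouping used in the paper (FST I-IV vs. V-VI).

    The 10-degree threshold follows commonly cited Chardon-style
    ITA bands and is an assumption for illustration.
    """
    return "FST V-VI" if ita <= 10.0 else "FST I-IV"
```

A light-skin pixel (e.g., L* = 70, b* = 15) yields an ITA near 53°, while a darker pixel (L* = 30, b* = 15) yields roughly −53°, which the threshold assigns to FST V–VI.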

Methodology

STAR-ED is an end-to-end pipeline comprising: (1) document ingestion, (2) skin image detection, (3) skin pixel segmentation, and (4) skin tone estimation.

  • Document ingestion: The Corpus Conversion Service (CCS) parses PDFs (and other formats) to structured JSON, extracting images, tables, captions, page numbers, and coordinates.
  • Skin image detection: For each extracted image, features include Histogram of Oriented Gradients (HOG; 32-bin orientation histograms) and color statistics in CIE LAB (means and standard deviations of L, a, b). The final 38-D feature vector (HOG + LAB stats) is classified using SVM (RBF kernel; nu=0.01, gamma=0.05) and XGBoost (hyperparameters selected via 3-fold CV). Performance metrics: AUROC and F1. Training/validation used five-fold stratified CV on the DermEducation dataset, with external testing on images extracted from four textbooks.
  • Skin pixel segmentation: A color-based approach segments skin pixels, masking non-skin foreground/background (e.g., clothing, equipment). RGB images are converted to HSV and YCbCr with published skin ranges; region growing, watershed, and morphological operations are applied. Validation uses a 22-image SegmentedSkin dataset with dermatologist-created masks (healthy skin masks exclude lesions), reporting pixel-level rates and Jaccard index.
  • Skin tone estimation: Skin-only pixels from segmented images are used to classify skin tone into two groups: FST I–IV vs. FST V–VI. Two families of approaches were evaluated on Fitzpatrick17K with stratified five-fold CV: (a) traditional ML on engineered features (HOG + ITA + CIE LAB statistics) using ensemble methods (Random Forest, Balanced Random Forest, Extremely Randomized Trees, AdaBoost, Gradient Boosting), and (b) a deep learning model: pretrained ResNet-18 (11,689,512 parameters) fine-tuned on Fitzpatrick17K with weighted cross-entropy, SGD optimizer, 20 epochs, learning rate 1e-3 (linear decay), batch size 32. Data splits for Fitzpatrick17K: 70% train, 10% validation, 20% test. External testing used the four textbooks and DermEducation. Software: scikit-learn v0.24.2, imbalanced-learn, PyTorch v1.8.1, SciPy/NumPy stack.
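The detection stage's feature extraction can be sketched as assembling a 38-D vector: a 32-bin gradient-orientation histogram (HOG-style) plus per-channel mean and standard deviation. To keep the sketch dependency-free, raw RGB statistics stand in for the paper's CIE LAB statistics; treat the whole function as illustrative rather than the exact feature pipeline.

```python
import numpy as np

def detection_features(img):
    """38-D feature vector for a skin-image detector sketch.

    img: H x W x 3 array (any numeric dtype). Returns 32 bins of
    a magnitude-weighted gradient-orientation histogram plus the
    mean and std of each color channel (RGB here as a stand-in
    for CIE LAB).
    """
    gray = img.astype(float).mean(axis=2)
    gy, gx = np.gradient(gray)                      # row/col gradients
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0    # orientation in [0, 360)
    hist, _ = np.histogram(ang, bins=32, range=(0.0, 360.0), weights=mag)
    hist = hist / (hist.sum() + 1e-8)               # normalize histogram
    channels = img.reshape(-1, 3).astype(float)
    stats = np.concatenate([channels.mean(axis=0), channels.std(axis=0)])
    return np.concatenate([hist, stats])            # 32 + 6 = 38 dims
```

The resulting vector is what would be fed to the SVM or XGBoost classifier described above.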
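The segmentation stage's color rules can be illustrated per pixel. The HSV and YCbCr ranges below are close to commonly published skin-color ranges but are assumptions for this sketch; the paper additionally applies region growing, watershed, and morphological operations, which are omitted here.

```python
import colorsys

def is_skin_pixel(r, g, b):
    """Rule-based skin test for one RGB pixel (values 0-255).

    Combines an HSV range with a YCbCr range (ITU-R BT.601
    conversion). Thresholds are illustrative, near commonly
    published values, not the paper's exact parameters.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hsv_ok = h * 360.0 <= 50.0 and 0.15 <= s <= 0.75 and v >= 0.35
    # ITU-R BT.601 RGB -> YCbCr chroma components
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    ycbcr_ok = 77 <= cb <= 127 and 133 <= cr <= 173
    return hsv_ok and ycbcr_ok
```

In practice such a rule accepts typical skin tones (e.g., RGB (224, 172, 145)) and rejects clearly non-skin colors (e.g., a blue pixel), with the morphological post-processing cleaning up stragglers.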
Key Findings
  • Skin image detection: On DermEducation (5-fold CV), XGBoost achieved F1 = 0.96 ± 0.008 and AUROC = 0.95 ± 0.013; SVM performed similarly. External test on four textbooks: average AUROC = 0.96 ± 0.02 and F1 = 0.90 ± 0.06 across books.
  • Skin pixel segmentation (vs. expert masks on SegmentedSkin): average false positive rate 0.24, false negative rate 0.05, true positive rate 0.36, true negative rate 0.34, Jaccard index 0.51, accuracy 0.70.
  • Skin tone estimation on Fitzpatrick17K (cross-validation): pretrained ResNet-18 (STAR-ED) using masked pixels achieved accuracy 0.90 ± 0.00, F1 0.91 ± 0.00, precision 0.91 ± 0.00, AUROC 0.77 ± 0.02; among traditional models, Balanced Random Forest had the best recall (0.77) for both classes but overall underperformed ResNet-18.
  • External validation: DermEducation—ResNet-18 achieved AUROC 0.87 and F1 0.91; balanced trees achieved AUROC 0.82 and F1 0.80. ITA-to-FST mapping performed worst (F1 0.36).
  • Representation quantification in textbooks: Across four commonly used dermatology textbooks, FST V–VI images are underrepresented, with each book containing ≤10.5% FST V–VI among its skin images; overall, brown and black skin tones constitute only about 10.5% of skin images.
  • Efficiency: STAR-ED replicates manual bias assessments that previously required >100 person-hours in minutes.
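The segmentation figures above can be reproduced from a predicted mask and an expert mask with a few array operations. Note that the reported rates sum to roughly 1, so they are fractions of all pixels rather than conditional rates; the sketch below follows that convention.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-level rates and Jaccard index for a predicted skin
    mask vs. an expert mask (boolean arrays of equal shape).

    Rates are reported as fractions of all pixels, matching the
    convention in the SegmentedSkin results above.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    n = pred.size
    return {
        "tpr": tp / n, "fpr": fp / n, "fnr": fn / n, "tnr": tn / n,
        "jaccard": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / n,
    }
```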
Discussion

STAR-ED addresses the need for scalable, objective assessment of skin tone representation in dermatology educational materials. By automating image extraction, skin image selection, segmentation, and tone classification, the framework enables large-scale audits that were previously impractical. Results show robust skin image detection across diverse textbooks and strong skin tone classification performance on external datasets, reproducing known biases, namely the marked underrepresentation of FST V–VI. The system's modularity and use of lightweight segmentation plus a fine-tuned ResNet enable practical deployment across varied document formats (PDFs, scanned images, slides, Word documents). While the segmentation is approximate, downstream tone estimation remains accurate enough to quantify representation disparities. The findings reinforce literature reports of inequity in skin tone depiction and provide a tool for educators and publishers to identify and address biases pre- and post-publication.

Conclusion

This work introduces and validates STAR-ED, an end-to-end machine learning framework that automatically ingests educational materials to quantify skin tone representation. STAR-ED demonstrates high performance for skin image detection and skin tone estimation, and confirms a significant underrepresentation of FST V–VI in core dermatology textbooks. The framework enables rapid, scalable bias assessments, offering practical utility for educators, publishers, and clinicians. Future work includes integrating multimodal content (e.g., associated text, tables), improving segmentation to exclude lesional skin, exploring more granular tone classifications and alternative skin tone scales, piloting with publishers globally, and extending the approach to other domains to assess representation of diverse populations.

Limitations
  • Current pipeline analyzes images only; non-image content (text, author lists, tables) is not considered, limiting multimodal context for representation analysis.
  • Skin pixel segmentation does not explicitly exclude lesional/diseased skin regions, which may bias tone estimation away from true baseline skin tone.
  • Skin tone classification is binary (FST I–IV vs. V–VI), reducing granularity across the spectrum.
  • Image-based tone assessment is sensitive to lighting, camera color balance, and photography conditions.
  • Ground-truth labels for some datasets were provided by trained non-experts; while validated against subsets with expert labels and showing high agreement, labeling variability remains.
  • Dependence on the Fitzpatrick scale, which has known biases and subjectivity; alternative scales may be preferable in future iterations.