Development and evaluation of deep learning algorithms for assessment of acute burns and the need for surgery

Medicine and Health

C. Boissin, L. Laflamme, et al.

This study presents deep-learning algorithms for assessing acute burns and the need for surgery, with particular attention to performance across skin types. Developed by Constance Boissin, Jian Fransén, and colleagues, the wound-identification algorithm reached 87.2% accuracy in identifying burns, a step toward automated image-based support for medical evaluation.

Introduction
Burns are common injuries that carry substantial global morbidity and mortality and disproportionately affect vulnerable populations. Accurate assessment of burn extent and depth is difficult, and misdiagnosis is frequent, affecting treatment decisions and resource use. Advanced imaging such as laser Doppler imaging and optical coherence tomography (OCT) is often unavailable, particularly in low-resource settings. mHealth has enabled remote expert consultation, but automated image-based methods could scale diagnostic support further. Prior automated approaches (classical hand-crafted features and, more recently, CNNs) show promise but suffer from bias, small or inadequately labeled datasets, and limited representation of darker skin types. This study aims to: (1) develop a deep-learning algorithm to identify and segment acute burn wounds; (2) develop a classifier to determine whether a burn requires surgery (deep-partial/full-thickness) or conservative care; and (3) evaluate performance across Fitzpatrick skin types (1–2 vs 3–6).
Literature Review
Earlier work used hand-crafted features with semi-automated segmentation and multi-class depth classification, achieving moderate accuracy but limited by small datasets and the need for user input. Recent CNN-based methods improved segmentation and depth classification (reported accuracies of roughly 81–95%), yet many relied on web-scraped images without clinical ground truth, introducing bias. Studies have predominantly focused on lighter skin tones, risking poor generalizability; only one prior study included both Caucasian and African patients, highlighting the difficulty of training across skin types. Remote, expert-supported mHealth assessments can be accurate and are accepted by clinicians, suggesting that automated tools could complement expert input. Systematic reviews note risks of bias and call for better-curated, more diverse datasets and robust evaluation protocols.
Methodology
Design: Proof-of-concept development and evaluation of two CNN-based algorithms using a commercially available platform (Aiforia Create/Hub). Two independent models were trained: (1) a wound identification/segmentation model (burn vs non-burn pixels) and (2) a wound severity classifier (surgery needed vs not needed) based on burn depth.

Data sources: A total of 1105 burn images from Sweden (n=391; mostly Fitzpatrick 1–2) and South Africa (n=714; mostly Fitzpatrick 3–6), representing 387 patients (51% children). Of these, 339 images (31%) required surgery per expert assessment. An additional 536 background images with visible skin were obtained from public datasets (ImageNet, LFW) to improve discrimination between wounds and non-wound backgrounds.

Inclusion criteria: Acute burns photographed within 48 h post-injury, with wounds undressed, cleaned, and scrubbed (blisters removed). Diverse real-world capture conditions (device, distance, orientation, lighting) were retained to mimic clinical use; capture was not standardized.

Annotations: Pixel-level binary masks delineating burn versus normal skin/background, drawn in ImageJ or directly in Aiforia. Multiple trained annotators (nurses and medical students) worked under supervision, with verification by burn experts. Surgical classification labels were image-level, derived from clinical expert depth diagnosis (deep-partial/full thickness = surgery; superficial/superficial-partial = no surgery).

Preprocessing/scaling: Anthropometric measurements were used to approximate pixel size and set the scale per image on the training platform. Feature unit size was set to 125 (segmentation) and 190 (classification).

Training/validation/testing: Images were split into a training/validation set (70%; n=773) and an independent test set (30%; n=332), preserving proportions by skin type and severity (a stratified-split sketch follows this section). For each algorithm, three training runs with random 70/30 splits were used for internal validation and hyperparameter selection, followed by a final training on 100% of the training set and evaluation on the held-out test set. Additional stratified trainings and tests were run separately for lighter (Fitzpatrick 1–2) and darker (3–6) skin types.

Hyperparameters and augmentations (Table 1): 30,000 iterations; weight decay 0.0001; mini-batch size 20; mini-batches per iteration 20; patience 750 iterations without progress; initial learning rate 0.15. Augmentations: scale ±40%, aspect ratio ±30%, shear ±30%, luminance ±40%, contrast ±40%, white balance ±5%, compression quality 40–60%, rotation 0–360° (an illustrative augmentation sketch also follows this section).

Outcomes and metrics:
- Wound identification: pixel-level sensitivity (recall), precision (PPV), and F1 score per image, aggregated across images. For background images, the proportions with any predicted burn and with >5% burn pixels were recorded.
- Surgical classification: ROC curves and AUC. An image was classified as surgical if ≥1% of its wound pixels were predicted surgical. Success rate = overall accuracy; sensitivity = proportion of true surgical images with ≥1% surgical pixels; specificity = proportion of non-surgical images with <1% surgical pixels. 95% CIs were computed with the Clopper-Pearson exact method. (Metric sketches follow this section.)

Ethics: Approvals from the Uppsala Regional Ethics Board (2016/279), Stellenbosch HREC (N13/02/024), and UKZN BREC (BCA106/14).
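For readers who want to see the split logic concretely, the following is a minimal sketch of a stratified 70/30 split using scikit-learn; the image_paths, skin_type, and needs_surgery variables are hypothetical stand-ins for the study's metadata, not its actual data structures.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy metadata standing in for the 1105 study images;
# the real pipeline's data structures are not public.
image_paths = [f"img_{i:04d}.jpg" for i in range(1105)]
skin_type = ["1-2" if i < 391 else "3-6" for i in range(1105)]
needs_surgery = [i % 10 < 3 for i in range(1105)]  # roughly the reported 31% surgical

# Joint stratification key so both Fitzpatrick group and surgical status
# keep their proportions in the 70/30 train/test split.
strata = [f"{s}_{y}" for s, y in zip(skin_type, needs_surgery)]

train_paths, test_paths = train_test_split(
    image_paths,
    test_size=0.30,   # held-out test set (n=332 of 1105)
    stratify=strata,
    random_state=0,   # hypothetical seed
)
print(len(train_paths), len(test_paths))  # 773 332
```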
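The Table 1 augmentation ranges can be approximated with standard library transforms. Below is an illustrative torchvision sketch under that assumption; Aiforia's internal implementation is proprietary, so the parameter mapping is approximate, and the aspect-ratio and JPEG-quality augmentations have no direct one-line equivalent here and are omitted.

```python
from PIL import Image
from torchvision import transforms

# Rough torchvision stand-in for the Table 1 augmentation ranges
# (illustrative only, not the platform's actual pipeline).
augment = transforms.Compose([
    transforms.RandomAffine(
        degrees=360,          # rotation 0-360 degrees
        scale=(0.6, 1.4),     # scale +/-40%
        shear=30,             # shear +/-30 degrees
    ),
    transforms.ColorJitter(
        brightness=0.4,       # luminance +/-40%
        contrast=0.4,         # contrast +/-40%
        hue=0.05,             # crude proxy for +/-5% white balance
    ),
])

img = Image.new("RGB", (256, 256), color=(180, 120, 100))
augmented = augment(img)  # returns a randomly transformed PIL image
```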
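The pixel-level metrics used for the wound identifier are standard; here is a minimal NumPy sketch, assuming the predicted and ground-truth burn masks for one image are available as equal-shaped boolean arrays.

```python
import numpy as np

def segmentation_metrics(pred_mask: np.ndarray, true_mask: np.ndarray):
    """Pixel-level sensitivity (recall), precision (PPV), and F1 for one
    image, given boolean burn masks of equal shape. A sketch of the
    metrics described above, not the study's code."""
    tp = np.sum(pred_mask & true_mask)    # burn pixels correctly found
    fp = np.sum(pred_mask & ~true_mask)   # background flagged as burn
    fn = np.sum(~pred_mask & true_mask)   # burn pixels missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return float(sensitivity), float(precision), float(f1)

# Toy 4x4 example: 3 of 4 true burn pixels recovered, 1 false positive.
true_mask = np.zeros((4, 4), dtype=bool); true_mask[1:3, 1:3] = True
pred_mask = true_mask.copy(); pred_mask[1, 1] = False; pred_mask[0, 0] = True
print(segmentation_metrics(pred_mask, true_mask))  # (0.75, 0.75, 0.75)
```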
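The image-level surgical decision rule and the Clopper-Pearson intervals can likewise be sketched. The function names below are hypothetical; the example counts (86 of 93) come from the test-set results in Key Findings, and the interval shown is over image counts, which need not match the paper's reported CI since its exact denominator is not restated here.

```python
import numpy as np
from scipy.stats import beta

def classify_surgical(pred_surgical_mask: np.ndarray, wound_mask: np.ndarray,
                      threshold: float = 0.01) -> bool:
    """Image-level rule described above: flag 'needs surgery' when at
    least 1% of predicted wound pixels are classified surgical."""
    wound_pixels = wound_mask.sum()
    if wound_pixels == 0:
        return False
    return bool((pred_surgical_mask & wound_mask).sum() / wound_pixels >= threshold)

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) confidence interval for a proportion k/n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Illustration: 86 of 93 surgical test images flagged ("only 7/93 missed").
print(86 / 93)                  # ~0.925 sensitivity
print(clopper_pearson(86, 93))  # exact 95% CI for 86/93
```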
Key Findings
Wound identification/segmentation:
- In the three-fold trainings, burn was detected in 13.1% of non-burn training images (mostly small areas; >5% of pixels in only 6/1147), and 0.5% of burn images were missed. In validation, 20.0% of non-burn images had some predicted burn, with >5% of pixels in 6/464.
- Aggregated three-fold results: sensitivity 92.5% (training) and 85.1% (validation).
- Final training vs test: sensitivity 93.2% (training) and 86.9% (test); test precision 83.4%; test F1 82.9%.
- Skin-type stratification (test): higher sensitivity in darker skin than in lighter skin (89.3% vs 78.6%; P<0.001), with corresponding F1 scores of 87.8% vs 76.9%.

Surgical classification (need for surgery):
- Three-fold trainings: sensitivity 98% (training) and 96% (validation); specificity 88% (training) and 71% (validation). Final training: sensitivity 99.6%, specificity 93.4%.
- Independent test set (n=332): AUC 0.885; success rate 64.5%; sensitivity 92.5% (95% CI 89.1–95.1); specificity 53.6% (95% CI 48.1–59.1). Only 7/93 surgical cases were missed (an arithmetic consistency check follows this list).
- By skin type on the test set:
  • Lighter (n=118): AUC 0.863; success rate 78.0%; sensitivity 75.0% (66.2–82.5); specificity 78.6% (70.1–85.6).
  • Darker (n=214): AUC 0.875; success rate 66.8%; sensitivity 97.3% (94.1–99.0); specificity 51.1% (44.2–58.0).
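As a sanity check, the headline test-set figures are mutually consistent. A short calculation, assuming 93 surgical and 239 non-surgical test images (derived from "7/93 missed" and n=332):

```python
# Arithmetic check of the overall test-set figures, assuming 93 surgical
# images (from "7/93 missed") and 332 - 93 = 239 non-surgical images.
n_surg, n_non = 93, 332 - 93
tp = n_surg - 7                # 86 surgical images correctly flagged
tn = round(0.536 * n_non)      # specificity 53.6% -> ~128 true negatives
accuracy = (tp + tn) / (n_surg + n_non)
print(tp / n_surg)             # ~0.925 (reported sensitivity 92.5%)
print(tn / n_non)              # ~0.536 (reported specificity 53.6%)
print(accuracy)                # ~0.645 (reported success rate 64.5%)
```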
Discussion
The study demonstrates that CNN-based algorithms can identify and segment acute burn wounds with reasonable accuracy in heterogeneous, real-world images, and can classify surgical need with good sensitivity but modest specificity, especially when aggregating across diverse skin types and settings. The wound identifier performed better in darker skin types, likely due to stronger color contrast between burned and normal skin, while the surgical classifier performed better in lighter skin, potentially reflecting higher and more homogeneous image quality. Compared with literature, segmentation performance (F1 ~83%) aligns with prior deep-learning studies. Classification results are comparable to or better than earlier hand-crafted feature approaches for similar populations but are challenged by dataset diversity and weak supervision. High sensitivity for surgical need suggests utility in triage to reduce missed surgical cases, especially relative to reported clinician performance in some settings. However, false positives and lower specificity may increase unnecessary referrals if used autonomously, indicating a role as decision support rather than replacement. Stratified analyses show performance differences by skin type, underscoring the need for diverse, balanced training data and potentially models that incorporate skin type as an input or adapt to it.
Conclusion
Two deep-learning algorithms were developed and evaluated for acute burn assessment: a wound identifier with high accuracy and a surgical-need classifier with strong sensitivity and moderate specificity. Performance varied by skin type, with better segmentation in darker skin and better surgical classification accuracy in lighter skin. These results support the feasibility of automated, image-based assistance for frontline clinicians. Future work should expand and diversify datasets, include pixel-level severity annotations, explore model designs that account for skin type and imaging heterogeneity, improve specificity, and address integration, usability, and acceptability in clinical workflows.
Limitations
- Surgical-need labels were at the image level (weak supervision), introducing label noise for pixel-wise classification.
- Heterogeneity between the Swedish and South African datasets (capture devices, settings, image quality), and class imbalance, with more surgical cases from South Africa.
- Potential annotator variability; inter-rater reliability was not quantified for this larger dataset.
- Use of a commercial platform limits transparency into the exact CNN architectures; other modeling approaches (e.g., SVMs, decision trees) were not compared.
- Dataset size, especially within stratified subsets, may be insufficient for generalization across all skin types and conditions.
- Non-burn images yielded some false positives; although unlikely to matter in practice, this could affect cascaded pipelines.
- Inclusion was limited to cleaned, undressed wounds; generalizability to unprepared wounds is unknown.
- Additional clinical variables that might aid classification (e.g., mechanism, blanching, capillary refill) were not available.
- The dynamic evolution of burns early post-injury complicates ground truth; waiting for healing endpoints was not feasible.