Introduction
Burns are a global health problem causing significant morbidity and mortality, and accurate assessment of their extent and depth is essential to guide treatment decisions. Misdiagnosis can lead to under- or overtreatment and inefficient use of resources. Visual assessment by clinicians can be unreliable, with reported error rates as high as 25-39%. Advanced technologies such as laser Doppler imaging or optical coherence tomography improve diagnostic accuracy, but their high cost limits accessibility, particularly in low-resource settings. Mobile health (mHealth) applications offer a partial solution by facilitating remote consultation with burn specialists, and automated image-based algorithms represent a further advancement. Previous studies have explored automated burn depth estimation using techniques ranging from handcrafted image features to deep-learning methods such as convolutional neural networks (CNNs). However, many of these studies suffer from small sample sizes, biased datasets (often lacking representation of diverse skin types), and inadequately verified burn diagnoses. Given these limitations and the need for improved accuracy and generalizability, this study aimed to develop and evaluate deep-learning algorithms for burn wound identification and classification of surgical need, accounting for variation in Fitzpatrick skin types. This is crucial for equitable access to accurate burn care across diverse populations.
Literature Review
Several studies have investigated automated burn diagnosis with varying levels of sophistication. Early approaches used handcrafted image features and required user interaction for burn segmentation and depth classification. More recent studies use deep learning, particularly CNNs with transfer learning, to differentiate burn areas from normal skin or other wound types. Deep-learning models have demonstrated promising accuracy (81-95%) in classifying burn depth, but these studies often rely on online image datasets with potential biases and without expert-verified burn diagnoses. A systematic review highlighted the risk of bias in existing studies and the need for improved results, especially given the limited representation of diverse skin types. The literature shows a gap in studies using representative patient populations, particularly those with darker skin tones, in whom burns are prevalent. Only one prior study included patients with both Caucasian and African skin types, highlighting the challenges of training algorithms on mixed skin-type datasets. There is therefore a need for studies that address this gap in data representation and algorithmic accuracy.
Methodology
This study developed and evaluated two deep-learning CNN models using a commercially available platform (Aiforia Hub). The first model identified and segmented burn wounds; the second classified burns by depth and need for surgery (skin grafting versus conservative treatment). The dataset comprised 1105 burn images and 536 background images collected from burn centers in Sweden (lighter skin types) and South Africa (mixed and darker skin types). Images were acquired with various cameras and smartphones, mimicking real-world conditions, and were manually annotated by trained nurses and medical students under expert supervision; burn depth was assessed by burn experts. Images were split into training/validation (70%) and test (30%) sets. Each model was trained three times on random selections from the training set to choose optimal hyperparameters. Performance metrics were the F1 score for wound identification and AUC, sensitivity, and specificity for surgical classification. Analyses were performed on the complete dataset and stratified by skin type, with differences in sensitivity between skin types tested using the non-parametric Mann-Whitney U-test.
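Because the study used the proprietary Aiforia platform, its exact pipeline cannot be reproduced here. The following is a minimal Python sketch of the reported evaluation metrics (F1 for wound identification; AUC, sensitivity, and specificity for surgical classification; and the Mann-Whitney U-test for skin-type comparisons). All variable names and the synthetic data are illustrative assumptions, not the study's actual data or code.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Wound identification: F1 score over binary pixel labels (wound vs. background).
true_mask = rng.integers(0, 2, size=10_000)            # simulated ground-truth annotation
pred_mask = np.where(rng.random(10_000) < 0.9,         # ~90% simulated agreement
                     true_mask, 1 - true_mask)
f1 = f1_score(true_mask, pred_mask)

# Surgical classification: AUC, plus sensitivity and specificity at a 0.5 threshold.
y_true = rng.integers(0, 2, size=300)                  # 1 = needs skin grafting
y_prob = np.clip(y_true * 0.6 + rng.random(300) * 0.5, 0, 1)  # synthetic model scores
auc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, (y_prob > 0.5).astype(int)).ravel()
sensitivity = tp / (tp + fn)                           # recall on surgical cases
specificity = tn / (tn + fp)                           # recall on conservative cases

# Skin-type comparison: non-parametric Mann-Whitney U-test on per-image sensitivities.
sens_lighter = rng.beta(8, 2, size=40)                 # hypothetical per-image values
sens_darker = rng.beta(9, 1, size=40)
u_stat, p_value = mannwhitneyu(sens_lighter, sens_darker)

print(f"F1={f1:.3f}  AUC={auc:.3f}  Se={sensitivity:.3f}  "
      f"Sp={specificity:.3f}  Mann-Whitney P={p_value:.3g}")
```

The Mann-Whitney U-test is used here, as in the study, because per-image sensitivities are bounded and not normally distributed, making a non-parametric rank test more appropriate than a t-test.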
Key Findings
The wound identification algorithm demonstrated high average accuracy, correctly identifying 87.2% of wound areas in the test set, and performed better in patients with darker skin types. In the three-fold training it reached 92.5% sensitivity in training and 85.1% in validation; the final training run yielded 93.2% sensitivity, with 86.9% on the test set. When stratified by skin type, sensitivity was significantly higher in darker skin types (P<0.001). The surgical classification algorithm achieved an AUC of 0.885 across all skin types (0.863 for lighter and 0.875 for darker skin types), indicating good discriminatory power, with 98% sensitivity and 88% specificity in training, 96% sensitivity and 71% specificity in validation, and a success rate of 64.7% in the test set. This model performed better overall in patients with lighter skin types, with a higher success rate for lighter (78.0%) than for darker (66.8%) skin types. The algorithm exhibited higher sensitivity than specificity in identifying burns that needed surgery, favoring detection of surgical cases at the cost of false alarms. False positives also occurred in a notable proportion (20%) of non-burn validation images, highlighting the need for further refinement to reduce false-positive rates.
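For illustration only, a small sketch of how the stratified analysis behind these figures could be organized: computing AUC and success rate (accuracy) per skin-type group. The column names and synthetic data are assumptions; the study's data are not public.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "skin_type": rng.choice(["lighter", "darker"], size=n),
    "needs_surgery": rng.integers(0, 2, size=n),     # 1 = surgical case
})
# Synthetic model scores loosely correlated with the label.
df["score"] = np.clip(df["needs_surgery"] * 0.5 + rng.random(n) * 0.6, 0, 1)
df["predicted"] = (df["score"] > 0.5).astype(int)

print(f"overall AUC: {roc_auc_score(df['needs_surgery'], df['score']):.3f}")

# Stratified metrics, mirroring the lighter/darker skin-type breakdown.
for group, sub in df.groupby("skin_type"):
    auc = roc_auc_score(sub["needs_surgery"], sub["score"])
    acc = accuracy_score(sub["needs_surgery"], sub["predicted"])
    print(f"{group}: AUC={auc:.3f}, success rate={acc:.1%}")
```

Note that AUC is threshold-independent while the success rate depends on the chosen decision threshold, which is one way a model can show a similar AUC across groups yet differing success rates, as reported here.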
Discussion
This study demonstrated that deep-learning algorithms can effectively support burn assessment, achieving reasonable accuracy for wound identification and classification of surgical need. The better performance of the wound identifier on darker skin types may reflect the more noticeable color difference between burned and normal skin in these patients. Conversely, the better performance of the wound classifier on lighter skin types could be due to higher image quality and less variation in image acquisition settings. The wound identification results are comparable to those of previous studies, while the surgical classification results match or surpass some existing methods, particularly in the lighter skin type subset. The high sensitivity of the surgical classification algorithm minimizes the risk of missing a surgical case, which benefits patient outcomes and resource allocation. False-positive rates, particularly in the wound identification algorithm, require further investigation. Future research should focus on improving the algorithm's ability to differentiate burns from other skin lesions and on its performance across diverse skin types. The use of open-source software and the exploration of alternative machine-learning approaches would also advance the field.
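One standard way to manage the sensitivity/specificity trade-off discussed above is to tune the decision threshold on validation scores, for example choosing the threshold that maximizes specificity subject to a sensitivity floor so that surgical cases are rarely missed. A minimal sketch under assumed synthetic scores, not the study's method:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)                  # 1 = surgical case
y_prob = np.clip(y_true * 0.55 + rng.random(500) * 0.55, 0, 1)  # synthetic scores

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
ok = tpr >= 0.95                                       # enforce a sensitivity floor
best = np.argmin(fpr[ok])                              # then maximize specificity
threshold = thresholds[ok][best]
print(f"threshold={threshold:.2f}  sensitivity={tpr[ok][best]:.2f}  "
      f"specificity={1 - fpr[ok][best]:.2f}")
```

Whether such an operating point reduces false positives enough for clinical use would need to be verified on a held-out test set, as the study's own results suggest.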
Conclusion
This study shows the potential of deep-learning algorithms to assist burn assessment, achieving notable accuracy in wound identification and surgical-need classification. While the algorithms demonstrated promising results, particularly for wound identification, further improvement is required across skin types and clinical settings. Reducing the false-positive rate and expanding the dataset to improve generalizability are crucial steps toward clinical implementation. Future research should focus on larger, more diverse datasets, pixel-level annotations of burn depth, and alternative machine-learning approaches.
Limitations
The study's limitations include the image-based classification of surgical need, which relied on expert assessment and may be subject to inter-observer variability. The dataset was collected retrospectively, which might affect image quality and annotation consistency. The use of a proprietary deep-learning platform limits reproducibility with open-source software, and differences in image quality and data heterogeneity between Sweden and South Africa may have introduced bias. Future work should address these limitations through standardized image acquisition protocols, prospective data collection, detailed pixel-level annotations, and open-source platforms to enhance reproducibility.