Introduction
Osteosarcoma (OS), or osteogenic sarcoma, is a highly aggressive cancer originating from mesenchymal cells with osteoblastic differentiation. It is the most prevalent primary bone cancer, particularly affecting children and adolescents during their growth spurt. Despite advances in treatment, including chemotherapy and limb-salvage surgery, OS carries a significant risk of recurrence and metastasis, resulting in 5-year survival rates ranging from 27% to 65% across all age groups. The aggressive nature of OS underscores the critical need for improved diagnostic and treatment strategies to enhance patient prognosis. Artificial intelligence (AI) and machine learning (ML), particularly deep learning (DL), are emerging as promising tools in this area, capable of analyzing complex medical images and potentially identifying subtle patterns indicative of disease or treatment response. This research evaluates several deep learning architectures to determine their efficacy in classifying osteosarcoma histopathological images, aiming to contribute to advances in OS diagnosis and management.
Literature Review
Traditional machine learning methods such as Support Vector Machines (SVMs) and Random Forests (RFs) have long been applied to image classification. However, these methods rely on hand-crafted features extracted from images, which limits their effectiveness on large, variable datasets. Deep learning architectures, particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have demonstrated superior performance across a range of image analysis tasks, surpassing traditional ML approaches in microscopy. In contrast to SVMs and RFs, CNNs learn features directly from the data, eliminating the need for manual feature engineering. The success of DL, however, hinges on the quality and size of the training dataset and on careful network design. The application of ML in histopathology is relatively recent, with early studies relying on simple image analysis. As technology has progressed, more advanced techniques such as texture analysis and pattern recognition have been explored to improve Whole Slide Image (WSI) analysis. WSIs, while information-rich, present significant challenges due to their enormous size, demanding substantial computational resources. The current study addresses these challenges through a comparative evaluation of deep learning architectures and image input sizes.
Methodology
This study employed a publicly available dataset of osteosarcoma cross-sections from UT Southwestern/UT Dallas. The dataset contained 1144 tiles (1024x1024 pixels) manually classified by pathologists into three categories: Non-Tumor (NT), Viable Tumor (VT), and Necrosis (NC). Several state-of-the-art deep learning architectures were evaluated: VGG, ResNet, MobileNetV2, EfficientNet, and ViT. Transfer learning was used: models pre-trained on ImageNet were fine-tuned on the osteosarcoma dataset. Several sizes (parameter counts) of each architecture were tested to assess the impact of network complexity, and experiments were conducted with different image input sizes (1024x1024, 896x896, 512x512, and 256x256 pixels) to investigate the effect of input resolution on model performance. All computations were performed using PyTorch on a single NVIDIA Titan Xp GPU. The Adam optimizer with decoupled weight decay (AdamW) was used, along with a cosine annealing learning-rate schedule, and data augmentation (random flipping and rotation) was applied to enrich the training data. The networks were evaluated using the per-class F1 score (the harmonic mean of precision and recall) and the macro-averaged F1 score for overall performance. A 70/30 train-test split was used initially, followed by 5-fold cross-validation on the best-performing configuration to obtain a more robust performance estimate. Additional metrics (accuracy, specificity, recall, precision, confusion matrices, and ROC curves) were calculated to provide a comprehensive performance analysis.
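To make this setup concrete, the following is a minimal PyTorch sketch of the pipeline described above: an ImageNet-pre-trained MobileNetV2 fine-tuned on 256x256 tiles with flip/rotation augmentation, AdamW, and cosine annealing. The dataset path, batch size, learning rate, weight decay, and epoch count are illustrative assumptions, not values reported by the study.

```python
# Minimal fine-tuning sketch under stated assumptions; not the study's exact code.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 3   # Non-Tumor (NT), Viable Tumor (VT), Necrosis (NC)
INPUT_SIZE = 256  # best-performing input resolution reported in the study
NUM_EPOCHS = 30   # assumption; the training schedule is not stated here

# Augmentation as described in the methodology: random flipping and rotation.
train_tf = transforms.Compose([
    transforms.Resize((INPUT_SIZE, INPUT_SIZE)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics,
                         std=[0.229, 0.224, 0.225]),  # matching the pre-trained weights
])

# Hypothetical directory layout: one sub-folder per class (NT/VT/NC).
train_ds = datasets.ImageFolder("data/osteosarcoma/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

# Transfer learning: ImageNet-pre-trained MobileNetV2 with a new 3-class head.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.last_channel, NUM_CLASSES)

# Adam with decoupled weight decay (AdamW) plus cosine annealing, as in the study.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
criterion = nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(NUM_EPOCHS):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # one cosine-annealing step per epoch
```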
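A companion sketch, using scikit-learn, shows how the evaluation metrics listed above could be computed; it assumes the `model` and `device` from the training sketch and a hypothetical `test_loader` over the 30% held-out split, with per-class one-vs-rest AUC standing in for the per-class ROC analysis.

```python
# Hedged evaluation sketch: per-class F1, macro F1, confusion matrix, and
# one-vs-rest ROC AUC per class. Assumes `model`, `device`, and `test_loader`
# exist as in the training sketch above.
import numpy as np
import torch
from sklearn.metrics import classification_report, confusion_matrix, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

model.eval()
labels_list, probs_list = [], []
with torch.no_grad():
    for images, labels in test_loader:
        logits = model(images.to(device))
        probs_list.append(torch.softmax(logits, dim=1).cpu().numpy())
        labels_list.append(labels.numpy())

y_true = np.concatenate(labels_list)
y_prob = np.concatenate(probs_list)
y_pred = y_prob.argmax(axis=1)

class_names = ["NT", "VT", "NC"]  # assumes this matches the dataset's label ordering; adjust if needed
print(classification_report(y_true, y_pred, target_names=class_names))  # per-class precision/recall/F1
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))

# One class-vs-rest ROC AUC per class, mirroring the per-class ROC analysis.
y_true_bin = label_binarize(y_true, classes=[0, 1, 2])
for i, name in enumerate(class_names):
    print(f"{name} AUC: {roc_auc_score(y_true_bin[:, i], y_prob[:, i]):.3f}")
```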
Key Findings
The study's findings revealed that larger networks did not consistently outperform smaller ones. In fact, MobileNetV2, the smallest network, achieved the highest macro-averaged F1 score, particularly when trained with smaller (256x256 pixel) input images. EfficientNetB0 likewise performed on par with its larger variants, and ResNet34 outperformed ResNet50, indicating that model size was not directly correlated with performance on this dataset, likely due to the overfitting observed in the larger models. Increasing image size did not uniformly improve results; in some cases it even led to overfitting. The best-performing configuration (MobileNetV2 with 256x256 pixel input) achieved an overall accuracy of 0.91 after 5-fold cross-validation, with mean F1 scores of 0.95, 0.90, and 0.85 for NT, VT, and NC, respectively. The lower F1 score for NC was attributed to ambiguous images containing both VT and NC; removing these images substantially improved performance (overall accuracy of 0.96 and mean recalls of 0.95, 0.98, and 0.93 for NT, VT, and NC, respectively). The ROC analysis showed consistently high AUC values across all folds for all classes, suggesting good generalizability.
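As a quick consistency check, the per-class scores above imply a macro-averaged F1 of (0.95 + 0.90 + 0.85) / 3 = 0.90 for the best configuration, roughly in line with the reported overall accuracy of 0.91.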
Discussion
The results challenge the assumption that larger, more complex networks are always superior for image classification. This study demonstrates that appropriately sized networks, trained on suitably resized images, can achieve comparable or even superior performance with fewer computational resources. The overfitting observed in the larger models underscores the importance of matching model capacity to dataset size. The marked improvement in performance after removing ambiguous images emphasizes the necessity of high-quality, consistently annotated data for training deep learning models. The consistently high performance across the 5-fold cross-validation and ROC analyses indicates that the findings are robust and generalizable.
Conclusion
This research evaluated various deep learning architectures for osteosarcoma image classification. Smaller, more efficient networks such as MobileNetV2 and EfficientNetB0 demonstrated superior performance on a downsampled dataset. Five-fold cross-validation confirmed the robustness of MobileNetV2, which achieved high accuracy and recall. Future work should focus on optimizing hyperparameters, using larger and more diverse datasets, and exploring more aggressive regularization techniques to further improve model performance and generalizability.
Limitations
The study utilized a relatively small dataset, potentially limiting the generalizability of the findings. The hyperparameter selection was not exhaustive, and optimization might yield better results for some architectures. The dataset primarily contained images of single tissue types, with limited samples showing mixed tissue categories, which could impact the model's performance in real-world scenarios with more complex images. Future research with larger, more diverse datasets representing various tumor types, stages, and demographics is needed to validate and extend these results.