Deep Learning Approaches to Osteosarcoma Diagnosis and Classification: A Comparative Methodological Approach
I. A. Vezakis, G. I. Lambrou, et al.
Discover how Ioannis A. Vezakis, George I. Lambrou, and George K. Matsopoulos are advancing osteosarcoma diagnosis with deep learning. Their study shows that smaller networks such as MobileNetV2 can outperform larger models, reaching roughly 91% overall accuracy in histopathological tile classification. Dive into the intersection of technology and medicine!
Introduction
Osteosarcoma (OS) is an aggressive primary bone malignancy arising from mesenchymal cells with osteoblastic differentiation. Despite technological advances, histopathology remains the gold standard for staging and treatment decisions. Traditional machine learning approaches (e.g., SVMs, Random Forests) depend on hand-crafted features and struggle with variability and scale. Modern deep learning methods (CNNs, Vision Transformers) learn features directly and have achieved human-level performance in many vision tasks, but their success depends on data quality, dataset size, and architecture design. Whole slide images (WSIs) are extremely large, necessitating analysis via smaller image tiles. This study investigates which deep learning architectures, depths, and input image sizes most effectively classify osteosarcoma histopathology tiles into non-tumor (NT), viable tumor (VT), and necrosis (NC), aiming to identify configurations that improve diagnostic accuracy and efficiency.
Literature Review
Early histopathology image analysis (1990s) used simple image processing (thresholding, edge detection) and later texture-based pattern recognition. Classical ML (SVM, Random Forest) relies on hand-crafted features and is sensitive to feature quality, often underperforming on large, variable datasets. Deep learning, especially CNNs and Vision Transformers, has delivered strong results in microscopy, obviating manual feature engineering and leveraging transfer learning from large datasets like ImageNet. WSIs pose computational challenges due to size, prompting tile-based approaches. Prior osteosarcoma works reported varied strategies and results: hierarchical two-stage classification (non-tumor vs tumor, then VT vs NC) achieved high accuracies; VGG-based approaches reported strong results with different hyperparameters and holdout splits; hybrid methods combining deep features with classical classifiers (e.g., RBF-SVM with feature selection) reported very high accuracy on small test subsets. However, comparisons can be confounded by holdout strategies on limited samples, potential data leakage when patching tiles, and differing pre-processing, architectures, and optimization settings. This context motivates a controlled, comparative evaluation across multiple state-of-the-art networks under a consistent protocol.
Methodology
Dataset: Public Osteosarcoma dataset from UT Southwestern/UT Dallas for Viable and Necrotic Tumor Assessment (TCIA). Forty WSIs (H&E stained) were selected by two pathologists. From each WSI, 30 random 1024×1024 tiles were sampled (1,200 total); after filtering non-tissue, ink, and blurry regions, 1,144 tiles remained and were labeled as Non-Tumor (NT, 536; 47%), Necrosis (NC, 263; 23%), and Viable Tumor (VT, 345; 30%). Notably, 53 of the 263 NC images (20%) contained segments of VT (ambiguous).
Experimental setup: Implemented in PyTorch, trained on a single NVIDIA Titan Xp (12 GB). Transfer learning was used by fine-tuning ImageNet-pretrained weights. Source code: Zenodo (doi:10.5281/zenodo.7765031).
Architectures: EfficientNetB0/B1/B3/B5/B7, MobileNetV2, ResNet18/34/50, VGG16/VGG19 (with a modified classifier in which the original 4096-neuron fully connected layers were replaced by two FC layers of 512 and 1024 neurons feeding a 3-class output), and Vision Transformer ViT-B/16. Model sizes ranged from ~2.2M to ~86M parameters. ViT-B/16 used 224×224 inputs due to pretrained weight constraints.
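As a rough illustration of the transfer-learning setup described above, the sketch below loads ImageNet-pretrained weights from torchvision and adapts the classification heads to the three tissue classes. The exact layer ordering, dropout values, and initialization are assumptions based on this summary, not the authors' released code (see the Zenodo repository for the original implementation).

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # NT, VT, NC


def build_mobilenet_v2(num_classes: int = NUM_CLASSES) -> nn.Module:
    # Load ImageNet-pretrained weights and replace the 1000-class head.
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
    in_features = model.classifier[1].in_features  # 1280 for MobileNetV2
    model.classifier[1] = nn.Linear(in_features, num_classes)
    return model


def build_vgg16(num_classes: int = NUM_CLASSES) -> nn.Module:
    # Replace the original 4096-neuron classifier with two smaller FC layers
    # (512 and 1024 neurons, per the description above) ending in 3 outputs.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    in_features = model.classifier[0].in_features  # 25088 = 512 * 7 * 7
    model.classifier = nn.Sequential(
        nn.Linear(in_features, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(0.5),
        nn.Linear(512, 1024),
        nn.ReLU(inplace=True),
        nn.Dropout(0.5),
        nn.Linear(1024, num_classes),
    )
    return model
```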
Training protocol: 70/30 train/test split kept consistent across models for fair comparison. Optimizer: AdamW (Adam with decoupled weight decay). Initial LR 3e-4 with cosine annealing to 1e-5 over 100 epochs. Default hyperparameters were used to focus on architectural comparison rather than per-model tuning. Batch size depended on input size and GPU memory: at 1024×1024, batch=2; when halving input size, batch size doubled. EfficientNetB7 could not fit at larger sizes and was limited accordingly. Input sizes evaluated: 1024×1024, 896×896 (for some large nets), 512×512, and 256×256. Images were downsampled via bilinear interpolation when applicable.
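A minimal sketch of the stated optimizer and schedule (AdamW at an initial learning rate of 3e-4, cosine-annealed to 1e-5 over 100 epochs); values not reported in this summary, such as weight decay, are left at library defaults and should be treated as assumptions:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

EPOCHS = 100


def train(model: nn.Module, train_loader: DataLoader, device: str = "cuda") -> None:
    # Training loop matching the protocol described above.
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # default weight decay assumed
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=EPOCHS, eta_min=1e-5  # anneal from 3e-4 down to 1e-5
    )
    criterion = nn.CrossEntropyLoss()

    for _ in range(EPOCHS):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()  # one scheduler step per epoch
```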
Preprocessing and augmentation: Per-channel normalization using ImageNet means (R/G/B: 0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225). Data augmentation included random horizontal/vertical flips and random rotations within ±20 degrees.
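The preprocessing and augmentation pipeline can be expressed with torchvision transforms roughly as follows (shown for the 256×256 configuration; the exact transform order is an assumption):

```python
from torchvision import transforms

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

train_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=20),  # random rotation within ±20 degrees
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

test_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```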
Evaluation metrics: Primary comparison via macro and per-class F1 scores using One-vs-Rest (OvR). Additional metrics in follow-up cross-validation: accuracy, specificity, recall, precision; aggregated confusion matrix across folds; ROC and AUC per class.
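For reference, the macro and per-class F1 scores, confusion matrix, and per-class one-vs-rest AUC can be computed with scikit-learn along these lines (a sketch, not the authors' evaluation code):

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize


def summarize(y_true, y_pred, y_prob):
    # y_true: true labels (0=NT, 1=VT, 2=NC); y_pred: predicted labels;
    # y_prob: predicted class probabilities, shape (n_samples, 3).
    per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 per class
    macro_f1 = f1_score(y_true, y_pred, average="macro")    # unweighted mean over classes
    cm = confusion_matrix(y_true, y_pred)                   # rows: true class, cols: predicted
    y_bin = label_binarize(y_true, classes=[0, 1, 2])       # one-vs-rest targets
    per_class_auc = [roc_auc_score(y_bin[:, k], y_prob[:, k]) for k in range(3)]
    return {"per_class_f1": per_class_f1, "macro_f1": macro_f1,
            "confusion_matrix": cm, "per_class_auc": per_class_auc}
```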
Follow-up experiment: The best configuration from the initial comparison was retrained with 5-fold cross-validation (same training settings) to obtain robust performance estimates independent of a single split, and to compute averaged metrics and ROC/AUC across folds. An additional analysis excluded ambiguous images (tiles containing both VT and NC) to assess their impact.
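A hedged outline of the 5-fold cross-validation step is shown below. The stratified tile-level split, and the helpers train_model and evaluate_model (wrapping the training loop and metrics above), are hypothetical; the authors' exact fold construction is in the linked Zenodo repository.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validate(tiles, labels, n_splits=5, seed=42):
    # tiles: image paths; labels: matching integer class labels (0=NT, 1=VT, 2=NC).
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_metrics = []
    for train_idx, test_idx in skf.split(tiles, labels):
        model = build_mobilenet_v2()  # fresh model per fold (see earlier sketch)
        train_model(model, [tiles[i] for i in train_idx])       # hypothetical helper
        fold_metrics.append(
            evaluate_model(model, [tiles[i] for i in test_idx])  # hypothetical helper
        )
    macro_f1s = [m["macro_f1"] for m in fold_metrics]
    return fold_metrics, float(np.mean(macro_f1s)), float(np.std(macro_f1s))
```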
Key Findings
- Larger architectures did not necessarily outperform smaller ones; overfitting was common with many networks achieving near-perfect training accuracy.
- MobileNetV2, the smallest network (~2.2M parameters), with 256×256 input achieved the best macro-averaged F1 among tested configurations in the holdout comparison.
- Input image size increases generally did not improve performance and sometimes worsened it due to overfitting; some EfficientNet variants showed slight gains at larger sizes, consistent with their compound scaling pretraining.
- VGG models initially failed to learn at LR=3e-4 (Adam), converging to suboptimal solutions; with a much lower LR (1e-5), performance became comparable to similarly sized networks (not included in main comparisons to avoid selective tuning).
- Five-fold cross-validation of MobileNetV2 at 256×256 yielded the following per-class (one-vs-rest) means: F1 (NT 0.95±0.02, VT 0.90±0.04, NC 0.85±0.03); accuracy 0.95±0.02; specificity 0.96±0.02–0.03; recall (NT 0.95±0.03, VT 0.93±0.05, NC 0.83±0.05); precision (NT 0.95±0.03, VT 0.88±0.05, NC 0.88±0.05). Overall multi-class accuracy across folds was ~0.91.
- Aggregated confusion matrix over 5 folds (n=1144): NT (TP 510, misclassified as VT 7, NC 19); VT (TP 272, misclassified as NT 3, NC 17); NC (TP 262, misclassified as NT 24, VT 30). Misclassifications often involved NC, which shares visual features with NT and VT; NT and VT were rarely mistaken for each other.
- ROC/AUC across folds showed high discrimination: AUC ~0.98–1.00 (NT), 0.97–0.99 (VT), 0.95–0.97 (NC).
- Removing ambiguous images (tiles containing both VT and NC) improved performance: mean recalls increased to NT 0.95, VT 0.98, NC 0.93; overall accuracy rose to 0.96. Mean F1s: NT 0.96±0.03, VT 0.97±0.02, NC 0.93±0.03.
Discussion
The study directly addressed the question of which deep learning architectures and input sizes best classify osteosarcoma histopathology tiles into NT, VT, and NC under a unified training and evaluation protocol. Results indicate that smaller, efficient architectures like MobileNetV2 can outperform deeper models on limited datasets, and that larger input resolutions do not necessarily yield better classification, likely due to increased overfitting risk and limited training data. The consistent cross-validated performance and high AUCs support the robustness of the selected configuration. Error analysis revealed that NC often shares features with NT and VT, explaining lower recall for NC and suggesting that label ambiguity contributes to misclassifications; curating or explicitly modeling such ambiguity improves performance. Comparisons with prior works suggest that hierarchical classification strategies and hybrid feature pipelines (deep features + classical ML) may further enhance performance, though reported very high accuracies on small holdout sets may reflect optimistic estimates or data leakage risks. Overall, the findings underscore the importance of careful model-size selection, input resolution choice, and rigorous validation to achieve clinically meaningful performance in histopathological OS classification.
Conclusion
This work systematically compared state-of-the-art CNNs and a ViT for osteosarcoma histopathology tile classification under consistent training and evaluation. Smaller, computationally efficient networks (MobileNetV2, EfficientNetB0) performed best when images were downsampled to 256×256, while many larger models overfit. Retraining MobileNetV2 with 5-fold cross-validation confirmed robust performance (overall accuracy ~0.91; per-class F1s NT 0.95, VT 0.90, NC 0.85) and strong AUCs; removing ambiguous tiles further improved accuracy to ~0.96. The study highlights that optimal network and input size selection can outperform merely scaling model capacity. Future work should explore stronger regularization and data-efficient strategies (pretraining on related large datasets, richer augmentations, dropout), hierarchical classification, and segmentation-level labels to handle mixed-tissue tiles, ultimately aiming for generalizable performance on larger, multi-center datasets.
Limitations
- Hyperparameters were largely held constant across models to focus on architectural comparison; some architectures may benefit from tailored optimization and could perform better with careful tuning.
- The dataset had limited ambiguous tiles; excluding mixed VT/NC images improved performance, indicating that real-world scenarios with mixed tissue may reduce accuracy. Pixel-level annotations enabling segmentation could better address such cases.
- Small dataset size and limited patient diversity (images from four patients) constrain generalizability. Differences in tumor types, stages, and demographics across centers may impact performance; larger, multi-center datasets are needed to validate findings.