Medicine and Health

Deep learning-driven diagnosis of multi-type vertebra diseases based on computed tomography images

Y. Wang, Feng, et al.

Yongjie Wang and colleagues present a deep learning-driven diagnostic system that identifies osteoporotic vertebral compression fractures and other vertebral diseases from CT images at the level of individual vertebrae, with the aim of supporting faster and more accurate clinical decisions.

Introduction

The study addresses the challenge of accurately diagnosing osteoporotic vertebral compression fractures (OVCFs) and differentiating them from four other vertebral conditions, namely old fractures (OFs), Schmorl’s node (SN), Kummell’s disease (KD), and signs of previous surgery (PS), using computed tomography (CT) imaging. OVCFs are common and associated with substantial morbidity; they are frequently underdiagnosed and treated late or inadequately, partly because of their insidious onset and the limitations of standard imaging modalities. X-rays are fast but have relatively low sensitivity and specificity, while MRI is the diagnostic gold standard but is costly, time-consuming, and sometimes contraindicated. CT offers higher accuracy than X-ray and greater availability than MRI, but subtle fractures may still be missed. Prior deep learning (DL) studies have primarily performed scan-level OVCF detection without vertebra-level localization and have not attempted refined multi-disease classification. The authors hypothesize that CT images contain sufficient features for DL models to perform vertebra-level, multi-type disease diagnosis, improving clinical decision-making and workflow efficiency.

Literature Review

The paper reviews imaging approaches for OVCF diagnosis, highlighting limitations of X-ray (lower sensitivity/specificity; diaphragm/lung overlap issues) and constraints of MRI (cost, time, contraindications), and positions CT as a practical alternative with improved accuracy and availability. Prior DL works include Tomita et al., who used CNN plus LSTM to classify whole-scan CTs with sensitivity 0.85, specificity 0.96, AUC 0.91, and Kolanu et al., who reported CAD performance with specificity 0.92 and sensitivity 0.54. However, these approaches produced scan-level outputs and did not localize specific vertebrae or differentiate multiple disease types. The paper also notes related research on vertebral body segmentation/localization and broader applications of DL in medical imaging, underscoring a gap for vertebra-level, multi-type classification systems validated on large datasets and external cohorts.

Methodology

Study design and cohorts: Retrospective study approved by the Institutional Review Board of Beijing Luhe Hospital (No. 2023-LHKY-022-02), conducted per the Declaration of Helsinki, with informed consent. Patients with OVCFs who underwent X-ray, CT, and MRI were collected from two centers. Sagittal CT images covered T1–L5. Demographics across cohorts were generally similar (no significant differences in gender or DEXA T-score).
- Luhe Hospital cohort: 1,198 patients from 2015–2020; after exclusions (incomplete data, poor image quality, malignancy, metastasis), 1,051 patients remained.
- Training cohort: 819 patients, 8,548 sagittal CT slices.
- Testing cohort: 232 patients, 2,456 slices.
- External validation (Xuanwu Hospital cohort): 46 patients, 467 slices.

Annotation: Three senior spine surgeons independently annotated and then reconciled labels. For detection, all vertebrae were annotated by bounding boxes labeled “vertebra”. For classification, only diseased vertebrae were annotated by bounding boxes with five categories: OVCF, OF, SN, KD, and PS.

DL system architecture: The pipeline comprised (1) Vertebra Detection Module (VDModule), (2) Vertebra Extraction Module (VEModule), and (3) Vertebra Classification Module (VCModule).
- VDModule: Faster R-CNN with ResNet18 or MobileNet v2 backbones to detect vertebrae and produce bounding boxes. Input images were normalized to 1,024×1,024 via center-crop and zero-padding to avoid deformation (original sagittal CT sizes varied). Offline data augmentation expanded images and boxes eight-fold. Evaluation included precision-recall curves, mean average precision (mAP), vertebra count agreement, and spatial overlap via intersection-over-minimum (IoM), defining hits at IoM > 0.5.
- VEModule: Extracted multiple vertebra patches per detected box using scaling and random translation to augment patch-level data.
- VCModule: Multi-output (multi-label) classification using a pretrained ResNet50 with transfer learning to independently predict OVCF, OF, SN, KD, PS, and normal. To construct normal samples, detected vertebrae that did not overlap with diseased boxes were selected. Class imbalance was addressed by random under-sampling of the majority class (normal) to twice the OVCF count and random over-sampling of minority classes (OF, SN, KD, PS) to half the OVCF count in training only. Data augmentation mitigated overfitting.

Datasets:
- Original training vertebra patches: OVCF 8,046; OF 3,257; SN 824; KD 478; PS 1,979; normal 47,604 (total diseased 14,584; normal 47,604).
- Testing vertebra patches: OVCF 2,536; OF 893; SN 154; KD 204; PS 702; normal 15,122 (diseased 4,489; normal 15,122).
- External validation vertebra patches: OVCF 398; OF 66; SN 23; KD 56; PS 102; normal 2,425 (diseased 645; normal 2,425).

Evaluation metrics: For detection, mAP, vertebra count correlation, FP and FN rates using IoM > 0.5. For classification, one-vs-all confusion matrices and ROC curves, and per-class sensitivity, specificity, PPV (precision), NPV, and F1 score. Statistical analysis used GraphPad 7.0 and IBM SPSS 26 with chi-square tests and ANOVA; significance set at P<0.05.
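The detection hit criterion (IoM > 0.5) can be illustrated with a short sketch. It assumes boxes are given as (x_min, y_min, x_max, y_max) and uses a simple greedy one-to-one matching to count hits, false positives, and false negatives. The summary states only the threshold, so the matching strategy and the function names here are hypothetical and not the authors' implementation.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iom(a: Box, b: Box) -> float:
    """Intersection-over-minimum: overlap area divided by the smaller box area."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0


def count_hits(detected: List[Box], annotated: List[Box],
               threshold: float = 0.5) -> Tuple[int, int, int]:
    """Greedy one-to-one matching (an assumption of this sketch).

    Returns (hits, false_positives, false_negatives), where a detection
    is a hit if its best remaining annotation has IoM > threshold.
    """
    unmatched = list(annotated)
    hits = 0
    for det in detected:
        best = max(unmatched, key=lambda gt: iom(det, gt), default=None)
        if best is not None and iom(det, best) > threshold:
            hits += 1
            unmatched.remove(best)
    return hits, len(detected) - hits, len(unmatched)
```

Unlike intersection-over-union, IoM stays high when a tight detection sits entirely inside a looser annotation, which suits boxes of different sizes drawn around the same vertebra.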

Key Findings

- Vertebra detection (VDModule): Using ResNet18-based Faster R-CNN on the Luhe Hospital test set (3,212 vertebrae), the model achieved mAP 0.9823, AUC 0.982, FP rate 1.52%, FN rate 1.33%, and strong linear correlation between detected and annotated vertebra counts per slice.
- Vertebra classification (VCModule) on Luhe Hospital test set (4,489 diseased; 15,122 normal): Overall high performance for OVCF, OF, KD, and PS with average sensitivity 0.919 and specificity 0.995 (excluding SN). Per-class examples from Table 3:
  - OVCF: sensitivity 0.958, specificity 0.986, F1 0.934
  - OF: sensitivity 0.808, specificity 0.996, F1 0.856
  - KD: sensitivity 0.913, specificity 0.999, F1 0.913
  - PS: sensitivity 0.997, specificity 0.999, F1 0.992
  - SN (lower performance): sensitivity 0.756, specificity 0.995, PPV 0.532, NPV 0.998, F1 0.624
- External validation (Xuanwu Hospital; 645 diseased; 2,425 normal): The VCModule maintained good performance for OVCF, OF, KD, and PS with average sensitivity 0.891 and specificity 0.989 (excluding SN). SN performance was poor: sensitivity 0.213, specificity 1.000, PPV 0.842, NPV 0.994, F1 0.340.
- Training vs testing F1 scores showed no significant differences (t-test, P=0.390), indicating no overfitting for the four well-performing disease classes.
- The system provided vertebra-level localization and diagnosis across five categories, addressing a gap in prior works that focused on scan-level OVCF detection.
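The per-class figures above follow from one-vs-all confusion matrices. The sketch below shows the standard definitions, plus a helper that recovers F1 from PPV and sensitivity. The function names are hypothetical, and the counts in the usage example are invented for illustration; the summary does not report the underlying confusion matrices.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1_from_ppv_sens(ppv: float, sensitivity: float) -> float:
    """F1 is the harmonic mean of PPV (precision) and sensitivity (recall)."""
    return _ratio(2 * ppv * sensitivity, ppv + sensitivity)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """One-vs-all metrics for a single class from its confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    ppv = _ratio(tp, tp + fp)           # positive predictive value
    npv = _ratio(tn, tn + fn)           # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1_from_ppv_sens(ppv, sensitivity),
    }


# Illustrative counts only (not from the study).
example = binary_metrics(tp=90, fp=10, fn=10, tn=890)

# Consistency check against the reported SN values (PPV and sensitivity).
sn_internal_f1 = f1_from_ppv_sens(0.532, 0.756)
sn_external_f1 = f1_from_ppv_sens(0.842, 0.213)
```

Plugging in the reported SN values reproduces the reported F1 scores (0.624 internal, 0.340 external) to within rounding of the inputs. It also shows why the external F1 is so low despite a PPV of 0.842: the harmonic mean is dominated by the weak sensitivity.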

Discussion

The findings support the hypothesis that CT images contain sufficient discriminative features for deep learning models to perform vertebra-level, multi-type disease diagnosis. The detection-then-classification pipeline achieved accurate detection of vertebrae and high classification performance for OVCF, OF, KD, and PS across internal and external datasets, surpassing reported sensitivities in prior OVCF-only studies and adding fine-grained, vertebra-level outputs that are clinically actionable. The system’s robustness on an external cohort with different scanners/parameters suggests generalizability for these four classes; SN was the exception, with sensitivity falling from 0.756 internally to 0.213 externally. Compared with X-ray–based approaches that can be compromised by anatomical overlap, the CT-based DL system demonstrates high sensitivity and specificity and practical feasibility. The 2D approach proved effective and computationally efficient, though 3D models might further improve tasks relying on spatial context. Clinically, the system could be integrated into workflows to triage and highlight high-risk vertebrae in real time, potentially improving efficiency and consistency. Continuous learning through expert review of difficult cases could further enhance performance.

Conclusion

The study introduces a CT-based, deep learning system capable of vertebra-level, multi-type diagnosis, accurately identifying OVCF, OF, KD, and PS, with external validation demonstrating generalizability. This refined diagnostic capability can facilitate faster, more accurate clinical decision-making. Future work should expand multicenter datasets, especially for SN, explore advanced architectures (including 3D models), incorporate vertebra identification/positioning and patient-level aggregation, and directly benchmark against expert radiologists to quantify clinical utility.

Limitations

- Class imbalance: Notably fewer SN and KD cases compared with OVCF, OF, PS, and normal; addressed through oversampling/undersampling, which may introduce bias.
- Output granularity: Current system operates at vertebra- and slice-level rather than patient-level; it does not determine vertebral position or provide holistic spine assessment, limiting tasks such as overall spine health evaluation or refracture risk prediction.
- SN performance: Sensitivity and PPV for SN were poor, likely due to small sample size and feature similarity to OVCF/OF; improved datasets and architectures are needed.
- Use of 2D models: While effective and efficient, 3D spatial features might benefit certain assessments (e.g., morphology), suggesting potential gains from 3D DL approaches.
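To make the class-imbalance point concrete, the resampling rule from the Methodology (under-sample normal to twice the OVCF count, over-sample OF, SN, KD, and PS to half the OVCF count, training set only) can be sketched as follows. This is a minimal illustration under assumptions: the function name, the fixed seed, and the choice to leave a minority class untouched if it already exceeds the target are all hypothetical, and the patches are stand-in integers rather than images.

```python
import random
from typing import Dict, List


def rebalance(patches: Dict[str, List[int]], anchor: str = "OVCF",
              majority: str = "normal", seed: int = 0) -> Dict[str, List[int]]:
    """Resample per-class patch lists relative to the anchor class size.

    - majority class: random under-sampling to 2x the anchor count
    - other classes: random over-sampling (with replacement) to 0.5x the anchor count
    """
    rng = random.Random(seed)
    n_anchor = len(patches[anchor])
    balanced: Dict[str, List[int]] = {}
    for label, items in patches.items():
        if label == anchor:
            balanced[label] = list(items)
        elif label == majority:
            target = min(len(items), 2 * n_anchor)
            balanced[label] = rng.sample(items, target)
        else:
            target = n_anchor // 2
            shortfall = max(0, target - len(items))
            # Assumption: classes already at or above the target are kept as-is.
            extra = rng.choices(items, k=shortfall) if items else []
            balanced[label] = list(items) + extra
    return balanced


# Original training patch counts as listed in the Methodology section.
counts = {"OVCF": 8046, "OF": 3257, "SN": 824, "KD": 478, "PS": 1979, "normal": 47604}
training = {label: list(range(n)) for label, n in counts.items()}
resampled_sizes = {label: len(items) for label, items in rebalance(training).items()}
```

Under these assumptions, KD grows from 478 to 4,023 patches, so each original KD patch appears more than eight times on average, while roughly two thirds of the normal patches are discarded. Heavy duplication of a small class is one way such resampling can introduce the bias noted above.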
