Deep learning in image-based breast and cervical cancer detection: a systematic review and meta-analysis

Medicine and Health

P. Xue, J. Wang, et al.

This systematic review and meta-analysis evaluates the diagnostic accuracy of deep learning algorithms for the early detection of breast and cervical cancers, reporting a pooled sensitivity of 88% and specificity of 84%. The authors emphasize the need for standardized guidelines to ensure the reliability of these algorithms before clinical adoption.

Introduction
Breast and cervical cancers are leading causes of cancer morbidity and mortality globally, disproportionately affecting women in low- and middle-income countries where access to early diagnosis and expert interpretation of imaging is limited. Imaging modalities such as mammography, ultrasound, cytology, and colposcopy are central to early detection but are subject to variability and resource constraints. Deep learning (DL) offers potential to automate and standardize image interpretation. This study systematically evaluates the diagnostic accuracy of DL algorithms for early detection of breast and cervical cancer from medical images and examines performance across subgroups by cancer type, validation strategy, imaging modality, and versus human clinicians.
Literature Review
Prior systematic reviews/meta-analyses on DL in medical imaging exist but are heterogeneous in scope. Liu et al. reported DL performance comparable to healthcare professionals in limited domains (notably breast and dermatology), highlighting generalizability issues. Aggarwal et al. found high diagnostic performance but substantial heterogeneity due to methodological variation, urging caution and better AI guidelines. Zheng et al. showed DL can match or exceed clinicians for detecting tumor metastasis in radiology, yet noted methodological deficiencies. Few reviews specifically target breast and cervical cancer imaging, underscoring the need for focused synthesis of DL diagnostic accuracy in these areas.
Methodology
Protocol registered in PROSPERO (CRD42021252379). Conducted per PRISMA guidelines. Databases searched: MEDLINE, Embase, IEEE, and Cochrane Library through April 2021 with no regional, language, or publication type restrictions; letters, scientific reports, conference abstracts, and narrative reviews were excluded. Two investigators independently screened and extracted data using a predefined sheet; disagreements resolved by a third reviewer. Extracted diagnostic accuracy data included true positives, false positives, true negatives, and false negatives to build contingency tables; multiple tables per study/algorithm were treated as independent. Risk of bias and applicability were assessed using QUADAS-2. Statistical analysis employed hierarchical summary receiver operating characteristic (HSROC) models to estimate pooled sensitivity, specificity, and AUC with 95% CIs and prediction regions. Heterogeneity was quantified with I². Subgroup meta-analyses and meta-regressions explored sources of heterogeneity by: (1) validation type (internal vs external), (2) cancer type (breast vs cervical), (3) imaging modality (mammography, ultrasound, cytology, colposcopy), and (4) DL algorithms vs human clinicians using the same dataset. Random-effects models were used. Publication bias was assessed visually by funnel plots. Meta-analysis was performed when at least three studies were available. Software: STATA 15.1 and R 4.0; two-sided tests, p < 0.05.
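The extracted contingency tables above (TP, FP, FN, TN) are the raw material for the pooled estimates. As a minimal illustration of the per-table metrics only (the review's pooling uses an HSROC model, not this; the function names and numbers here are our own):

```python
# Per-table diagnostic accuracy from a 2x2 contingency table.
# Illustrative sketch only: the review's pooled estimates come from a
# hierarchical summary ROC (HSROC) model, not from these raw ratios.

def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical contingency table for one DL algorithm on one test set.
tp, fp, fn, tn = 88, 16, 12, 84
print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # prints: sensitivity = 0.88
print(f"specificity = {specificity(tn, fp):.2f}")  # prints: specificity = 0.84
```

Because each included study may contribute several such tables (one per algorithm or test set), the review treats multiple tables per study as independent units in the pooled analysis.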
Key Findings
- Study selection: 2252 records identified (MEDLINE 518, Embase 1141, IEEE 576, Cochrane 17); 224 duplicates removed; 2028 screened; 71 full texts assessed; 35 studies included in qualitative synthesis; 20 studies (those with sufficient data for contingency tables) included in meta-analysis.
- Overall pooled performance (all DL algorithms; 20 studies): sensitivity 88% (95% CI 85–90%), specificity 84% (95% CI 79–87%), AUC 0.92 (95% CI 0.90–0.94). Considering only the highest-performing algorithm per study: sensitivity 88% (86–92%), specificity 85% (79–90%), AUC 0.93 (0.91–0.95).
- Validation type: internal validation (15 studies, 40 tables): sensitivity 89% (87–91%), specificity 83% (78–86%), AUC 0.93 (0.91–0.95); external validation (8 studies, 15 tables): sensitivity 83% (77–88%), specificity 85% (73–92%), AUC 0.90 (0.87–0.92).
- Cancer type: breast cancer (10 studies, 36 tables): sensitivity 90% (87–92%), specificity 86% (80–89%), AUC 0.94 (0.91–0.96); cervical cancer (10 studies, 19 tables): sensitivity 83% (77–88%), specificity 80% (70–78%), AUC 0.89 (0.86–0.91).
- Imaging modality: mammography (4 studies, 15 tables): sensitivity 87% (82–91%), specificity 88% (79–93%), AUC 0.93 (0.91–0.95); ultrasound (4 studies, 11 tables): sensitivity 91% (89–93%), specificity 85% (80–89%), AUC 0.95 (0.93–0.96); cytology (4 studies, 6 tables): sensitivity 86% (68–95%), AUC 0.91 (0.88–0.93); colposcopy (4 studies, 11 tables): sensitivity 78% (69–84%), specificity 76% (63–87%), AUC 0.84 (0.81–0.87).
- DL vs human clinicians (11 studies): DL sensitivity 87% (84–90%), specificity 83% (76–89%), AUC 0.92 (0.89–0.94); human clinicians on the same datasets: specificity 82% (72–86%), AUC 0.92 (0.89–0.94).
- Heterogeneity: very high across studies (sensitivity I² = 97.65%; specificity I² = 99.90%); subgroup I² values also remained high, indicating heterogeneity not explained by the examined covariates. Funnel plots suggested no clear publication bias but wide dispersion.
- Study characteristics: 33 studies retrospective, 2 prospective; only 2 pre-specified sample sizes; 11 studies reported external validation; 12 compared DL to clinicians; modalities included mammography, ultrasound, cytology, colposcopy, and MRI.
- Quality assessment (QUADAS-2): patient selection at high/unclear risk in 13 studies (often due to unreported or improper inclusion/exclusion criteria); index test mostly low risk, with one study high/unclear due to no predefined threshold; reference standard high/unclear in 3 studies (inconsistencies, unclear thresholds); timing high/unclear in 5 studies (unclear time intervals). Applicability concerns: 12 studies high/unclear for patient selection; one unclear for index test.
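The I² statistic reported above (97.65% for sensitivity, 99.90% for specificity) expresses the share of total variability across studies attributable to between-study heterogeneity rather than chance. A minimal sketch of how I² is derived from Cochran's Q, using made-up study effects and variances (the function name and numbers are ours, not from the review):

```python
# I^2 from Cochran's Q: I^2 = max(0, (Q - df) / Q) * 100, where Q is the
# weighted sum of squared deviations of study effects from the pooled
# (inverse-variance weighted) effect, and df = number of studies - 1.

def i_squared(effects, variances):
    """Percentage of total variation due to between-study heterogeneity."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Hypothetical logit-sensitivities and their variances for five studies.
effects = [2.1, 1.4, 2.6, 1.1, 2.9]
variances = [0.01, 0.02, 0.01, 0.02, 0.01]
print(f"I^2 = {i_squared(effects, variances):.1f}%")  # prints: I^2 = 97.5%
```

Values above roughly 75% are conventionally read as high heterogeneity, which is why the review cautions that its pooled estimates mask substantial between-study variation.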
Discussion
The meta-analysis indicates that DL algorithms achieve diagnostically acceptable performance for early detection of breast and cervical cancers from medical imaging and are broadly comparable to human clinicians on the same datasets. However, the evidence base is limited by methodological weaknesses that likely inflate estimates, notably reliance on internal validation, small or retrospective datasets, inconsistent definitions (e.g., of “validation”), and poor reporting standards. High heterogeneity persisted across subgroups, suggesting that differences in validation approach, cancer type, or imaging modality do not fully explain variability; other unmeasured factors (dataset composition, preprocessing, thresholds, reader expertise) may contribute. The findings support DL as a potential tool to augment diagnostic capacity, particularly in settings with limited specialist availability, but emphasize the need for robust external validation, multicenter prospective studies, standardized image acquisition and preprocessing, clearer reporting (aligned with STARD/TRIPOD and AI-specific CONSORT-AI/SPIRIT-AI), and integration strategies that combine DL outputs with clinician expertise. Interpretability (e.g., saliency/heatmaps) and workflow integration are critical to clinical adoption. Data sharing consortia and standardized pipelines could improve generalizability and reduce bias.
Conclusion
DL algorithms show promise for detecting breast and cervical cancer using medical imaging, with pooled sensitivity and specificity comparable to clinicians. Nonetheless, the current evidence is constrained by poor study designs, limited external validation, and high heterogeneity, implying possible overestimation of performance. Future work should prioritize prospective, multicenter, externally validated studies; standardized methodologies and reporting; diverse, representative datasets to mitigate bias; interpretability and clinician-in-the-loop designs; and evaluation of real-world workflow integration. Such efforts are needed before widespread clinical adoption.
Limitations
- Extreme between-study heterogeneity (I² > 97% for sensitivity and > 99% for specificity), not resolved by subgroup analyses.
- Predominantly retrospective designs; few prospective studies and limited external validation (11 studies).
- Small numbers within modality-specific subgroups (e.g., cytology, colposcopy), limiting precision and generalizability.
- Potential reporting and selection biases, including a predominance of positive findings and inconsistent use of performance metrics.
- Inconsistent terminology and use of "validation" across studies; thresholds often not predefined or clearly reported.
- Inclusion restricted to studies with a histopathology reference standard, possibly excluding relevant DL studies without such confirmation.
- Limited information on image acquisition standardization and pre-analytical variability, affecting reproducibility and generalizability.