Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams



Y. Shen, F. E. Shamout, et al.

Discover how a groundbreaking AI system developed by Yiqiu Shen and colleagues has achieved radiologist-level accuracy in detecting breast cancer through ultrasound images, significantly reducing false positives and unnecessary biopsies while enhancing diagnostic sensitivity.

Introduction
Breast cancer is the most frequently diagnosed cancer among women, and early detection improves outcomes. Mammography, while standard, has reduced sensitivity in dense breasts and limited accessibility in some settings. Ultrasound (US) is widely used as a complementary or primary diagnostic modality, offering advantages such as lower cost, absence of ionizing radiation, and real-time lesion characterization. However, US interpretation suffers from inter-observer variability and high false-positive rates that contribute to unnecessary recalls and biopsies. Prior computer-aided diagnosis (CAD) and deep learning approaches for breast US have often required labor-intensive image- or pixel-level annotations, used small datasets, focused mainly on differentiating benign from malignant lesions in images already known to contain lesions, and offered limited interpretability. The research questions are whether a weakly supervised AI system trained on large-scale breast-level labels can achieve radiologist-level detection of malignancy on US exams, reduce false positives and unnecessary biopsies without compromising sensitivity, generalize across patient subgroups and imaging devices, and provide interpretable outputs via saliency localization.
Literature Review
The paper reviews prior CAD systems that relied on handcrafted features, which struggled to generalize across acquisition protocols and devices. More recent deep learning studies have improved performance but predominantly used small, exhaustively labeled datasets with image- or pixel-level annotations, limiting scalability and diversity. Many works focused only on classifying lesion-containing images, despite most screening exams being negative, and provided limited interpretability, rendering models as black boxes. The authors highlight gaps: lack of large-scale diverse datasets, label efficiency, inclusion of negative exams, and interpretability. They motivate a weakly supervised approach using pathology-derived breast-level labels and saliency maps to address these issues, and emphasize the need to evaluate performance across age, breast density, and device manufacturers, including external validation.
Methodology
Dataset: The NYU Breast Ultrasound Dataset comprised 288,767 exams (5,442,907 images) from 143,203 patients imaged between 2012 and 2019 across 20 sites at NYU Langone Health, including B-mode and color Doppler images. Each exam typically contained 4–40 images (~18–20 per exam on average). Labels were assigned at the breast level using pathology reports from biopsies or surgeries indicating malignant or benign findings within a window of 30 days before to 120 days after the US exam. The cohort included 28,914 exams associated with pathology; among these, 26,843 breasts had benign outcomes and 5,593 had malignant outcomes. Patients were split into training (60%), validation (10%), and internal test (30%) sets with no patient overlap. The internal test set (44,755 exams; 25,003 patients; 79,156 breasts) was further filtered to ensure that cancers were visible on positive exams and that negatives had a cancer-negative biopsy or follow-up US (6–36 months) confirming benign/negative status. An external test set (BUSI) from Cairo, Egypt contained 780 images (437 benign, 210 malignant, 133 negative) from 600 patients to assess generalizability across populations and acquisition protocols.

Model: A deep learning system processes all images from a breast US exam. A convolutional backbone generates saliency maps indicating likely benign and malignant regions per image (weakly supervised localization using breast-level labels only). An attention-based aggregation (gated attention mechanism) weights images and regions by their relevance, combining information across images into a breast-level malignancy probability (see the sketch below). The model outputs interpretable saliency maps and a malignancy score. Training employs weak supervision with breast-level, pathology-derived labels and regularization; multi-task elements used benign labels to aid training. Hyperparameters were tuned on the validation set; 30 candidate models were trained for 50 epochs with varying learning rates and pooling thresholds, and an ensemble of the top 3 models (by validation AUROC) produced the final predictions. Data augmentation included flips, rotations, translations, scaling, and sharpening; images were resized and normalized. Test-time augmentation averaged predictions over multiple augmented variants. Implementation used Python/OpenCV/PyTorch; metrics were computed with scikit-learn.

Reader study: From the internal test set, 663 exams (644 patients; 1,024 breasts) were selected: 73 breasts with pathology-proven cancer, 353 with benign biopsy, and 416 with likely benign assessments plus 1–2 years of follow-up imaging. Ten board-certified breast radiologists (3–30 years of experience; most fellowship-trained) interpreted the US images and assigned BI-RADS categories with limited clinical context (age, annotations, technologist notes). The AI had no clinical context. ROC and PR curves were generated by comparing radiologist BI-RADS to ground truth; BI-RADS ≥4 was considered positive. A hybrid model combined each reader's BI-RADS with the AI score (weighted average) to evaluate potential human-AI collaboration.

Statistical analysis: Primary metrics included AUROC and AUPRC for probabilistic predictions; sensitivity, specificity, biopsy rate, PPV, and NPV were computed at matched operating points (e.g., AI matched to the average radiologist's sensitivity or specificity). Confidence intervals were estimated via bootstrapping.
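The paper does not reproduce architecture code, but the attention-based aggregation described above can be pictured with a short PyTorch sketch. The version below is a minimal illustration assuming a ResNet-18 backbone, arbitrary layer sizes, and a single malignancy output; the authors' actual network additionally produces benign/malignant saliency maps and differs in detail.

```python
# Minimal sketch of gated-attention aggregation over per-image features.
# The ResNet-18 backbone and layer sizes are illustrative assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn
import torchvision.models as models


class GatedAttentionAggregator(nn.Module):
    def __init__(self, feat_dim: int = 512, attn_dim: int = 128):
        super().__init__()
        # Convolutional backbone shared across all images of an exam.
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # -> (N, 512, 1, 1)
        # Gated attention: a_k proportional to w^T (tanh(V h_k) * sigmoid(U h_k))
        self.V = nn.Linear(feat_dim, attn_dim)
        self.U = nn.Linear(feat_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)
        # Breast-level malignancy head trained with weak (breast-level) labels.
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, images: torch.Tensor):
        # images: (num_images_in_exam, 3, H, W)
        h = self.encoder(images).flatten(1)                              # (N, feat_dim)
        attn = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (N, 1)
        a = torch.softmax(attn, dim=0)                                   # relevance per image
        z = (a * h).sum(dim=0)                                           # exam-level representation
        prob = torch.sigmoid(self.classifier(z))                         # breast-level malignancy score
        return prob, a.squeeze(-1)
```

In use, all images of one exam are stacked into a single tensor and passed through the module, which returns both the breast-level probability and the per-image attention weights that indicate which images drove the prediction.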
Subgroup analyses stratified performance by age, mammographic density, and US device manufacturer; additional analyses focused on the biopsied population and malignant lesion subtypes. Qualitative assessment of saliency maps illustrated agreement and disagreement cases between AI and radiologists.
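To make the operating-point comparisons concrete, the sketch below illustrates matching the AI threshold to a target sensitivity and computing percentile-bootstrap confidence intervals, using scikit-learn as the paper does for metrics. The helper names, the nearest-point matching rule, and the number of bootstrap replicates are assumptions of this example, not details taken from the paper.

```python
# Illustrative sketch of matched operating points and bootstrap CIs.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def specificity_at_matched_sensitivity(y_true, y_score, target_sensitivity):
    """Pick the threshold whose sensitivity is closest to the target; return specificity and threshold."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = np.argmin(np.abs(tpr - target_sensitivity))
    return 1.0 - fpr[idx], thresholds[idx]


def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC over resampled breasts."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():  # resample must contain both classes
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```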
Key Findings
Internal test performance: AUROC 0.976 (95% CI: 0.972–0.980) for breast-level cancer detection. High performance was maintained across subgroups: by age (AUROC 0.964–0.981), by mammographic density (e.g., heterogeneously dense AUROC 0.979; extremely dense AUROC 0.964), and across manufacturers (e.g., GE 0.984, Philips 0.979, Siemens 0.974, Toshiba 0.986).

External validation (BUSI): AUROC 0.927.

Reader study (1,024 breasts from 663 exams): The AI achieved AUROC 0.962 (95% CI: 0.943–0.979), exceeding the average of 10 radiologists (AUROC 0.924 ± 0.020; 95% CI: 0.905–0.944), an improvement of 0.038 (95% CI: 0.028–0.052; P < 0.001). Radiologists' average sensitivity was 90.1% (SD 3.5%; 95% CI: 86.4%–93.8%) and average specificity was 80.7% (SD 4.7%; 95% CI: 78.9%–82.6%). At the average radiologist's specificity, the AI achieved a sensitivity of 94.5% (95% CI: 84.0%–100%; +4.4%, P = 0.0278). At the average radiologist's sensitivity, the AI achieved higher specificity (85.6%; 95% CI: 83.9%–88.0%; +4.9%, P < 0.001), a lower biopsy rate (19.8%; 95% CI: 17.9%–22.1%), and higher PPV (32.5%; 95% CI: 26.9%–39.2%) than the average radiologist (biopsy rate 24.3%, PPV 27.1%).

Hybrid human-AI models: Combining the AI with each radiologist reduced average biopsy recommendations from 24.3% to 17.6% (a relative reduction of ≈27.8%) and decreased false positives by 37.3% on average, while maintaining the same sensitivity as the respective readers. Hybrid models improved specificity and PPV for all readers (e.g., PPV increased to around 38–40% in these analyses) and eliminated a large fraction (about two-thirds) of false positives in BI-RADS 4A cases.

Biopsied subgroup: Among biopsied breasts with pathology-confirmed lesions, radiologists' AUROC decreased (mean 0.896), while the AI achieved AUROC 0.941 (95% CI: 0.922–0.968). In this subgroup, the AI showed an absolute improvement of 8.5% in true positives and a 7.5% reduction in false positives versus readers (both P < 0.001).

Triage analysis: Using a low AI threshold, the system achieved NPV 98.6% with specificity 77.7%, suggesting potential to automatically dismiss a substantial proportion of low-risk benign exams, with an estimated miss rate of approximately 1 cancer per 740 negative predictions.
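As a concrete illustration of the hybrid reader-AI combination reported above (a weighted average of each reader's BI-RADS assessment and the AI score), the sketch below uses a hypothetical BI-RADS-to-probability mapping and an equal weighting; the authors' exact mapping and weights are not reproduced here.

```python
# Sketch of a hybrid reader-AI score as a weighted average.
# The BI-RADS-to-score mapping and the 0.5 weight are illustrative assumptions.
import numpy as np

# Hypothetical mapping from BI-RADS category to a rough probability of malignancy.
BIRADS_TO_SCORE = {1: 0.0, 2: 0.05, 3: 0.1, "4A": 0.3, "4B": 0.5, "4C": 0.8, 5: 0.95}


def hybrid_score(birads, ai_probability, weight=0.5):
    """Blend a reader's BI-RADS assessment with the AI malignancy probability."""
    reader_score = BIRADS_TO_SCORE[birads]
    return weight * reader_score + (1.0 - weight) * ai_probability


def biopsy_rate(scores, threshold):
    """Fraction of breasts whose hybrid score exceeds the biopsy threshold."""
    scores = np.asarray(scores)
    return float((scores >= threshold).mean())
```

Sweeping the threshold of the blended score is what lets a hybrid model hold each reader's sensitivity fixed while lowering the biopsy rate, as in the results reported above.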
Discussion
The study demonstrates that a weakly supervised AI system trained with large-scale, pathology-derived breast-level labels can match or exceed radiologist-level performance in breast ultrasound interpretation while improving specificity and PPV and reducing unnecessary biopsies. The system generalized across age groups, breast densities, and multiple device manufacturers and retained strong performance on an external dataset, addressing concerns about overfitting to a single institution or protocol. Interpretable saliency maps provide lesion localization cues, potentially increasing clinician trust and enabling qualitative error analysis. The human-AI hybrid approach yielded consistent improvements for all ten radiologists, significantly reducing false positives and biopsy recommendations without sacrificing sensitivity, indicating that AI can complement radiologist judgment. Beyond decision support, the AI’s high-NPV triage suggests workflows where low-risk exams are auto-dismissed and higher-risk cases prioritized, potentially improving efficiency. These findings address the core problem of high false-positive rates in breast US by providing accurate, consistent, and interpretable assistance, with potential to reduce benign biopsies and associated patient anxiety and costs. Prospective studies are needed to validate real-world impact and integration into clinical practice.
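The high-NPV triage workflow mentioned above amounts to choosing a low decision threshold that preserves a very high negative predictive value and auto-dismissing exams scored below it. The sketch below shows one way such a threshold could be selected on a validation set; the NPV target mirrors the reported 98.6% figure, and the search procedure itself is an assumption of this example rather than the authors' method.

```python
# Illustrative threshold search for a high-NPV triage operating point.
import numpy as np


def lowest_risk_threshold(y_true, y_score, min_npv=0.986):
    """Return the largest threshold whose NPV among dismissed exams meets the target."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    best = None
    for t in np.unique(y_score):          # candidate thresholds in increasing order
        dismissed = y_score < t
        if dismissed.sum() == 0:
            continue
        npv = (y_true[dismissed] == 0).mean()
        if npv >= min_npv:
            best = t                      # keep the largest threshold that still meets the NPV target
    return best
```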
Conclusion
This work presents a radiologist-level AI system for breast ultrasound that uses weakly supervised learning to produce breast-level malignancy predictions with interpretable saliency maps from large-scale clinical data. The AI achieved AUROC 0.976 on an internal test set and outperformed the average of ten expert radiologists in a reader study, while reducing false positives and biopsy recommendations. Human-AI hybrid models further improved diagnostic metrics, highlighting the value of collaborative decision-making. Future research should validate the system prospectively, integrate multimodal imaging (e.g., mammography, MRI) and longitudinal priors, incorporate patient risk factors, and refine interpretability analyses. Successfully deployed, such systems could improve accuracy, consistency, and efficiency in breast imaging care.
Limitations
- Modality scope: The AI uses ultrasound only, whereas clinical practice is often multimodal with mammography and sometimes MRI; integrating multimodal data could further improve performance.
- Lack of prior imaging and full clinical context: The AI evaluated single exams without longitudinal comparisons or patient risk factors (e.g., family history, BRCA status). Readers in the study also had limited clinical context relative to typical practice.
- Reader study design: Radiologists did not have access to full records or mammography that commonly accompany screening ultrasound; this may not fully reflect clinical decision-making.
- Qualitative analyses were limited to a small set of cases; more systematic interpretability comparisons are needed.
- The external validation dataset was relatively small, from a single device/system, and provided one image per lesion, differing from typical multi-view clinical acquisitions; this likely contributed to the performance gap vs internal testing.
- Some reported results include variability and potentially noisy values due to the retrospective design; prospective validation is required for definitive assessment.