
Segment anything in medical images
J. Ma, Y. He, et al.
MedSAM, developed by Jun Ma and colleagues, is a foundation model for universal medical image segmentation. Trained on more than one million image–mask pairs, it delivers accurate and robust segmentation across diverse imaging modalities and clinical tasks, supporting downstream diagnosis and treatment planning.
Introduction
Segmentation of anatomical structures and lesions in medical images is essential for diagnosis, treatment planning, and disease monitoring, yet manual delineation is time-consuming and expertise-intensive. Deep learning has advanced automatic and semi-automatic segmentation, but most models are trained for specific tasks and modalities, exhibiting poor generalization to unseen tasks, targets, or imaging domains. In contrast, recent natural image segmentation foundation models (e.g., SAM) show strong cross-task versatility.
The clinical need motivates a universal medical segmentation model trained once and applicable broadly across modalities and targets. However, direct application of natural-image foundation models to medical imaging is limited by domain differences and the promptable nature of SAM, which can struggle with weak boundaries and low-contrast targets common in medical data. This work introduces MedSAM, a promptable foundation model tailored to medical imaging, trained on a large-scale, diverse dataset and designed to deliver robust, generalizable segmentation across imaging modalities and clinical tasks.
Literature Review
Prior work on medical image segmentation includes task-specific deep learning models (e.g., U-Net variants and DeepLabV3+) that perform well within their trained domains but generalize poorly to new tasks or modalities. Interactive segmentation methods and promptable approaches have been explored, yet most are limited in scope and modality coverage.
Foundation models for natural images, notably the Segment Anything Model (SAM) and related approaches, have demonstrated broad applicability. Concurrent assessments of SAM on medical tasks reported satisfactory performance primarily for targets with clear boundaries, while revealing significant limitations for typical medical targets with weak boundaries or low contrast. These insights underscore the need for adapting and fine-tuning such foundation models to medical imaging and for assembling large, diverse medical datasets to support universal segmentation.
Methodology
Dataset curation and preprocessing: The authors assembled a large-scale dataset of 1,570,263 image–mask pairs spanning 10 imaging modalities (including CT, MRI, endoscopy, ultrasound, X-ray/CXR, pathology, dermoscopy, mammography, OCT, fundus) and over 30 cancer types, from public sources (e.g., TCIA, Kaggle, Grand-Challenge, MICCAI challenges). 3D CT/MR images were converted to NIfTI; grayscale and RGB images to PNG. Quality control excluded incomplete images, inaccurate annotations, tiny volumes, and branching-structure targets. Intensity normalization standardized modality-specific ranges: CT windowing (soft tissue W:400/L:40; lung W:1500/L:-160; brain W:80/L:40) and rescale to [0,255]; MR/X-ray/ultrasound/mammography/OCT clipped to 0.5–99.5 percentiles then rescaled to [0,255]; RGB rescaled with min–max if needed. All images were resized to 1024×1024×3. For pathology whole-slide images, non-overlapping patches were extracted and padded as needed. 3D CT/MR were processed slice-wise with channel repetition to 3; bi-cubic interpolation for images and nearest-neighbor for masks.
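The modality-specific normalization described above can be sketched as follows. This is a minimal NumPy illustration of the stated rules; the function names (`window_ct`, `percentile_normalize`) are my own, not from the paper's codebase.

```python
import numpy as np

def window_ct(img, level, width):
    """Apply a CT window (center L, width W) and rescale to [0, 255]."""
    lo, hi = level - width / 2.0, level + width / 2.0
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / (hi - lo) * 255.0

def percentile_normalize(img, p_lo=0.5, p_hi=99.5):
    """Clip to the 0.5-99.5 intensity percentiles (as described for MR,
    X-ray, ultrasound, mammography, and OCT) and rescale to [0, 255]."""
    lo, hi = np.percentile(img, [p_lo, p_hi])
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / max(hi - lo, 1e-8) * 255.0

# Soft-tissue CT window from the paper: W=400, L=40
ct_slice = np.random.randint(-1000, 1000, size=(512, 512))
normalized = window_ct(ct_slice, level=40, width=400)
```

After normalization, images would be resized to 1024×1024 (bicubic for images, nearest-neighbor for masks) and repeated to three channels for grayscale inputs.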
Model architecture: MedSAM follows SAM’s promptable architecture with an image encoder (ViT-Base, 12 transformer layers, 16×16 patch size, producing 64×64 embeddings), a prompt encoder mapping bounding-box corner points to 256-D embeddings via positional encodings, and a lightweight mask decoder (two transformer layers and two transposed convolutions upsampling to 256×256, sigmoid output and bilinear upsampling to full resolution). Bounding boxes were chosen as user prompts due to clearer spatial context than points, and efficiency in multi-object settings.
Training protocol: Data were split 80/10/10 into training/tuning/validation at scan or video level to avoid leakage, with slide-level separation for pathology. External validation datasets and/or unseen targets were held out entirely. The model was initialized from SAM ViT-Base; the prompt encoder was frozen, while the image encoder (89,670,912 params) and mask decoder (4,058,340 params) were trained. Bounding box prompts were simulated from expert masks with 0–20 pixel random perturbations. The loss was the unweighted sum of binary cross-entropy and Dice loss. Optimization used AdamW (β1=0.9, β2=0.999), learning rate 1e-4, weight decay 0.01, global batch size 160, and no data augmentation. Training ran for 150 epochs on 20×A100 80GB GPUs; the last checkpoint was used.
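Two ingredients of this protocol, the simulated box prompts and the unweighted BCE + Dice loss, can be sketched in NumPy as below. The actual training code uses PyTorch tensors and logits; these helper names and the exact clipping are illustrative.

```python
import numpy as np

def simulate_box_prompt(mask, max_shift=20, rng=None):
    """Tight bounding box [x0, y0, x1, y1] around an expert mask, with each
    edge perturbed outward by a random 0-20 pixel offset, mimicking the
    paper's prompt simulation."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    x0 = max(0, xs.min() - rng.integers(0, max_shift + 1))
    y0 = max(0, ys.min() - rng.integers(0, max_shift + 1))
    x1 = min(w - 1, xs.max() + rng.integers(0, max_shift + 1))
    y1 = min(h - 1, ys.max() + rng.integers(0, max_shift + 1))
    return np.array([x0, y0, x1, y1])

def bce_dice_loss(pred, target, eps=1e-6):
    """Unweighted sum of binary cross-entropy and Dice loss, applied to
    predicted probabilities against a binary target mask."""
    pred = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
    dice = 1 - (2 * (pred * target).sum() + eps) / (pred.sum() + target.sum() + eps)
    return bce + dice
```

The unweighted sum means neither term dominates by construction; BCE penalizes per-pixel errors while Dice directly targets overlap, which matters for small structures.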
Baselines and specialist models: MedSAM and SAM were evaluated as single models across all modalities. Modality-wise specialist models included nnU-Net (2D configurations; bounding-box converted to binary mask as an extra channel) and DeepLabV3+ (ResNet-50 encoder, inputs resized to 224×224×3, bounding-box mask as extra channel). Ten specialist models were trained per method (one per modality). Additional task-specific nnU-Net models were trained for four representative tasks (CT liver cancer, MR abdominal organs, ultrasound nerve cancer, endoscopy polyp) to compare internal vs external generalization.
Evaluation: Internal validation covered 86 tasks; external validation included 60 tasks (new datasets or unseen targets). Metrics followed Metrics Reloaded: Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD, tolerance r=2). Statistical comparisons used the Wilcoxon signed-rank test. A scaling study trained MedSAM variants with 10K and 100K images to assess performance vs data size.
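The primary overlap metric can be computed as below; DSC is standard, while NSD additionally requires boundary extraction and surface-distance computation and is omitted from this sketch.

```python
import numpy as np

def dice_coefficient(pred, gt):
    """Dice Similarity Coefficient (DSC) between two binary masks:
    2|A∩B| / (|A| + |B|)."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:  # both masks empty: define as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```

Unlike DSC, NSD (here with tolerance r=2 pixels) credits boundary points that fall within the tolerance band of the reference surface, so it is more sensitive to contour quality than to region size.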
Human annotation study: On an adrenocortical carcinoma CT dataset (unseen during training and validation), two experienced radiologists annotated 733 tumor slices via two pipelines: (1) manual slice-by-slice; (2) sparse linear markers every 3–10 slices (long/short axes), automatic rectangle mask generation/interpolation to create bounding boxes, MedSAM inference, and expert refinement. Time was recorded for initial markers, model inference, and refinement.
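The sparse-marker pipeline implies propagating bounding boxes to the unannotated slices between markers. A hypothetical linear-interpolation sketch is shown below; the paper's exact interpolation scheme may differ.

```python
import numpy as np

def interpolate_boxes(annotated, n_slices):
    """Fill per-slice bounding boxes [x0, y0, x1, y1] for a 3D volume by
    linearly interpolating between sparsely annotated slice indices.
    `annotated` maps slice index -> box; edge slices copy the nearest box."""
    idxs = sorted(annotated)
    boxes = np.zeros((n_slices, 4))
    for z in range(n_slices):
        if z <= idxs[0]:
            boxes[z] = annotated[idxs[0]]
        elif z >= idxs[-1]:
            boxes[z] = annotated[idxs[-1]]
        else:
            # find the bracketing annotated slices and blend linearly
            j = next(i for i in range(len(idxs) - 1) if idxs[i] <= z <= idxs[i + 1])
            z0, z1 = idxs[j], idxs[j + 1]
            t = (z - z0) / (z1 - z0)
            boxes[z] = (1 - t) * np.array(annotated[z0]) + t * np.array(annotated[z1])
    return boxes
```

Each interpolated box then serves as the MedSAM prompt for its slice, with experts refining only the model's output rather than drawing contours from scratch.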
Key Findings
- Internal validation (86 tasks): MedSAM generally outperformed SAM and achieved performance comparable to or better than specialist nnU-Net and DeepLabV3+ models. SAM performed relatively well on some RGB tasks (e.g., endoscopy polyp: median DSC 91.3%, IQR 81.2–95.1%) but poorly overall, especially on targets with weak boundaries. MedSAM showed a narrower DSC distribution across tasks, indicating robustness.
- External validation (60 tasks): MedSAM consistently delivered superior performance on new datasets and unseen targets, whereas the specialist models did not consistently outperform SAM, evidencing their limited generalization. For example, right kidney segmentation in MR T1-weighted images yielded median DSCs of 90.1% (SAM), 85.3% (U-Net), and 86.4% (DeepLabV3+), all of which MedSAM surpassed. For nasopharynx cancer, MedSAM achieved a median DSC of 87.8% (IQR 85.0–91.4%), improving over SAM by 52.3% and over U-Net and DeepLabV3+ by 15.5% and 22.7%, respectively. MedSAM also improved by up to 10% on some unseen modalities (e.g., abdomen T1 in-phase/out-of-phase MR).
- Generalization: On a multiple myeloma plasma cell dataset (distinct modality/task), MedSAM outperformed SAM despite no prior exposure.
- Data scaling: Performance improved with training-set size; the full-scale model (~1.57M pairs) outperformed the 10K- and 100K-image variants on both internal and external validation.
- Annotation efficiency: With MedSAM-assisted pipeline, annotation time for 3D adrenal tumors was reduced by 82.37% and 82.95% for the two experts compared to fully manual annotation.
Discussion
The study addresses the need for a universal, promptable medical image segmentation model capable of handling diverse modalities, anatomies, and pathologies. By fine-tuning SAM on a large, heterogeneous medical dataset and leveraging bounding-box prompts, MedSAM achieves robust accuracy across 86 internal and 60 external tasks, including unseen targets and modalities. The consistent outperformance over SAM and competitive or superior results compared to specialist models demonstrate that a single foundation model can generalize effectively across settings where task-specific models often fail. These improvements enable more reliable computation of quantitative imaging biomarkers (e.g., tumor volumes) and can accelerate clinical workflows, as shown by substantial annotation time savings. The approach also provides a practical paradigm for adapting natural-image foundation models to medical and potentially other biological imaging domains.
Conclusion
MedSAM establishes a practical, universal foundation model for medical image segmentation by combining promptable design with fine-tuning on a large, diverse medical dataset. It delivers accurate, robust performance across modalities and tasks, often surpassing modality-specific and task-specific baselines, and substantially improves annotation efficiency. Future directions include mitigating modality imbalance by incorporating more data from underrepresented modalities, refining capabilities for complex structures such as vessels (potentially with improved prompts or architectures), and extending the paradigm to other biological imaging applications (e.g., cell and organelle segmentation).
Limitations
- Training data imbalance: CT, MRI, and endoscopy dominate the dataset, which may limit performance on underrepresented modalities like mammography.
- Prompt ambiguity for branching structures: Bounding-box prompts can be ambiguous for vessel-like structures (e.g., overlapping arteries and veins in fundus images), hindering precise segmentation.
- Residual specialist advantage: While MedSAM generalizes well, specialist or task-specific models may still excel in narrowly defined scenarios with abundant, homogeneous training data.