The Medical Segmentation Decathlon

M. Antonelli, A. Reinke, et al.

International challenges like the Medical Segmentation Decathlon are reshaping biomedical image analysis. This study shows that performance across diverse tasks predicts the ability to generalize to unseen tasks, paving the way for non-experts to train accurate AI segmentation models.

Introduction

The study addresses whether a single, general-purpose machine learning algorithm can accurately perform medical image segmentation across diverse tasks and modalities without human intervention or task-specific tuning. Medical image segmentation is a critical prerequisite for numerous clinical applications (e.g., radiotherapy planning, treatment response monitoring), but the proliferation of thousands of segmentation algorithms and task-specific challenges makes selecting a baseline architecture difficult. The authors organized the Medical Segmentation Decathlon (MSD), comprising ten heterogeneous datasets across anatomies, modalities, and target regions, to test the hypothesis that methods performing well on multiple tasks will generalize to unseen tasks. To assess generalizability rigorously, the challenge was split into a development phase (seven known tasks) and a mystery phase (three unseen tasks), with strict submission policies to limit overfitting.

Literature Review

The paper situates MSD within the context of biomedical image analysis challenges (e.g., BraTS for brain tumors, LASC for left atrium, LiTS for liver tumors), which have become the standard for benchmarking algorithms on specific tasks. It notes the dominance of semantic segmentation challenges (≈70% of challenges) and the widespread use of architectures like U-Net, with thousands of algorithms published annually. Prior work highlights the importance and limitations of common metrics (e.g., DSC), the need to interpret challenge rankings cautiously, and the prevalence of deep learning methods. The authors reference previous datasets and challenges incorporated into MSD (BraTS 2016/2017, LASC 2013, LiTS 2017) to emphasize continuity and comparability while addressing generalizability across tasks.

Methodology

Study design: An international challenge (MSD) was organized at MICCAI 2018 with two phases designed to measure algorithmic generalizability.

Development phase (known tasks): Seven datasets (brain, heart, hippocampus, liver, lung, pancreas, prostate) with released training labels; participants developed general-purpose algorithms, trained per task without human-in-the-loop adjustments, and submitted predictions on test sets. One submission per day was allowed.
Mystery phase (unseen tasks): Three additional datasets (colon, hepatic vessels, spleen). Only one final submission per algorithm was allowed; teams could train their previously developed methods on the new tasks without changing architecture or hyperparameters.

Data: Ten datasets total, 2,633 images across multiple institutions and modalities; each dataset had 1–3 ROI targets (17 total). Data were de-identified, converted to NIfTI, reoriented to a consistent frame, and intensity scaled for non-quantitative modalities. Splits were approximately two-thirds training and one-third test (except the BraTS brain tumor and LiTS liver datasets, which retained their original splits). All datasets are licensed under CC-BY-SA 4.0, and the training data are available at medicaldecathlon.com; test labels are withheld because the live challenge is ongoing.

Development phase datasets and targets (modality; targets; volumes, training/test):
Brain: mp-MRI; edema, enhancing and non-enhancing tumor; 750 4D volumes, 484/266.
Heart: MRI T1w; left atrium; 30 3D volumes, 20/10.
Hippocampus: MRI T1w; anterior/posterior; 394 3D volumes, 263/131.
Liver: CT portal-venous; liver and liver tumor; 201 3D volumes, 131/70.
Lung: CT; lung tumor; 96 3D volumes, 64/32.
Pancreas: CT; pancreas and pancreatic mass; 420 3D volumes, 281/139.
Prostate: mp-MRI T2, ADC; PZ and TZ; 48 4D volumes, 32/16.

Mystery phase datasets and targets (modality; targets; volumes, training/test):
Colon: CT; colon cancer primaries; 190 3D volumes, 126/64.
Hepatic vessels: CT; hepatic vessels and hepatic tumor; 443 3D volumes, 303/140.
Spleen: CT; spleen; 61 3D volumes, 41/20.

Evaluation metrics: Two 3D semantic segmentation metrics were used:

Dice Similarity Coefficient (DSC) for overlap.
Normalized Surface Dice (NSD) as a distance-based metric with clinical tolerance per task (mm): Brain 5; Heart 4; Hippocampus 1; Liver 7; Lung 2; Prostate 4; Pancreas 5; Colon 4; Hepatic vessel 3; Spleen 3.
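
To make the two metrics concrete, here is a minimal pure-Python sketch on toy masks represented as sets of voxel coordinates. It is not the challenge's released metric code: the function names, the 6-connected surface definition, the handling of empty masks, and the voxel-count form of NSD (rather than a surface-area-weighted one) are all assumptions made for illustration.

```python
from itertools import product
from math import dist

# 6-connected neighborhood used to decide which voxels lie on the surface.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(a, b):
    """DSC between two sets of foreground voxel coordinates."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def surface(mask):
    """Foreground voxels with at least one background 6-neighbor."""
    return {
        (x, y, z)
        for (x, y, z) in mask
        if any((x + dx, y + dy, z + dz) not in mask for dx, dy, dz in NEIGHBORS)
    }


def nsd(a, b, tol_mm, spacing=(1.0, 1.0, 1.0)):
    """Fraction of surface voxels lying within tol_mm of the other surface."""
    pa = [tuple(c * s for c, s in zip(v, spacing)) for v in surface(a)]
    pb = [tuple(c * s for c, s in zip(v, spacing)) for v in surface(b)]
    if not pa and not pb:
        return 1.0  # assumed convention, as for dice()
    if not pa or not pb:
        return 0.0
    near_a = sum(1 for p in pa if min(dist(p, q) for q in pb) <= tol_mm)
    near_b = sum(1 for q in pb if min(dist(q, p) for p in pa) <= tol_mm)
    return (near_a + near_b) / (len(pa) + len(pb))


def cube(lo, hi):
    """Toy mask: all voxels with coordinates in [lo, hi) on each axis."""
    return set(product(range(lo, hi), repeat=3))


reference = cube(0, 4)
prediction = cube(1, 5)  # same cube shifted by one voxel along each axis
print(dice(reference, prediction))             # 0.421875
print(nsd(reference, prediction, tol_mm=2.0))  # 1.0
```

The toy case shows why the two metrics complement each other: a one-voxel shift of a small object costs more than half of the overlap score, while every surface point still falls within a 2 mm tolerance.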

Ranking methodology: A significance ranking similar to BraTS was pre-specified and released before the challenge. For each task/ROI and metric, per-case performance was computed; Wilcoxon signed-rank pairwise tests between algorithms determined which algorithms were significantly worse (p≤0.05, unadjusted). For each algorithm and ROI, the significance score equaled the number of competitors significantly worse; higher scores yield better ranks. For tasks with multiple ROIs, ranks were averaged per task; final phase score was the average across tasks. Bootstrapping (1,000 samples) assessed rank stability using Kendall’s τ correlation via the challengeR toolkit (R, v1.0.2). Code and metric implementations were made publicly available.
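
The significance ranking described above can be sketched as follows. This is an illustrative reimplementation, not the organizers' code or the challengeR toolkit: the one-sided alternative, the normal approximation to the Wilcoxon signed-rank statistic (without tie or continuity correction), and all function names are assumptions, and the per-case values are made up.

```python
from math import sqrt
from statistics import NormalDist, mean


def wilcoxon_greater_p(a, b):
    """Approximate one-sided p-value that paired scores in a exceed those in b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank for ties
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 1.0 - NormalDist().cdf((w_plus - mu) / sigma)


def significance_scores(per_case, alpha=0.05):
    """Per algorithm: number of competitors that are significantly worse."""
    return {
        name: sum(
            1
            for other, other_vals in per_case.items()
            if other != name and wilcoxon_greater_p(vals, other_vals) <= alpha
        )
        for name, vals in per_case.items()
    }


def ranks_from_scores(scores):
    """Rank 1 goes to the highest score; tied scores share the better rank."""
    return {
        name: 1 + sum(1 for s in scores.values() if s > score)
        for name, score in scores.items()
    }


def phase_score(roi_ranks_by_task):
    """Average ROI ranks within each task, then average across tasks."""
    return mean(mean(ranks) for ranks in roi_ranks_by_task.values())


# Made-up per-case DSC values for three hypothetical algorithms on one ROI.
base = [0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89]
per_case = {
    "A": base,
    "B": [v - 0.05 for v in base],
    "C": [v - 0.20 for v in base],
}
scores = significance_scores(per_case)
print(scores)                     # {'A': 2, 'B': 1, 'C': 0}
print(ranks_from_scores(scores))  # {'A': 1, 'B': 2, 'C': 3}
```

Because the score counts significant wins rather than averaging the metric, two algorithms with similar median DSC can end up in either order, depending on how consistently one beats the other case by case.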

Top methods (brief):

nnU-Net: Self-configuring U-Net-based pipeline with automated adaptations (preprocessing, architecture, postprocessing) per task; uses leaky ReLU, instance normalization, strided downsampling; extensive augmentations; combined DSC + cross-entropy loss; Adam optimizer; ensembles of four architectures chosen via cross-validation.
NVDLMED: Fully-supervised uncertainty-aware multi-view co-training; initializes from 2D pre-trained models; three anisotropic 3D ResNet views (coronal, sagittal, axial); DSC loss; SGD optimizer; resampling to handle inter-task differences; ensemble of three view-specific models.
K.A.V.athlon: AutoML-style generalized pipeline; hybrid V-Net/U-Net with SE and residual blocks; augmentations (affine, noise, flips, crops, blurring); DSC loss with thresholded ReLU (0.5); Adam optimizer; no ensembling.
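
Several of these methods train with Dice-based losses, and nnU-Net combines a Dice term with cross-entropy. The sketch below shows one common way to write such a combined loss for a binary target, using plain Python lists of per-voxel probabilities. The equal weighting of the two terms, the smoothing constants, and the function names are assumptions for illustration and do not reproduce any team's implementation.

```python
from math import log


def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice, computed on probabilities rather than thresholded masks."""
    intersection = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(target) + eps)


def cross_entropy(probs, target, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * log(p) + (1 - t) * log(1.0 - p)
    return total / len(probs)


def dice_ce_loss(probs, target):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return soft_dice_loss(probs, target) + cross_entropy(probs, target)


target = [1, 1, 0, 0]
confident_right = [0.9, 0.8, 0.1, 0.2]
confident_wrong = [0.2, 0.1, 0.9, 0.8]
print(dice_ce_loss(confident_right, target) < dice_ce_loss(confident_wrong, target))  # True
```

The Dice term responds to overlap of the whole structure, which helps when the foreground is small, while the cross-entropy term gives a per-voxel gradient signal.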

Submission infrastructure and policies: Development phase allowed daily submissions with automated validation on grand-challenge.org; mystery phase allowed only one valid submission per algorithm to limit overfitting. Identity verification and fraud detection measures were discussed and later enhanced; containerized inference (Docker) was proposed as an improved future safeguard and is now supported in partnership with AWS and NVIDIA.

Key Findings

Participation and methods:

180 teams registered; 31 submitted valid development-phase results; 19 submitted valid mystery-phase results.
All qualifying methods used CNNs; U-Net was the most common base architecture (64%). Losses: DSC (29%) and cross-entropy (21%) predominated. Optimizers: Adam (61%) and SGD (33%).

Performance distributions:

Across all participants, the median of the per-algorithm mean DSC ranged from 0.16 (colon cancer segmentation; mystery phase) to 0.94 (liver in the development phase and spleen in the mystery phase).
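
The "median of the mean DSC" is a two-step aggregate: each algorithm's DSC is first averaged over the test cases of a task, and the median of those averages is then taken across algorithms. A small sketch with made-up numbers and hypothetical team names:

```python
from statistics import mean, median


def median_of_means(per_case_by_team):
    """Median across teams of each team's mean per-case DSC."""
    return median(mean(values) for values in per_case_by_team.values())


# Made-up per-case DSC values for three hypothetical teams on one task.
toy_task = {
    "team_1": [0.9, 0.7],
    "team_2": [0.5, 0.5],
    "team_3": [0.2, 0.4],
}
print(median_of_means(toy_task))  # 0.5
```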

Challenge rankings (median (IQR) DSC):

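Development phase: 1) nnU-Net 0.79 (0.61, 0.88); 2) K.A.V.athlon 0.77 (0.58, 0.87); 3) NVDLMED 0.78 (0.57, 0.87). Ranks follow the significance ranking rather than median DSC, which is why the second-placed method can have a slightly lower median than the third.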
Mystery phase: 1) nnU-Net 0.71 (0.58, 0.82); 2) NVDLMED 0.69 (0.55, 0.79); 3) K.A.V.athlon 0.67 (0.49, 0.80).
Rank stability (Kendall’s τ, median (IQR)): colon 0.94 (0.91, 0.95); hepatic vessel 0.99 (0.98, 0.99); spleen 0.92 (0.89, 0.94), indicating robust rankings under bootstrapping.
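
The rank-stability check can be illustrated with a small bootstrap: test cases are resampled with replacement, the algorithms are re-ranked on each resample, and Kendall's τ compares each bootstrap ranking with the ranking on the full data. The sketch below is not the challengeR implementation. For brevity it ranks by mean DSC instead of the significance ranking used in the challenge, it uses the tau-b variant, and the data and names are made up.

```python
import random
from math import sqrt
from statistics import mean, median


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long rank lists."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            tied_x += dx == 0
            tied_y += dy == 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    denom = sqrt((pairs - tied_x) * (pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")


def rank_by_mean(per_case, case_ids):
    """Ranks (1 = best) in a fixed algorithm order, using mean DSC as a stand-in."""
    names = sorted(per_case)
    means = {n: mean([per_case[n][i] for i in case_ids]) for n in names}
    return [1 + sum(1 for m in means.values() if m > means[n]) for n in names]


def bootstrap_tau(per_case, n_boot=1000, seed=0):
    """Tau between the full-data ranking and each bootstrap ranking."""
    rng = random.Random(seed)
    n_cases = len(next(iter(per_case.values())))
    full = rank_by_mean(per_case, range(n_cases))
    taus = []
    for _ in range(n_boot):
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        taus.append(kendall_tau_b(full, rank_by_mean(per_case, sample)))
    return taus


# Made-up per-case DSC values for three clearly separated hypothetical algorithms.
toy = {
    "A": [0.90, 0.92, 0.88, 0.91, 0.89],
    "B": [0.70, 0.72, 0.68, 0.71, 0.69],
    "C": [0.50, 0.52, 0.48, 0.51, 0.49],
}
print(median(bootstrap_tau(toy, n_boot=200)))  # 1.0
```

With well-separated algorithms every resample reproduces the same ordering, so τ stays at 1; values below 1 indicate that the ranking depends on which test cases happen to be included.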

Generalizability and resilience:

nnU-Net exhibited minimal rank variability across ROIs and tasks (rank range 1–4 in development phase) and consistently ranked first across bootstraps and tasks.
For difficult targets (e.g., pancreas tumor mass), the median of the mean DSC across participants was low (0.21), but nnU-Net achieved a substantially higher value (0.52), showing stronger resilience to task difficulty; for the liver ROI, nnU-Net reached a median DSC of 0.93.
In the mystery phase, overall medians were lowest for colon cancer (0.16) and highest for spleen (0.94); nnU-Net achieved 0.56 (colon) and 0.96 (spleen).

Post-challenge impact and trends:

In the subsequent 2 years, nnU-Net competed in 53 further segmentation tasks, winning 33, with median rank 1 (IQR 1–2), including BraTS 2020; many top challenge solutions across 2019–2020 were nnU-Net derivatives.
The rolling (live) MSD challenge saw improved performance on both hard and easy tasks (e.g., spleen median DSC improved from 0.94 to 0.97), with later methods surpassing the 2018 winners.
Two major trends emerged: continued refinement of self-configuring heuristics (e.g., nnU-Net) and the rise of Neural Architecture Search (NAS), which improved accuracy at higher computational cost.

Common traits of top methods:

Use of ensembles, intensity and spatial normalization and augmentation, Dice-based losses, the Adam optimizer, and post-processing (e.g., small region removal). Architectural tweaks were less critical than pipeline configuration, augmentation, and validation strategy.

Discussion

The findings demonstrate that a single, fully automated segmentation framework can generalize across diverse medical imaging tasks and modalities, achieving state-of-the-art performance without task-specific manual tuning. The robust correlation between development and mystery phase rankings suggests limited overfitting and supports the use of multi-task performance as a surrogate for generalizability. The analysis indicates that pipeline configuration (data preprocessing, augmentation, cross-validation, and post-processing) and training strategy can influence performance more than major architectural changes, given the prevalence and success of U-Net variants. The challenge infrastructure also highlighted practical concerns for fair benchmarking (overfitting control, identity verification); moving towards containerized, organizer-run inference can enhance integrity and reproducibility. The dataset's heterogeneity and broad availability enable continued benchmarking and research in generalizability and domain adaptation. However, the choice of metrics (DSC, NSD) involves a tradeoff: they are task-agnostic and stable but may not align with specific clinical endpoints, particularly for small structures. Overall, the MSD results validate the hypothesis that consistent multi-task performance predicts generalization to unseen tasks, and they inform best practices for method development (ensembling, augmentation, robust validation).

Conclusion

The MSD challenge shows that fully automated semantic segmentation methods can achieve state-of-the-art accuracy across a wide range of tasks without manual parameter tuning, validating the hypothesis that strong multi-task performance predicts generalization to unseen tasks. The winning framework (nnU-Net) generalized successfully across numerous subsequent challenges, often outperforming task-specific solutions. Methodological advances (e.g., NAS, refined heuristics) continue to improve performance. To broaden clinical adoption, tools should be packaged with user-friendly interfaces and simplified installation while addressing ongoing issues such as domain shift and label quality. Future work should further explore automated pipeline configuration, robust evaluation under domain variability, clinically meaningful metrics, and secure, containerized evaluation infrastructures.

Limitations

Dataset annotations were produced by single raters without inter-rater reliability measurements, limiting assessment of human-level variability and potentially affecting label quality.
Retrospective, multi-institutional data introduced heterogeneous imaging protocols and annotation procedures.
The dataset focused on radiological imaging; findings may not generalize to other domains (e.g., dermatology, pathology, ophthalmology).
One region (hepatic vessel annotations in the liver dataset) was found to be suboptimal after release; it was kept unchanged to maintain challenge integrity, which could affect segmentation evaluation for that region.
Metrics (DSC, NSD) are task-agnostic and not necessarily optimal for specific clinical use cases, especially for very small structures.
At the time, organizers lacked computational resources to enforce containerized inference for all participants, which could have better controlled overfitting or manual post-processing.
Submission constraints were intended to reduce overfitting, but some attempts to circumvent the policies occurred; later infrastructure improvements (identity verification, Docker support) address this.
