Learning multi-site harmonization of magnetic resonance images without traveling human phantoms
S. Liu and P. Yap
Multi-site MRI studies such as ABCD, ADNI, and AIBL pool data across scanners and protocols to achieve large sample sizes and broad population coverage, but pooling introduces non-biological variability arising from site-specific scanner and protocol differences. Prospective protocol harmonization is costly, must be done before data collection, and cannot correct existing datasets; even then, irreconcilable scanner differences can remain. Retrospective harmonization seeks to correct these differences after acquisition. Existing approaches fall into two groups: statistics-based intensity normalization and batch-effect correction, which operate at the whole-image level and may not preserve fine details, and learning-based image translation methods (machine learning and deep learning), which often require traveling human phantoms for supervision. Unsupervised deep learning methods avoid paired data but commonly learn pairwise mappings that scale poorly with the number of sites and do not exploit information from all sites jointly. This study proposes a unified, unsupervised deep learning framework that eliminates the need for traveling human phantom data by disentangling site-invariant anatomical content from site-specific appearance styles, enabling simultaneous multi-site harmonization with a single model while preserving anatomy.
The paper reviews two major categories of retrospective MRI harmonization. (1) Statistics-based methods: intensity normalization and statistical/batch-effect adjustment align global intensity distributions but are limited in handling local, detail-specific variations. (2) Learning-based methods: machine learning regression with hand-crafted features, and deep learning image-to-image translation with or without supervision. Supervised deep learning requires paired images from traveling human phantoms, which is impractical for large, multi-site, longitudinal studies. Unsupervised methods (e.g., CycleGAN variants) learn mappings from unpaired data but typically require separate training for each site pair, leading to N(N−1) mappings for N sites, poor scalability, and suboptimal use of multi-site information. Multi-domain approaches (e.g., StarGAN-v2) offer scalability but often lack explicit content–style disentanglement, risking alteration of anatomical details during style transfer. DRIT++ offers disentanglement but handles only two domains at a time and therefore scales poorly to many sites. The paper positions its method, MURD, as a multi-domain, explicitly disentangled, unsupervised approach that uses a single model for all sites.
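The scaling argument against pairwise methods can be checked with a few lines of arithmetic. The sketch below simply evaluates N(N−1) for some illustrative site counts; the function name and the chosen values of N are assumptions made for this example, not taken from the paper.

```python
def pairwise_mappings(n_sites: int) -> int:
    """Number of directed site-to-site mappings a pairwise method must learn."""
    if n_sites < 0:
        raise ValueError("n_sites must be non-negative")
    return n_sites * (n_sites - 1)


def unordered_site_pairs(n_sites: int) -> int:
    """Number of unordered site pairs (each pair covers both directions)."""
    return pairwise_mappings(n_sites) // 2


if __name__ == "__main__":
    # Illustrative site counts only.
    for n in (3, 10, 20):
        print(n, pairwise_mappings(n), unordered_site_pairs(n))
```

A single multi-domain model, by contrast, stays at one model regardless of N, which is the scalability property the summary attributes to MURD.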
Problem formulation: Multi-site MRI harmonization is posed as multi-domain image-to-image translation. The proposed multi-site unsupervised representation disentangler (MURD) learns to decompose each image into a site-invariant content representation (anatomy) and a site-specific style representation (appearance), then recombines them to reconstruct or harmonize images across sites.
Architecture: MURD implements content–style disentangled cycle translation (CS-DCT) with five modules: (i) a site-shared content encoder E^C to extract content features; (ii) a site-specific style encoder E^S for each site to extract style features; (iii) a site-shared generator G to synthesize images from content and style; (iv) a site-specific style generator G^S that maps a latent Gaussian code to site-specific style features; and (v) a site-specific discriminator D_j for adversarial training. Harmonization can be site-specific (style from G^S given a site label) or reference-specific (style encoded from a reference image of the target site). A 2.5D training strategy is used: three adjacent axial slices are stacked as three channels to leverage through-plane context while maintaining 2D network memory efficiency.
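The 2.5D input construction described above can be sketched with NumPy. This is a minimal illustration, assuming a volume stored as (slices, height, width) and a sliding window with stride 1 that drops the two boundary slices; the summary does not say whether windows overlap, so that choice and the function name `to_2p5d` are assumptions.

```python
import numpy as np


def to_2p5d(volume: np.ndarray) -> np.ndarray:
    """Stack each axial slice with its two neighbours as three channels.

    volume: array of shape (n_slices, H, W).
    Returns an array of shape (n_slices - 2, H, W, 3), where sample k holds
    slices k, k + 1, k + 2 (so it is centred on slice k + 1).
    """
    if volume.ndim != 3 or volume.shape[0] < 3:
        raise ValueError("expected a (n_slices >= 3, H, W) volume")
    return np.stack([volume[:-2], volume[1:-1], volume[2:]], axis=-1)
```

Each sample remains a 2D multi-channel image, so a 2D network can consume it unchanged while still seeing through-plane context.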
Training objectives: MURD jointly optimizes seven losses to enforce disentanglement, realism, and fidelity:
- Adversarial loss L_adv: applied through each site discriminator to encourage realistic, site-specific outputs.
- Content consistency loss L_cont: enforces similarity of content features across forward–backward translation.
- Style consistency loss L_sty: maintains style consistency across translation when content and style are disentangled.
- Content alignment loss L_ca (KL divergence regularization): promotes site-invariant content features in a common latent space, with random perturbations to encourage robustness.
- Style diversity loss L_sd: ensures intra-site style variability by encouraging different latent codes to produce diverse appearances.
- Cycle consistency loss L_cyc: combines pixel-wise and gradient losses to reconstruct inputs after forward–backward translation, preserving fine anatomical details.
- Identity loss L_id: enforces that combining an image’s own content and style reconstructs the input exactly, preventing anatomical changes under identity translation.
The total loss is a weighted sum, L = L_adv + λ_cont L_cont + λ_ca L_ca + λ_sd L_sd + λ_sty L_sty + λ_cyc L_cyc + λ_id L_id, with larger weights on the cycle and identity terms to emphasize structural fidelity and a small weight on content alignment; content and style consistency are weighted similarly to the adversarial loss.
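The weighted sum and the pixel-plus-gradient form of the cycle term can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the L1 norm, the finite-difference gradients, and every value in `PLACEHOLDER_WEIGHTS` are hypothetical choices that only mirror the qualitative description (large cycle and identity weights, small content-alignment weight).

```python
import numpy as np

# Hypothetical weights; the summary gives only their relative sizes.
PLACEHOLDER_WEIGHTS = {
    "cont": 1.0, "ca": 0.01, "sd": 1.0, "sty": 1.0, "cyc": 10.0, "id": 10.0,
}


def gradient_l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference between finite-difference gradients (2D)."""
    dy = np.abs(np.diff(a, axis=0) - np.diff(b, axis=0)).mean()
    dx = np.abs(np.diff(a, axis=1) - np.diff(b, axis=1)).mean()
    return float(dx + dy)


def cycle_loss(x: np.ndarray, x_rec: np.ndarray) -> float:
    """Pixel-wise L1 term plus gradient term (norm choice is an assumption)."""
    return float(np.abs(x - x_rec).mean()) + gradient_l1(x, x_rec)


def total_loss(terms: dict, weights: dict = PLACEHOLDER_WEIGHTS) -> float:
    """L = L_adv + sum over k of lambda_k * L_k, for the six weighted terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return terms["adv"] + sum(weights[k] * terms[k] for k in weights)
```

Note that the gradient term ignores a constant intensity offset while the pixel term does not, which is one reason to combine the two when edge fidelity matters.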
Datasets and preprocessing: Data come from the ABCD study and comprise T1- and T2-weighted brain MRI of 9–10-year-old children, grouped by vendor (GE, Philips, Siemens) into three virtual sites. For each modality, training used 20 volumes per vendor. Three mutually exclusive test sets were used per modality: (1) Validation Dataset: 10 volumes per vendor, structurally distinct from training, to assess effectiveness beyond training. (2) Generalizability Dataset: 1000 volumes per vendor, to test scalability/generalization. (3) Traveling Human Phantom Dataset: subjects scanned across vendors (GE–Philips: 1 subject; GE–Siemens: 5 subjects; Philips–Siemens: 2 subjects). T1 and T2 images of each subject were aligned with ANTs. For training and testing, 60 central axial slices per volume were used; three adjacent slices were merged into one 2.5D slice. Across modalities and vendors, more than 6000 volumes were used for evaluation.
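The dataset sizes quoted above can be cross-checked with simple bookkeeping. The sketch below uses only figures stated in this summary; the variable and function names are invented for illustration.

```python
VENDORS = ("GE", "Philips", "Siemens")
MODALITIES = ("T1", "T2")
SLICES_PER_VOLUME = 60  # central axial slices used per volume

VOLUMES_PER_VENDOR = {"training": 20, "validation": 10, "generalizability": 1000}


def total_volumes(split: str) -> int:
    """Volumes in a split, summed over all vendors and modalities."""
    return VOLUMES_PER_VENDOR[split] * len(VENDORS) * len(MODALITIES)


def slices_per_site_per_modality(split: str) -> int:
    """Axial slices available for one vendor and one modality in a split."""
    return VOLUMES_PER_VENDOR[split] * SLICES_PER_VOLUME


if __name__ == "__main__":
    for split in VOLUMES_PER_VENDOR:
        print(split, total_volumes(split), slices_per_site_per_modality(split))
```

The Generalizability Dataset alone accounts for 6000 volumes, which is consistent with the stated total of more than 6000 evaluation volumes once the validation and traveling-phantom scans are added, and 1000 volumes of 60 slices match the 60,000 slices per site per modality reported for the cross-resolution experiment.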
Implementation: MURD was implemented in TensorFlow and trained with the Adam optimizer at a learning rate of 1e-4. Each channel was normalized to [-1, 1] and zero-padded or cropped to 256×256. Inference proceeds in three steps: extract content from the source image; obtain style either from the target site’s style generator (site label plus Gaussian latent code) or from the style encoder applied to a reference image; and generate the harmonized image as G(content, style).
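The preprocessing step can be sketched in NumPy. Several details here are assumptions, since the summary does not specify them: min-max scaling is used for the [-1, 1] normalization, padding and cropping are centred, and padding is applied before normalization so that zero-valued padding maps to the same value as a zero-intensity background. The function names are hypothetical.

```python
import numpy as np


def pad_or_crop_center(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Zero-pad or centre-crop the first two axes of img to size x size."""
    out = img
    for axis in (0, 1):
        n = out.shape[axis]
        if n > size:
            start = (n - size) // 2
            index = [slice(None)] * out.ndim
            index[axis] = slice(start, start + size)
            out = out[tuple(index)]
        elif n < size:
            before = (size - n) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, size - n - before)
            out = np.pad(out, pad, mode="constant", constant_values=0)
    return out


def normalize_to_unit_range(img: np.ndarray) -> np.ndarray:
    """Min-max rescale intensities to [-1, 1] (one possible normalization)."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros(img.shape, dtype=np.float32)
    return (2.0 * (img - lo) / (hi - lo) - 1.0).astype(np.float32)
```

For a 2.5D sample of shape (H, W, 3), `normalize_to_unit_range` would be applied to each channel separately to match the per-channel normalization described above.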
- Visual and distributional quality: Compared with DRIT++ (pairwise, disentangled) and StarGAN-v2 (multi-domain, style-based), MURD achieved lower Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) to target-site distributions. MURD’s FID/KID values were close to reference values computed between training and testing images from the same site, indicating highly realistic, site-consistent appearance. This held for both the Validation and Generalizability datasets, demonstrating strong generalization despite modest training size (20 volumes per site).
- Preservation of anatomy and improved appearance consistency (traveling human phantoms): On cross-scanner pairs, MURD substantially outperformed DRIT++ and StarGAN-v2 on mean absolute error (MAE), multi-scale structural similarity (MS-SSIM), and peak signal-to-noise ratio (PSNR), indicating better harmonization of contrast while preserving anatomical details. Evaluations included GE→Philips (n=60 slices), Philips→GE (n=60), GE→Siemens (n=600), Siemens→GE (n=600), Philips→Siemens (n=120), Siemens→Philips (n=120).
- Downstream segmentation: Using BET and FAST on the traveling human phantom dataset, Dice similarity coefficients (DSC) for WM/GM/CSF segmentation improved markedly after harmonization relative to unharmonized images, using the target site’s segmentation as reference. Identity translation (site-to-same-site) preserved DSC, evidencing anatomical fidelity.
- Volumetrics and biological signal preservation: In the Generalizability dataset (500 T1/T2 images per gender per site), site-dependent differences in WM, GM, and CSF volume distributions were reduced after harmonization to any target site, while within-site identity translation preserved distributions. Gender differences (larger volumes in males) present in raw data were preserved post-harmonization.
- Cross-resolution harmonization: Downsampled images (1.25 mm) harmonized to 1.0 mm reference images showed improved MAE, MS-SSIM, and PSNR, with qualitative detail recovery for both T1 and T2 (n=60,000 slices per site per modality from the Generalizability dataset).
- Continuous harmonization: Interpolating style features between sites produced smooth, gradual appearance changes without altering anatomical content, supporting effective content–style disentanglement.
- Scale and practicality: A single MURD model harmonizes among multiple sites simultaneously without traveling human phantoms and with modest training data, enabling retrospective harmonization of large-scale studies.
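Three of the metrics cited in the findings above (MAE, PSNR, and the Dice similarity coefficient) have simple closed forms, sketched below with NumPy. This is a generic illustration rather than the evaluation code used in the study; the `data_range` argument is an assumption about how PSNR would be scaled (for example 2.0 for images in [-1, 1]), and MS-SSIM, FID, and KID are omitted because they need considerably more machinery.

```python
import numpy as np


def mae(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two images of the same shape."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))


def psnr(a: np.ndarray, b: np.ndarray, data_range: float) -> float:
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    mse = float(np.mean((a - b) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    a, b = np.asarray(mask_a).astype(bool), np.asarray(mask_b).astype(bool)
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * int(np.logical_and(a, b).sum()) / denom
```

In the traveling-phantom setting, `a` would be the harmonized image and `b` the same subject's scan from the target site; for segmentation, the masks would be per-tissue label maps (WM, GM, or CSF) from the two images.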
The study addresses the core challenge of multi-site MRI—non-biological inter-site variability—by explicitly disentangling anatomy (content) from site-specific appearance (style) in an unsupervised, multi-domain framework. Results show that MURD removes site effects (appearance) while preserving anatomical information, improving downstream tasks such as tissue segmentation and volumetric consistency. By using a single model across sites, MURD overcomes scalability limitations of pairwise methods (e.g., DRIT++), while explicit disentanglement mitigates the risk of anatomical alteration present in purely style-based multi-domain methods (e.g., StarGAN-v2). The approach supports reference-based and site-label-based harmonization, enables style diversity within sites, and allows continuous style interpolation, offering controllability and transparency. The 2.5D strategy yields spatially smooth 3D volumes with manageable memory. Collectively, findings demonstrate that MURD can be deployed for retrospective harmonization in large studies to enhance reliability and reproducibility of analyses without additional data collection.
MURD provides a unified, unsupervised, multi-site MR image harmonization framework that disentangles content and style to modify site-specific appearance while preserving anatomy. It uses a single model for simultaneous harmonization across sites, requires no traveling human phantoms, generalizes well from modest training data, and improves downstream segmentation and volumetrics while preserving biological differences. The method supports reference-specific and site-label-driven harmonization, cross-resolution detail recovery, and continuous appearance interpolation. Future directions include incremental adaptation to new sites via knowledge distillation to avoid retraining from scratch and incorporating images with atypical/abnormal appearances in training to ensure robust preservation of diagnostically relevant content.
Current evaluations were conducted on MR images of healthy individuals; atypical or pathological appearances (e.g., tumors) may interfere with harmonization. Although the unsupervised framework can incorporate such cases during training to preserve diagnostic content, explicit validation on diverse pathologies remains to be done. Additionally, while loss weighting emphasizes structural fidelity and disentanglement, exact weight settings (e.g., for content alignment) may affect performance and could require tuning across datasets.