The Beauty or the Beast: Which Aspect of Synthetic Medical Images Deserves Our Focus?
Y. Nan, F. Felder, et al.
The study addresses whether the visual fidelity of synthetic medical images correlates with their utility for downstream tasks such as classification. While synthetic data promises to alleviate data scarcity in medical imaging, assessment often relies on visual inspection or fidelity metrics (e.g., FID, IS), which may be subjective or inadequately reflect downstream performance. The authors hypothesize that fidelity and utility may not be positively correlated and propose a comprehensive evaluation across generative model families (VAE, GAN, diffusion) using standardized fidelity, variety, and utility metrics on pathological and chest X-ray datasets. The work aims to guide practitioners away from relying solely on image realism toward utility-aware synthetic data selection.
Commonly used synthesis quality metrics include Fréchet Inception Distance (FID) and Inception Score (IS), designed to gauge distributional similarity and perceptual quality. Precision/recall/F1 for generative models have been introduced to capture both fidelity (precision) and coverage/variety (recall). Prior work suggests these metrics align with human perception and are often used to select synthesis models. However, relationships between metric-based image quality and downstream utility are complex and possibly weak. VAEs (and VQ-VAE2) provide explicit latent modeling and diversity, but can yield blurrier reconstructions; GANs (e.g., StyleGAN2) deliver high realism but risk mode collapse and reduced variety; diffusion and latent diffusion models (LDMs) balance fidelity and variety but are computationally heavy. The study builds on these strands to empirically test the fidelity-utility link in medical imaging.
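FID, mentioned above, is the Fréchet distance between two Gaussians fitted to real and generated feature sets. A minimal sketch of that computation on pre-extracted features (here random vectors stand in for the 2048-d InceptionV3 pool features the metric normally uses):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (N, D) arrays of image features (D = 2048 for standard FID).
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts
    # can appear from numerical error and are discarded.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(sigma_r + sigma_f - 2.0 * covmean))

# Sanity check: a feature set compared with itself scores ~0.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
print(frechet_distance(x, x))
```

Note that the InceptionV3 feature extraction step is omitted here; in practice FID scores are only comparable when the same feature network and preprocessing are used throughout.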
Models: Three families were evaluated: (1) VQ-VAE2 (two-level latent hierarchy; discrete latent quantization; trained per class for efficiency), (2) StyleGAN2 (conditional; class labels encoded into a 512-d latent vector), and (3) latent diffusion models (LDMs) running diffusion in a VQ-VAE latent space with 8× compression, trained conditionally.
Datasets: (a) Pathology: 252 H&E whole-slide images from BCSS and Lizard; cropped to 256×256 patches, categorized into 6 classes (inflammatory, necrosis, stroma, tumor, fat, gland). Total 18,703 patches split into train (7,874), validation (3,741), test (7,478). (b) Chest X-ray: Pediatric AP images (Kaggle chest X-ray pneumonia) resized to 512×512; split per original: train 5,216, val 16, test 624. Only training subsets were used to generate synthetic images. (c) Additional CT dataset: 3D scans converted to 2D montages (four axial slices per montage; 20 montages per volume). Total 52,540 montages; 26,270 for training and 26,270 for quality evaluation, to compare StyleGAN2 vs LDM across resolutions (256, 512, 1024) on grayscale structured data.
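The CT preparation above tiles four axial slices into each 2D montage, twenty montages per volume. A minimal sketch of one way to do this; the slice-selection strategy (even sampling across the volume depth) and the 2×2 layout are assumptions for illustration, not confirmed details of the paper:

```python
import numpy as np

def volume_to_montages(volume, per_montage=4, n_montages=20):
    """Tile axial slices of a 3D volume into 2x2 grayscale montages.

    volume: (Z, H, W) array. Evenly samples per_montage * n_montages
    slice indices across the depth axis (an assumed strategy).
    """
    z = volume.shape[0]
    idx = np.linspace(0, z - 1, per_montage * n_montages).astype(int)
    montages = []
    for m in range(n_montages):
        s = [volume[i] for i in idx[m * per_montage:(m + 1) * per_montage]]
        top = np.hstack(s[:2])      # first two slices side by side
        bottom = np.hstack(s[2:])   # remaining two slices below
        montages.append(np.vstack([top, bottom]))
    return montages

vol = np.zeros((100, 128, 128), dtype=np.float32)  # toy 100-slice volume
ms = volume_to_montages(vol)
print(len(ms), ms[0].shape)  # 20 montages, each 256x256
```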
Fidelity and variety evaluation: Human fake identification rate (FIR) was measured by two experts (one clinician with 5 years of experience, one technician with 1 year) jointly labeling small shuffled subsets of real and synthetic images (pathology: 60 images; X-ray: 50; CT: 50). Automated metrics: FID computed on 2048-d InceptionV3 features, with precision and recall computed in the same feature space. As a variety proxy, the lossless JPEG file size of the per-class mean image was recorded (a smaller file indicates a blurrier mean image and thus greater intra-class diversity).
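The mean-image file-size proxy works because a diverse class averages to a blurry, highly compressible image, while a mode-collapsed class averages to a sharp one. A minimal sketch, using PNG as a stand-in lossless codec for the paper's lossless JPEG:

```python
import io
import numpy as np
from PIL import Image

def mean_image_size_bytes(images):
    """Losslessly compress the per-class mean image and return its size.

    A smaller file suggests a blurrier mean and hence greater intra-class
    variety; a large file hints at mode collapse (samples too alike).
    """
    mean_img = np.stack(images).mean(axis=0).astype(np.uint8)
    buf = io.BytesIO()
    # PNG stands in for the paper's lossless JPEG codec here.
    Image.fromarray(mean_img).save(buf, format="PNG")
    return buf.getbuffer().nbytes

rng = np.random.default_rng(0)
diverse = [rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(200)]
collapsed = [diverse[0]] * 200  # mode collapse: every sample identical
print(mean_image_size_bytes(diverse), mean_image_size_bytes(collapsed))
```

The collapsed set yields a noticeably larger mean-image file than the diverse one, matching the direction of the paper's proxy.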
Utility evaluation: Two use cases. (1) Data augmentation utility: synthetic images were added to the training sets, and the change in test classification accuracy was measured with an InceptionV3 classifier, using standard augmentations (flip/rotate/contrast) as the baseline. Significance was assessed via the paired Wilcoxon signed-rank test (p < 0.05). (2) Feature extraction utility: InceptionV3 was pretrained on fully synthetic datasets, then only the last FC layer was fine-tuned on 50% of the real training data for 20 epochs; test accuracy was compared to models trained on real data.
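The significance check described above can be reproduced with `scipy.stats.wilcoxon`. A minimal sketch with hypothetical paired accuracies (the values below are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired test accuracies: baseline augmentation vs. baseline
# plus synthetic images, over eight repeated runs (illustrative numbers).
baseline  = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80])
augmented = np.array([0.85, 0.83, 0.86, 0.84, 0.85, 0.82, 0.84, 0.83])

# Paired Wilcoxon signed-rank test on the per-run differences.
stat, p = wilcoxon(augmented, baseline)
significant = p < 0.05  # the paper's significance threshold
print(f"W={stat}, p={p:.4f}, significant={significant}")
```

Because every paired difference here is positive, the rank sum on the negative side is zero and the test reports a significant improvement.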
Efficiency: Report GPU memory for 8 images, inference time for generating 1000 images, and qualitative observations on training cost.
Bias analysis: For X-ray, investigated dataset bias (presence and position of image text “R”) and its inheritance by generative models. Conducted a cropped X-ray experiment removing text regions and reevaluated metrics and utility (Table IV).
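The cropping step amounts to removing the image band where the "R" laterality marker sits before rerunning the pipeline. A minimal sketch; the exact crop geometry (a right-hand band of 20% width) is an assumption for illustration, not the paper's stated coordinates:

```python
import numpy as np

def crop_text_region(img, band_frac=0.2):
    """Drop a lateral band assumed to contain the "R" text marker.

    band_frac is a hypothetical parameter; the paper's actual crop
    region for Table IV is not specified here.
    """
    h, w = img.shape[:2]
    band = int(w * band_frac)
    return img[:, : w - band]  # keep everything left of the text band

xray = np.ones((512, 512), dtype=np.uint8)  # toy stand-in for a 512x512 X-ray
print(crop_text_region(xray).shape)  # (512, 410)
```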
- Fidelity vs utility decoupling: High-fidelity or visually realistic images did not guarantee higher utility. In some cases, lower-fidelity images supported better downstream performance.
- Pathology dataset (256×256):
- VQ-VAE2 had poor fidelity (FID 201.85) and was easily identified as fake (FIR 0.83); recall 0.16 indicated poor coverage; utility was negative: augmentation −5.73%, feature extraction −21.10% (significant).
- StyleGAN2 fooled humans (FIR 0.08) but showed mode collapse (high mean-image file size 114.67; recall 0.21). Utility: augmentation +0.08%, feature extraction −19.77% (significant drop).
- LDM achieved the best automated fidelity (FID 62.56) and balanced precision/recall (0.46/0.43), and delivered the only positive augmentation utility, +3.64% (significant), though feature extraction still fell below the real-training baseline (−3.19%, significant). Traditional augmentations alone yielded +4.07%, comparable to the LDM gain.
- X-ray dataset (512×512):
- VQ-VAE2: FIR 1.00, FID 45.59, precision/recall 0.00/0.00, augmentation +0.96%, feature extraction −22.27% (significant).
- StyleGAN2: Good fidelity (FID 9.85) and precision 0.68 but reduced recall 0.19; augmentation −1.92%; feature extraction +1.76% (appearing strong before bias control).
- LDM: Poor realism (FIR 0.80; FID 25.68) with low recall 0.04; augmentation −0.64%; feature extraction −0.80% yet still produced features comparable to real in some analyses.
- Bias inheritance and false utility: In X-ray normals, a consistently placed “R” text marker induced dataset bias. StyleGAN2 reproduced this bias, inflating its apparent feature extraction performance. After cropping the text regions, StyleGAN2's feature extraction utility dropped markedly (−13.30% vs baseline), while LDM maintained performance comparable to real data and showed strong augmentation utility (+8.01%, significant): LDM's failure to replicate the bias made its synthetic data more robust.
- Grayscale structured datasets (X-ray, CT) across resolutions (Table III): LDM struggled to synthesize realistic grayscale structured images (high FIR and poorer FID/recall), whereas StyleGAN2 handled higher resolutions (up to 1024×1024) with better fidelity and precision, though still with variety challenges.
- Efficiency: StyleGAN2 was most efficient (e.g., ~1 min/1000 images; ~1.19 GB for 8 images). LDM required far more memory and time (e.g., 39 min/1000 X-ray images; ~70.6 GB), raising practical concerns for large-scale medical synthesis.
The findings demonstrate that common fidelity-centric metrics (FID, human realism judgments, precision) and visual appeal are unreliable proxies for downstream utility in medical imaging tasks. GANs like StyleGAN2 can produce highly realistic images and even seemingly strong feature extractors due to inheriting dataset biases (e.g., consistent text markers), which undermines robustness and generalization. Conversely, diffusion models (LDM) that fail to replicate such biases may yield less realistic images but more robust features and better augmentation utility in bias-controlled settings. For pathology images, LDM balanced fidelity and variety enough to yield modest, statistically significant augmentation gains, comparable to classical data augmentation alone, questioning the added value given LDM’s computational overhead. On grayscale structured data (X-ray, CT), diffusion models struggled in realism and coverage, while StyleGAN2 managed fidelity but suffered variety/mode collapse and bias inheritance. Overall, utility must be empirically assessed via downstream tasks rather than inferred from fidelity alone.
The study provides an empirical evaluation across VQ-VAE2, StyleGAN2, and LDM on pathology and chest X-ray datasets, showing that fidelity and visual realism do not reliably predict utility for downstream classification. Low-fidelity images can outperform high-fidelity ones, particularly when high-fidelity models inherit dataset biases. Diffusion models offered the best augmentation gains on pathology but at high computational cost and with limited realism on grayscale structured data. The authors advocate for utility-aware, explainable synthesis approaches and emphasize that downstream performance evaluation is essential before deploying synthetic data. Future work should develop bias-aware, utility-optimized generative modeling, broader modality coverage, scalable training/inference, and standardized, task-relevant evaluation frameworks.
- Limited datasets and modalities: primary evaluation on one pathology patch dataset and one pediatric chest X-ray dataset, with an auxiliary CT montage dataset. Generalizability to other modalities, pathologies, and multi-institutional distributions is unproven.
- Human fidelity assessment involved only two experts and small subsets, which may limit reliability.
- Utility assessed for classification with InceptionV3; results may vary with different architectures, tasks (e.g., detection, segmentation), or training regimes.
- Diffusion model training and inference were resource-intensive; some architectural/hyperparameter choices could impact results.
- Bias analysis focused on a specific visible text artifact; other latent biases may remain undetected.