
Non-Reference Quality Assessment for Medical Imaging: Application to Synthetic Brain MRIs

K. V. E. Risager, T. Gholamalizadeh, et al.

Discover research by Karl Van Eeden Risager, Torkan Gholamalizadeh, and Mostafa Mehdipour Ghazi that transforms the assessment of brain MRI quality with a novel, non-reference deep learning approach. The method scores artifacts in MRI scans and is paired with a diffusion model that generates high-fidelity synthetic 3D images, providing crucial insights for enhancing medical imaging standards.

Introduction
The paper addresses the need for reliable, high-quality synthetic brain MRIs to tackle domain adaptation, labeled data scarcity, and privacy constraints in medical imaging. While diffusion models (e.g., DDPM) and recent 3D Wavelet Diffusion Models (WDM) have improved 3D image synthesis, comprehensive quality assessment of generated 3D MRI remains limited. Existing reference-based metrics (SSIM, PSNR) require paired images, and group-based measures (FID) do not assess individual image quality. Non-reference metrics (BRISQUE, PIQE, NIMA, GIQA) are designed for 2D natural images and fail to capture MRI-specific artifacts (e.g., motion, bias field, complex noise). The authors propose a comprehensive, non-reference method tailored to brain MRI: a 3D ResNet-50 regressor trained to predict quality scores (0–1) for six artifact types (contrast change, bias field, Gibbs ringing, motion ghosting, Rician noise, blur). They also train a diffusion model to generate synthetic 3D MRIs and evaluate both real and synthetic images with their assessor, aiming to enable clinically meaningful assessment and to guide improvements in generative models.
Literature Review
The paper reviews limitations of commonly used image quality metrics for generative medical imaging. Reference-based metrics such as SSIM and PSNR require paired ground truth images, which are often unavailable for synthetic images. FID assesses distributional similarity between sets of images and cannot evaluate individual image quality. Non-reference metrics like BRISQUE and PIQE, as well as learning-based methods (NIMA, GIQA), are developed for 2D natural images and do not capture MRI-specific artifacts (e.g., motion, bias field inhomogeneity, Rician noise). The authors highlight the gap for medical imaging–specific, non-reference quality assessment, particularly for 3D MRI, motivating their artifact-specific, non-reference approach.
Methodology
Overview: The authors train a 3D deep regression network (3D ResNet-50) with six outputs to estimate image quality with respect to six MRI artifact types. Training uses high-quality reference MRIs augmented on the fly with simulated distortions covering a spectrum of severities. Scores are scaled to [0, 1], where higher values indicate better quality.

Generative Model (for synthetic data evaluation): A 3D Wavelet Diffusion Model (WDM) is used to synthesize 3D MRIs. Images are decomposed via a discrete wavelet transform (DWT) into eight half-resolution coefficient volumes, which are noised and denoised with a denoising diffusion probabilistic model. The model is optimized on the denoised coefficients, and full-resolution images are reconstructed via the inverse DWT.

Artifact Simulations and Quality Metrics (targets for supervision):
- Contrast Change: Gamma transform J = I^γ with γ ∈ [0.5, 2]. Quality is measured by the standard deviation ratio (SDR) σ_J/σ_I, constrained to [0, 1].
- Bias Field: Images are multiplied by an elliptic gradient field (random centers and radii within the 224³ volume) to simulate intensity inhomogeneity. Quality is measured by the coefficient-of-variation ratio (CVR), (σ_I/μ_I)/(σ_J/μ_J).
- Gibbs Ringing: k-space edges are truncated after an FFT along each axis, with cutoff fc ∈ [32, 224] for 224³ volumes. Quality is measured by the truncation ratio fc/224.
- Motion Ghosting: k-space lines are randomly modulated by a factor α ∈ [0.35, 1] along the axes. Quality is measured as α = min(F_I/F_J) using FFT magnitudes.
- Rician Noise: Zero-mean Gaussian noise is added to the real and imaginary components, with noise level in [10⁻⁶, 10⁻²]. Quality is measured by PSNR = 10 log₁₀(1/MSE), downscaled by 100 and clipped to [0, 1].
- Blur Effect: Either down/upsampling (linear interpolation with scale ∈ [0.2, 2]) or Gaussian smoothing (kernel size ∈ [3, 11], sigma ∈ [0.25, 5]) is applied. Quality is measured as the ratio of high-frequency components above a threshold T (1/1000 of the maximum frequency) in k-space between J and I.

Quality Network Architecture and Loss: A 3D ResNet-50 with six regression outputs (sigmoid activations enforce the [0, 1] range) takes full-size 224×224×224 inputs. The network was initially trained with MSE; to better emphasize low-score cases, the authors adopted a focal MSE loss with parameters α = 2 and γ = 1 over the M = 6 outputs.

Data Augmentation and Inference: On-the-fly geometric augmentations include translations (±10 voxels), rotations (±10°) about the three axes, flips, elastic deformations (smoothed random displacement fields, with Gaussian smoothing in [20, 30] and scaling factor in [200, 500]), and skull stripping via random morphological cropping. At inference, flipped copies are evaluated and predictions are averaged for robustness.

Data and Preprocessing: T1-weighted MRIs from multiple datasets (ADNI, OASIS, Hammers, IBSR, IXI, SynthRAD, BraTS) were normalized to [0, 1], resampled to 1 mm isotropic resolution, and padded or center-cropped to 224³. ADNI provided 503 high-quality reference volumes for training the quality network; the other datasets were used for evaluation. The WDM was trained, validated, and tested on ADNI, BraTS, and SynthRAD with hyperparameters from prior work, for up to 6×10⁵ iterations, using VGG16-based FID for validation. The quality network was trained with Adam for 50 epochs (base learning rate and weight decay 10⁻⁴, piecewise learning-rate schedule with drop factor 0.9, gradient decay 0.9, squared-gradient decay 0.99). Minimal code sketches of the artifact simulations, the focal loss, and flip-averaged inference follow below.
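To make the supervision targets concrete, here is a minimal NumPy sketch of three of the distortion simulations and their quality scores (contrast change, Gibbs ringing, Rician noise), assuming volumes normalized to [0, 1]. The function names and exact clipping details are ours for illustration, not the authors' code.

```python
import numpy as np

def contrast_change(vol: np.ndarray, gamma: float):
    """Gamma transform J = I**gamma; quality is the standard-deviation
    ratio (SDR) sigma_J / sigma_I, clipped to [0, 1]."""
    J = vol ** gamma
    return J, float(np.clip(J.std() / vol.std(), 0.0, 1.0))

def gibbs_ringing(vol: np.ndarray, fc: int, axis: int = 0):
    """Truncate high k-space frequencies along one axis with cutoff fc;
    quality is the truncation ratio fc / N."""
    n = vol.shape[axis]
    k = np.fft.fftshift(np.fft.fftn(vol))
    keep = np.zeros(n, dtype=bool)
    keep[n // 2 - fc // 2 : n // 2 + fc // 2] = True   # keep central band
    k = np.moveaxis(k, axis, 0)
    k[~keep] = 0.0                                     # zero the k-space edges
    k = np.moveaxis(k, 0, axis)
    J = np.abs(np.fft.ifftn(np.fft.ifftshift(k)))
    return J, fc / n

def rician_noise(vol: np.ndarray, sigma: float, seed: int = 0):
    """Zero-mean Gaussian noise on real and imaginary channels yields a
    Rician magnitude image; quality is PSNR/100 clipped to [0, 1]
    (intensities assumed in [0, 1], so the PSNR peak is 1)."""
    rng = np.random.default_rng(seed)
    real = vol + rng.normal(0.0, sigma, vol.shape)
    imag = rng.normal(0.0, sigma, vol.shape)
    J = np.sqrt(real ** 2 + imag ** 2)
    psnr = 10.0 * np.log10(1.0 / np.mean((J - vol) ** 2))
    return J, float(np.clip(psnr / 100.0, 0.0, 1.0))
```

For a 224³ volume, for example, gibbs_ringing(vol, fc=64) yields a visibly ringing image with target score 64/224 ≈ 0.29, matching the truncation-ratio definition above.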
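The summary specifies a focal MSE with α = 2 and γ = 1 over the M = 6 outputs but not its exact form. A common focal-regression formulation scales each squared error by a power of its own magnitude so that hard, low-score cases dominate the gradient; the PyTorch sketch below is one plausible reading, not necessarily the authors' exact loss.

```python
import torch

def focal_mse(pred: torch.Tensor, target: torch.Tensor,
              alpha: float = 2.0, gamma: float = 1.0) -> torch.Tensor:
    """Focal variant of MSE over (batch, M) quality scores in [0, 1]:
    each squared error is re-weighted by alpha * |error|**gamma, so
    poorly predicted samples contribute more than easy ones."""
    err = (pred - target).abs()
    return (alpha * err.pow(gamma) * err.pow(2)).mean()
```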
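Flip-averaged inference from the augmentation paragraph can be sketched as follows, assuming a model that maps a (1, 1, 224, 224, 224) volume to six quality scores; since the summary only says flipping is applied, averaging over all three spatial axes is our assumption.

```python
import torch

@torch.no_grad()
def predict_quality(model: torch.nn.Module, vol: torch.Tensor) -> torch.Tensor:
    """Average the six quality scores over the original volume and its
    mirror images along each spatial axis of a (1, 1, D, H, W) input."""
    model.eval()
    views = [vol] + [torch.flip(vol, dims=[d]) for d in (2, 3, 4)]
    preds = torch.stack([model(v) for v in views])  # (4, 1, 6)
    return preds.mean(dim=0)                        # (1, 6) averaged scores
```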
Key Findings
- Accuracy on ADNI with synthetic distortions (Table 1): Predicted quality scores closely matched ground-truth targets across artifacts. Example means (GT → Pred): Contrast 0.771→0.796, Bias 0.915→0.960, Ring 0.553→0.570, Ghost 0.625→0.646, Noise 0.466→0.513, Blur 0.605→0.691, Mixed 0.699→0.732. MSEs remained low (e.g., Ghost 0.003±0.004; others ~0.007–0.011), with the largest discrepancies for blur, where interpolation/smoothing may overlap with other distortions.
- Generalization to external datasets (Table 2): On undistorted datasets, the proposed mean±SD scores were high: ADNI 0.955±0.017, IXI 0.955±0.025, OASIS 0.961±0.006, Hammers 0.945±0.020, IBSR 0.935±0.013; BRISQUE ranged ~38–45. On distorted versions, the proposed scores decreased as expected (e.g., ADNI 0.732±0.070, IXI 0.699±0.064, OASIS 0.719±0.066, Hammers 0.699±0.072, IBSR 0.701±0.081). SSIM on the distorted sets was moderate (~0.545–0.670), BRISQUE increased (worse), and PSNR decreased (~17.8–27.2 dB). SSIM and PSNR are not applicable to the original sets without references, highlighting the advantage of the proposed non-reference approach.
- Evaluation of synthetic images (Table 3): The proposed scores aligned with SSIM, BRISQUE, PSNR, and FID while offering interpretable [0, 1] outputs: ADNI 0.871±0.017 (FID 196.01), BraTS 0.917±0.013 (FID 157.57), SynthRAD 0.891±0.019 (FID 191.58); corresponding SSIMs 0.670, 0.896, 0.727 and PSNRs 17.35, 19.61, 17.74 dB.
- Artifact-specific insights for synthetic data (Table 4): Common issues differed by dataset: ADNI samples showed lower noise scores (0.780±0.040), BraTS samples lower blur scores (0.837±0.019), and SynthRAD samples lower ghosting scores (0.688±0.081) alongside strong blur scores (0.944±0.021). Contrast scores were generally high across datasets (≥0.928). These detailed assessments can guide targeted improvements (e.g., loss terms) in generative models.
- Qualitative examples (Figure 1): Real IBSR and synthetic ADNI samples with correspondingly low scores visually corroborated the assessor's sensitivity to ringing, blur, contrast, and noise artifacts.
Discussion
The proposed non-reference, artifact-aware quality assessment accurately estimates MRI quality across multiple distortion types and generalizes from ADNI to several external datasets without relying on paired references. This directly addresses the limitations of SSIM/PSNR in generative settings and the inadequacy of 2D natural-image NR metrics for MRI. The [0,1] scoring provides intuitive, per-image interpretability and aligns with established metrics (SSIM, BRISQUE, PSNR, FID) while offering finer-grained, artifact-specific feedback. In evaluating synthetic images from diffusion-based generators, the method reveals dataset-specific weaknesses (e.g., noise in ADNI, blur in BraTS), enabling actionable guidance such as incorporating artifact-penalizing loss components or data/model adjustments to improve generative quality. The results indicate that a single trained 3D ResNet-50 can provide robust, clinically relevant quality indicators across diverse scanners, field strengths, and preprocessing pipelines.
Conclusion
The study introduces a comprehensive, non-reference quality assessment framework for 3D brain MRIs using a 3D ResNet-50 regressor trained on artifact-simulated data. It outputs interpretable [0,1] scores for six common MRI artifact dimensions and demonstrates strong agreement with ground truth distortions, superior generalization across multiple datasets, and consistency with established metrics. By training and evaluating a 3D wavelet diffusion model, the work shows the assessor’s utility for synthetic image evaluation and for diagnosing artifact-specific weaknesses, thus informing improvements to generative models. Future work could extend to additional MRI modalities, more artifact types, and integration into training loops for generative models to further optimize image quality in a targeted manner.
Limitations
- Blur-related discrepancies: The model shows notable confusion where interpolation and smoothing artifacts overlap, leading to higher estimation errors for blur compared to other artifacts.
- Training data scope: The quality network was trained on ADNI references; although it generalized well to multiple external datasets, broader training across more diverse sources and acquisition protocols could further enhance robustness.
- Evaluation context: Many evaluated images were preprocessed and high quality; performance in raw clinical workflows with wider artifact severity and variability was not exhaustively characterized.