Can We Trust Synthetic Data in Medicine? A Scoping Review of Privacy and Utility Metrics



B. Kaabachi, J. Despraz, et al.

Discover how researchers Bayrem Kaabachi, Jérémie Despraz, Thierry Meurers, Karen Otte, Mehmed Halilovic, Fabian Prasser, and Jean Louis Raisaro are tackling the privacy concerns surrounding health data sharing. Their scoping review reveals critical gaps in evaluating synthetic data, proposing essential steps for standardizing methods in medical research.

Introduction
Access to high-quality data is critical for impactful medical research and practice, particularly with the rise of AI and ML in precision medicine. Privacy regulations such as HIPAA and GDPR restrict access to and sharing of medical data, and anonymization of high-dimensional data often reduces its utility. Synthetic data, which mimics the statistical properties of real patient data (and is often generated with GANs), has been proposed as a promising alternative. However, over-reliance on synthetic data as a "silver bullet" poses risks, including potential membership inference attacks enabled by model overfitting and memorization. There is an inherent trade-off between utility and privacy: the more closely synthetic data mimics the real data, the higher its utility but the lower its privacy. Given the limited scrutiny of synthetic data privacy compared with the anonymization literature, this scoping review aims to clarify how synthetic data is evaluated in healthcare, focusing on privacy and utility, and addressing two research questions: (1) Is there consensus on how to evaluate the privacy and utility of synthetic data? (2) Are privacy and utility given the same importance when assessing synthetic data?
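The membership inference risk mentioned above can be made concrete with a minimal sketch: if a generator memorizes training rows, synthetic records will lie unusually close to them, and an attacker can exploit that proximity. This is an illustrative toy (the function names, data, and threshold are hypothetical, not from the review):

```python
import math

def nearest_distance(record, synthetic):
    """Euclidean distance from `record` to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)

def infer_membership(record, synthetic, threshold):
    """Guess that `record` was in the training set when some synthetic
    record lies suspiciously close -- a symptom of memorization."""
    return nearest_distance(record, synthetic) <= threshold

# Toy generator output that memorized the training row (1.0, 2.0).
synthetic = [(1.0, 2.0), (5.0, 7.0)]
print(infer_membership((1.0, 2.1), synthetic, threshold=0.5))  # True
print(infer_membership((9.0, 9.0), synthetic, threshold=0.5))  # False
```

The same distance is also a crude utility signal, which is exactly the trade-off the review describes: pushing synthetic records away from real ones lowers this risk but also degrades fidelity.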
Literature Review
Prior surveys have addressed synthetic data generation and evaluation in general. Figueira and Vaz described multiple generation methods, and Hernandez et al. explored evaluation methods and compared their performance. Open-source tools (Synthetic Data Vault, Table Evaluator, TAPAS) provide metrics for utility and privacy but use differing nomenclatures, complicating harmonization. The literature contains contradictory perspectives on synthetic data privacy and utility, and unlike anonymized data where attack models and protections are well studied, synthetic data has been less scrutinized. This review differentiates itself by systematically examining evaluation practice across 92 studies, emphasizing the privacy-utility trade-off and the lack of consensus on evaluation methods.
Methodology
Design: Scoping review guided by PRISMA and its extension for scoping reviews (PRISMA-ScR) to perform data collection and knowledge synthesis, including the creation of a common ontology/taxonomy for privacy and utility evaluation methods.
Timeframe: Publications from January 2018 to December 2022.
Data sources: IEEE Xplore (n=123), ACM Digital Library (n=52), PubMed (n=57), Embase (n=75). Queries targeted synthetic data evaluation of utility or privacy and were refined to avoid false positives (e.g., synthetic materials). Searches were last run on August 14, 2023.
Screening and eligibility: Duplicates removed (n=36); reports assessed for eligibility (n=271). Exclusions: unstructured data (n=90), no evaluation (n=19), meta-analysis/survey (n=7), poster/abstract (n=11), not related (n=52). Included studies: n=92.
Inclusion criteria: research applying synthetic data generation (SDG) methods and evaluating their outputs; published 2018–2022; structured data; English.
Exclusion criteria: surveys/reviews, non-English publications, no assessment of the generated output, unstructured data, poster abstracts.
Data charting and synthesis: Disagreements were resolved by author consensus. A data-charting form defined the extracted variables (e.g., broad/specific utility and privacy metrics, privacy type, added noise layer, adversary knowledge, SDG method). To ensure standardized extraction and enable quantitative synthesis, the team created a taxonomy of evaluation methods, with separate taxonomies for utility (univariate, bivariate, multivariate, longitudinal, domain-specific) and privacy (dataset evaluation vs model evaluation).
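The data-charting step above can be sketched as a simple record type holding the variables the review names (broad/specific metrics, privacy type, added noise layer, adversary knowledge, SDG method). The field names and example values here are illustrative assumptions, not the authors' actual charting form:

```python
from dataclasses import dataclass, field

@dataclass
class ChartingRecord:
    """One extracted study, using the charting variables named in the review.
    Field names and vocabularies are hypothetical illustrations."""
    study_id: str
    sdg_method: str                   # e.g. "GAN", "Synthpop", "MDClone"
    utility_metrics: list = field(default_factory=list)  # e.g. ["multivariate/ml-classification"]
    privacy_metrics: list = field(default_factory=list)  # e.g. ["dataset/exact-match"]
    privacy_type: str = "none"        # "dataset", "model", "both", or "none"
    added_noise_layer: bool = False   # differential privacy or masking applied
    adversary_knowledge: str = "unspecified"

# A study that evaluates utility but performs no privacy assessment.
rec = ChartingRecord("study-042", "GAN",
                     utility_metrics=["multivariate/ml-classification"])
print(rec.privacy_type)  # "none"
```

Structuring each study this way is what makes the quantitative synthesis possible: counting how many records have `privacy_type == "none"` directly yields the under-evaluation figures reported in the findings.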
Key Findings
Scope and trends: 92 studies were included, with interest increasing over time: 4.35% (4/92) in 2018, 10.87% (9/92) in 2019, and 43.48% (40/92) in 2022.
Data types: 70% (64/92) cross-sectional; 26% (24/92) longitudinal/temporal.
Synthesis methods: 49 distinct SDG methods were reported; approximately 35% (17/49) used GANs, while the rest relied on statistical modeling and software tools (e.g., Synthpop, MDClone).
Utility evaluation: 94% (86/92) of studies evaluated utility. Across the 48 utility methods identified: 17 univariate, 6 bivariate, 9 multivariate, 9 longitudinal, 8 domain-specific. The most frequent metrics were multivariate ML classification performance (33 uses), univariate descriptive statistics comparison (25), and univariate visual comparison of marginal distributions (17).
Privacy evaluation intent and practice: 80% (74/92; the figure narrative reports 77/92) of studies aimed at privacy-preserving data sharing, 16% (15/92) at data augmentation, and 4% (3/92) at both. Among the privacy-preserving intents (n≈77), 15 added explicit protections (differential privacy or masking). Of the remaining 63 without added protection, only 24 (38%) conducted any privacy evaluation, indicating frequent blind trust in privacy. Among the studies that evaluated privacy (n=24): 83% (20/24) performed dataset-based evaluation, 8% (2/24) model-based, and 8% (2/24) both. Nine privacy methods were identified, including exact match, distance-based measures, holdout-set distance, probabilistic disclosure risk estimation, shadow models, discriminator likelihood, gradient norm attacks, CRLProxy, and classification/regression tasks used as indirect privacy probes.
Overall: There is no consensus on specific metrics or metric families for either utility or privacy, and wide heterogeneity hampers comparability. The findings highlight under-evaluation of privacy relative to utility.
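Two of the most common metric families above, dataset-level exact match (privacy) and univariate descriptive-statistics comparison (utility), are simple enough to sketch directly. This is a minimal illustration with made-up data; the function names and thresholds are not from the reviewed studies:

```python
def exact_match_rate(real_rows, synthetic_rows):
    """Dataset-level privacy probe: fraction of synthetic rows that
    replicate a real row exactly (higher = riskier)."""
    real = set(real_rows)
    hits = sum(1 for row in synthetic_rows if row in real)
    return hits / len(synthetic_rows)

def mean_gap(real_col, synth_col):
    """Univariate utility probe: absolute difference of column means
    (lower = the marginal is better preserved)."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(real_col) - mean(synth_col))

# Toy (age, systolic BP) rows; one synthetic row is a verbatim copy.
real = [(30, 120), (45, 140), (60, 150)]
synthetic = [(31, 121), (45, 140), (59, 149)]
print(exact_match_rate(real, synthetic))   # 1/3 of rows copied verbatim
print(mean_gap([r[0] for r in real], [s[0] for s in synthetic]))  # 0.0
```

The example also shows why the review stresses evaluating both sides: this synthetic set scores perfectly on the utility probe while leaking a verbatim patient row, something a utility-only evaluation would never surface.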
Discussion
The review answers RQ1 by showing a lack of consensus on standardized metrics for both utility and privacy evaluation of synthetic health data. The heterogeneity of methods prevents fair comparisons across SDG approaches and complicates adoption in real-world clinical contexts. Legal frameworks provide little concrete guidance on assessing synthetic data privacy risk, and proposed mappings to GDPR constructs (e.g., singling out, linkage, inference) remain unconfirmed. For RQ2, privacy is often under-evaluated relative to utility. Despite frequent claims of privacy-preserving goals, many studies perform no formal privacy assessment, and among evaluations conducted, dataset-level checks dominate over mechanism-level (model) evaluations. This imbalance risks misplaced trust and potential privacy breaches if synthetic data is assumed safe by default. Factors influencing method selection include research objectives (augmentation vs proxy use), implementation complexity (favoring simple statistical or distance-based measures), interpretability (visual comparisons and easily understood metrics), data structure (time series vs cross-sectional), and generalizability of attacks (GAN-specific methods like discriminator-based attacks vs model-agnostic approaches). Collectively, these results underscore the need for standardized, interpretable, and legally grounded evaluation frameworks that balance utility and privacy.
Conclusion
Synthetic data holds promise for enabling data sharing and advancing AI in healthcare, but adoption is hindered by the absence of standardized evaluation for utility and privacy. This scoping review provides a comprehensive taxonomy of evaluation methods and highlights a pervasive lack of rigorous privacy assessment, especially mechanism-oriented evaluations. The field requires harmonized, trustworthy, and legally aligned metrics and procedures to ensure safe, meaningful use. Initiatives calling for synthetic data underscore the urgency of developing clear guidelines and best practices. Implementations should proceed cautiously, avoiding over-promotion, and relying on methodical, unbiased evaluation.
Limitations
As a scoping review, some relevant studies or methods may have been missed. Taxonomy design and the consolidation of diverse metrics under unified terms introduce subjectivity and may reduce granularity. The domain-specific similarity category aggregates heterogeneous approaches that may not align perfectly. Terminology inconsistencies in the literature (e.g., "fidelity" vs "utility") complicate classification; while the review did not directly separate these constructs, the taxonomy attempts to reflect their distinction via domain-specific similarity.