Can We Trust Synthetic Data in Medicine? A Scoping Review of Privacy and Utility Metrics

Medicine and Health

B. Kaabachi, J. Despraz, et al.

This research by Bayrem Kaabachi, Jérémie Despraz, Thierry Meurers, Karen Otte, Mehmed Halilovic, Fabian Prasser, and Jean Louis Raisaro tackles the critical balance between sharing health data for research and protecting patient privacy. Through a scoping review, the study uncovers a significant lack of standard methods for evaluating the privacy and utility of synthetic data, and calls for greater awareness and uniform evaluation approaches in medical data sharing.

Introduction
Access to high-quality data is critical for impactful medical research and practice, particularly with the rise of AI/ML and precision medicine. However, privacy laws such as HIPAA and GDPR constrain the sharing of medical data, and traditional anonymization can severely reduce the utility of high-dimensional datasets. Synthetic data, artificial data generated to mimic the statistical properties of real patient data (often via generative adversarial networks, GANs), has been proposed as a way to enable data sharing while protecting privacy. Despite enthusiasm and early successes replicating medical case studies with synthetic data, recent work highlights risks, including membership inference attacks and model memorization. A trade-off emerges: data that closely mimics the original tends to be useful but may be less private, and vice versa. The black-box nature of GANs makes it hard to anticipate what utility is lost and what sensitive information may leak. Prior surveys addressed generation methods and some evaluation comparisons, and open-source evaluation toolkits exist, but inconsistent nomenclature and contradictory perspectives hinder harmonization. This scoping review focuses on evaluation methodologies for synthetic health-related data, emphasizing the privacy-utility trade-off across 92 studies, guided by two questions: (1) Is there consensus on how to evaluate privacy and utility? (2) Are privacy and utility given the same importance in assessment?
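To make the membership inference risk concrete, below is a minimal sketch of a distance-based attack on toy data. It illustrates the general attack class only; the function names, the toy "leaky" generator, and the threshold are all assumptions, not methods from the reviewed studies.

```python
# Minimal sketch of a distance-based membership inference test (illustrative;
# data, names, and threshold are assumptions, not any reviewed paper's method).
import numpy as np

def min_distances(records: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Distance from each record to its nearest synthetic neighbour."""
    diffs = records[:, None, :] - synthetic[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def membership_advantage(members, non_members, synthetic, threshold):
    """Guess 'member' when a record lies unusually close to the synthetic data.
    Advantage = true-positive rate minus false-positive rate; near 0 is safe."""
    tpr = (min_distances(members, synthetic) < threshold).mean()
    fpr = (min_distances(non_members, synthetic) < threshold).mean()
    return tpr - fpr

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 5))                          # stand-in for real training data
leaky = train + rng.normal(scale=0.05, size=train.shape)   # generator that memorizes
holdout = rng.normal(size=(200, 5))                        # records the generator never saw
print(membership_advantage(train, holdout, leaky, threshold=0.3))  # close to 1
```

A generator that generalizes rather than memorizes would push this advantage toward zero, which is precisely what the evaluations surveyed here are meant to verify.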
Literature Review
The paper situates its work within existing literature by noting prior surveys on synthetic data generation and evaluation (e.g., Figueira et al. on generation methods; Hernandez et al. on evaluation comparisons). It also references open-source evaluation platforms (Synthetic Data Vault, Table Evaluator, TAPAS) that provide metrics for utility and privacy but use differing terminologies, complicating standardization. The authors identify contradictory perspectives in the literature regarding privacy and utility of synthetic data and argue that, compared to anonymization research with extensive attack and protection literature, synthetic data has not been scrutinized as thoroughly. This motivates their systematic scoping review emphasizing evaluation processes and the privacy-utility trade-off.
Methodology
The authors conducted a scoping review following PRISMA-ScR principles. Searches covered publications from January 2018 to December 2022 across IEEE Xplore, the ACM Digital Library, PubMed, and Embase to capture both technical (computer science) and healthcare-focused studies. Queries targeted titles and abstracts for synthetic data evaluation of utility or privacy, with strategies refined through team discussions; searches were last executed on August 14, 2023. Records identified: IEEE Xplore (n=123), ACM (n=52), PubMed (n=57), Embase (n=75); duplicates removed: n=36; reports assessed for eligibility: n=271, with none unretrievable. Exclusions comprised unstructured data (n=90), no evaluation (n=19), meta-analyses/surveys (n=7), poster/abstract-only (n=11), and not related (n=52), leaving 92 included studies (see the sketch below for the flow arithmetic). Inclusion criteria: publications that apply synthetic data generation (SDG) methods and evaluate the outputs; published 2018–2022; structured data only. Exclusion criteria: reviews/surveys; non-English publications; no assessment of utility or privacy; unstructured data; poster abstracts. Discrepancies in study selection were resolved by author consensus. A standardized data-charting form defined the variables to extract (e.g., metrics, privacy types, SDG method), enabling quantitative synthesis. The team developed a taxonomy of evaluation methods covering utility (univariate, bivariate, multivariate, longitudinal, domain-specific) and privacy (dataset- vs. model-based evaluation). Figures in the paper depict the taxonomies and the PRISMA flow. The review also categorized data type (cross-sectional vs. longitudinal), purpose (privacy-preserving data sharing vs. augmentation), and generation method (GANs, statistical modeling, software tools).
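As a quick sanity check, the screening counts above are internally consistent. The sketch below (assuming all four database exports feed a single deduplication and screening stage) reproduces the flow:

```python
# Sanity check of the PRISMA flow counts reported above (assumption: all four
# database exports feed one deduplication and one screening stage).
identified = {"IEEE Xplore": 123, "ACM": 52, "PubMed": 57, "Embase": 75}
duplicates_removed = 36
excluded = {"unstructured data": 90, "no evaluation": 19,
            "meta-analyses/surveys": 7, "poster/abstract-only": 11,
            "not related": 52}

assessed = sum(identified.values()) - duplicates_removed  # 307 - 36 = 271
included = assessed - sum(excluded.values())              # 271 - 179 = 92
assert (assessed, included) == (271, 92)
print(f"assessed for eligibility: {assessed}, included: {included}")
```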
Key Findings
- Scope and trends: 92 studies were included, with interest increasing over time (2018: 4.35% [4/92]; 2019: 10.87% [9/92]; 2022: 43.48% [40/92]). Most studies used cross-sectional data (70%, 64/92); 26% (24/92) used temporal/longitudinal data.
- Generation methods: 49 distinct SDG methods were identified; about 35% (17/49) used GANs, with the rest relying on statistical modeling or software tools (e.g., Synthpop, MDClone).
- Utility evaluation: 86/92 studies (94%) evaluated utility. Across the 48 utility methods identified: 17 univariate, 6 bivariate, 9 multivariate, 9 longitudinal, and 8 domain-specific. The most common metrics were ML classification performance (multivariate, used 33 times), descriptive statistics comparison (univariate, 25 times), and visual comparison of marginal distributions (univariate, 17 times); a minimal sketch of these common checks follows after this list.
- Privacy evaluation: 9 privacy methods were identified. Among studies aiming at privacy-preserving data sharing (about 80%; the paper reports both 74/92 and 77/92), only 15 applied added protection such as differential privacy or masking. Of the remainder (63), only 24 (38%) conducted at least one privacy evaluation, indicating frequent unverified assumptions of privacy. Among the 24 studies that evaluated privacy, 84% (20/24) used dataset-based evaluations, 8% (2/24) used model/mechanism-based evaluations (e.g., GAN-focused attacks or shadow models), and 8% (2/24) used both. The Sankey diagram summaries also report: dataset-based evaluation (18), model-based (2), both (2); 39 studies did not evaluate privacy.
- Overall landscape: there is no consensus on specific metrics or metric families for either utility or privacy, and this heterogeneity impedes comparability. The authors warn of an alarming tendency to trust the safety of synthetic data without adequate evaluation.
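To show what the two most-used utility metric families typically look like in practice, here is a minimal sketch: a per-column Kolmogorov-Smirnov comparison of marginals (a common univariate descriptive comparison) and a "train on synthetic, test on real" (TSTR) classifier score (a common instantiation of ML classification performance). The function names, model choice, and toy data are assumptions, not the reviewed studies' exact methods.

```python
# Illustrative versions of the two most-used utility metric families; the
# names, model, and toy data are assumptions, not the studies' exact methods.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def univariate_utility(real: np.ndarray, synth: np.ndarray) -> list:
    """Per-column Kolmogorov-Smirnov statistic; 0 means identical marginals."""
    return [ks_2samp(real[:, j], synth[:, j]).statistic
            for j in range(real.shape[1])]

def tstr_auc(synth_X, synth_y, real_X, real_y) -> float:
    """Train on synthetic, test on real (TSTR): a multivariate utility score."""
    model = LogisticRegression(max_iter=1000).fit(synth_X, synth_y)
    return roc_auc_score(real_y, model.predict_proba(real_X)[:, 1])

rng = np.random.default_rng(1)
real = rng.normal(size=(500, 3))
synth = rng.normal(loc=0.1, size=(500, 3))  # generator with a slight bias
print(univariate_utility(real, synth))      # small but nonzero KS statistics
```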
Discussion
The review addresses RQ1 by demonstrating a lack of consensus on standardized metrics for evaluating both utility and privacy. The proliferation of heterogeneous metrics and taxonomies across studies hinders fair comparison and selection of synthetic data generation methods for practical deployment. This fragmentation is pronounced for privacy assessments, where legal interpretations (e.g., GDPR-defined risks like singling out, linkage, inference) are not yet clearly mapped to accepted, validated metrics, though some attempts exist. The authors argue that as technology and regulatory frameworks mature, the field should converge on standardized, trustworthy evaluation approaches. For RQ2, the findings indicate that utility is prioritized over privacy in practice: while most studies extensively evaluate utility, far fewer conduct rigorous privacy assessments, even when privacy-preserving data sharing is the stated purpose. Many works rely on implicit assumptions of inherent privacy without empirical verification, risking inappropriate data sharing and potential breaches. Where privacy is evaluated, studies predominantly employ dataset-based methods rather than mechanism-level, worst-case risk assessments. The authors emphasize the need for balanced, thorough evaluation of both utility and privacy, particularly for complex models like GANs, to maintain stakeholder trust in critical domains like healthcare.
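The dataset-level versus mechanism-level distinction can be sketched in code. Dataset-based checks inspect one released dataset (as in the membership inference sketch above), whereas a mechanism-level check probes the generator itself, for instance with a leave-one-out audit in the spirit of differential privacy testing. Everything below, including the `fit_and_sample` toy generator, is a hypothetical illustration rather than a method from the reviewed studies.

```python
# Heuristic mechanism-level audit: regenerate synthetic data with and without a
# target record and measure how much that record's neighbourhood shifts.
# All names and the toy generator are illustrative assumptions.
import numpy as np

def loo_audit(train: np.ndarray, target_idx: int, fit_and_sample, n_runs: int = 20):
    """Average gap in nearest-synthetic-record distance to the target when it is
    excluded vs. included; large positive gaps suggest the mechanism leaks that
    individual's presence (a heuristic, not a worst-case bound)."""
    target = train[target_idx]
    without = np.delete(train, target_idx, axis=0)
    gaps = []
    for seed in range(n_runs):
        s_with = fit_and_sample(train, seed)       # target included
        s_without = fit_and_sample(without, seed)  # target excluded
        nearest = lambda s: np.linalg.norm(s - target, axis=1).min()
        gaps.append(nearest(s_without) - nearest(s_with))
    return float(np.mean(gaps))

def fit_and_sample(data: np.ndarray, seed: int) -> np.ndarray:
    """Toy 'generator' that memorizes: resample rows and add a little noise."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(data), size=100)
    return data[rows] + rng.normal(scale=0.1, size=(100, data.shape[1]))

rng = np.random.default_rng(2)
print(loo_audit(rng.normal(size=(50, 4)), target_idx=0, fit_and_sample=fit_and_sample))
```

A privacy-respecting mechanism would keep this gap near zero regardless of which record is removed, the kind of worst-case reasoning that dataset-only evaluations cannot provide.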
Conclusion
The scoping review synthesizes the state of evaluation for synthetic health-related data, revealing substantial heterogeneity and a lack of consensus on metrics for utility and privacy. The authors provide taxonomies for utility and privacy evaluation methods to structure the landscape and highlight gaps, notably the under-evaluation of privacy despite privacy-preserving goals. They call for standardized, validated metrics and guidelines to enable fair comparisons, foster adoption, and ensure safe use of synthetic data in healthcare. Future work should develop robust, legally informed frameworks for privacy risk assessment, improve transparency for complex generators (e.g., GANs), and build community consensus through initiatives and best-practice guidance to make synthetic data evaluation more accessible, reliable, and trustworthy.
Limitations
The review may have missed relevant studies or methods. Standardizing and categorizing diverse metrics across papers introduced subjectivity and could limit granularity. The broad “domain-specific similarity” category groups heterogeneous approaches and may not precisely fit each case. Terminology inconsistencies in the field (e.g., conflating fidelity and utility) complicate classification; although the taxonomy attempts to reflect distinctions, some ambiguity remains.