Introduction
Access to high-quality medical data is essential for research and practice, particularly with the rise of AI and machine learning in precision medicine. However, privacy regulations like HIPAA and GDPR restrict access to real patient data. Data anonymization, while a common approach, often compromises data utility, rendering it unusable for research. Synthetic data generation, particularly using Generative Adversarial Networks (GANs), offers a potential solution by creating artificial data that mirrors the statistical properties of real data without containing actual patient information. While promising, concerns exist regarding the potential for residual privacy risks in synthetic data, and the lack of a standardized evaluation approach hampers its widespread adoption. This study aims to systematically review existing methods for evaluating the privacy and utility of synthetic health data to identify gaps and propose future directions.
Literature Review
Several surveys have examined synthetic data generation methods and evaluation techniques. Figueira et al. (2022) provide a comprehensive overview of generation methods, while Hernandez et al. (2022) explore and compare evaluation methods. However, these studies lack a systematic analysis of the evaluation process and the privacy-utility trade-off. Existing open-source solutions like Synthetic Data Vault, Table Evaluator, and TAPAS offer evaluation metrics, but inconsistencies in terminology hinder a unified understanding. Contradictory perspectives in the literature further complicate the development of a consensus on synthetic data evaluation in healthcare.
Methodology
This scoping review followed the PRISMA-ScR methodology. A comprehensive search of the IEEE Xplore, ACM Digital Library, PubMed, and Embase databases (January 2018 to December 2022) was conducted using keywords focused on synthetic data generation and evaluation. Inclusion criteria were publications, in English, from 2018 to 2022 that evaluated the privacy or utility of synthetic data generated from structured data. Exclusion criteria encompassed surveys, reviews, studies on unstructured data, studies lacking any evaluation of synthetic data, and non-English publications. Two authors independently screened articles, resolving discrepancies by consensus. A common ontology was created to classify evaluation methods into utility and privacy taxonomies. The utility taxonomy covers univariate, bivariate, multivariate, longitudinal, and domain-specific similarity; the privacy taxonomy distinguishes dataset-based evaluation from model-based evaluation. Data were extracted with a standardized form capturing predefined variables from the selected publications.
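To make the dataset-based privacy category concrete, the following is a minimal sketch of a distance-to-closest-record (DCR) check, one common dataset-based metric. The toy data and function name are illustrative assumptions, not a method prescribed by the review, and real implementations must handle mixed data types and scaling:

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    Many near-zero distances suggest the generator may have copied
    (memorized) real records, a dataset-based privacy red flag.
    """
    # Pairwise differences via broadcasting: shape (n_synth, n_real, n_features)
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

# Toy example: the first synthetic row duplicates a real record exactly.
real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
synthetic = np.array([[1.0, 1.0], [0.4, 0.6]])

dcr = distance_to_closest_record(synthetic, real)
print(dcr)  # first entry is 0.0 -> exact copy of a real record
```

In practice the DCR distribution of the synthetic set is compared against a holdout of real records; here the point is only the mechanics of the dataset-based check.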
Key Findings
The review included 92 studies, identifying 48 distinct methods for evaluating the utility of synthetic data and 9 for evaluating privacy. The most commonly used utility methods were ML classification performance (multivariate similarity, 33 instances), descriptive statistics comparison (univariate similarity, 25 instances), and visual comparison of marginal distributions (univariate similarity, 17 instances). Regarding privacy, a striking finding was the tendency to trust the safety of synthetic data without proper evaluation: of the 77 studies aiming to use synthetic data for privacy protection, 63 did not include any formal privacy evaluation, and only 24 of the 92 studies performed any privacy evaluation at all. Of these 24 studies, 84% (20/24) used dataset-based evaluation and only 8% (2/24) used model-based evaluation. There is no consensus on standardized metrics for evaluating either privacy or utility. The majority of included studies (70%) used cross-sectional data, with only 26% using longitudinal data. Approximately 35% of the studies used GANs for synthetic data generation, while the remainder employed a mix of other methods. Most studies (80%) aimed to use synthetic data for privacy-preserving data sharing, while 16% used it for data augmentation.
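The most common utility method above, ML classification performance, is often operationalized as "train on synthetic, test on real" (TSTR): a classifier fitted on synthetic data should score comparably on held-out real data to one fitted on real data. The sketch below illustrates the idea on fabricated Gaussian toy data; the data generator, scikit-learn dependency, and notion of a "small gap" are assumptions for illustration, not taken from the review:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(n, shift):
    """Two Gaussian classes, class 1 shifted by `shift` in each feature."""
    X0 = rng.normal(0.0, 1.0, size=(n // 2, 2))
    X1 = rng.normal(shift, 1.0, size=(n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

# "Real" data, and a "synthetic" set standing in for a generator's output.
X_real, y_real = make_data(400, shift=2.0)
X_synth, y_synth = make_data(400, shift=2.0)

# Hold out real data for testing under both conditions.
X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

acc_trtr = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)        # train real, test real
acc_tstr = LogisticRegression().fit(X_synth, y_synth).score(X_te, y_te)  # train synthetic, test real

print(f"TRTR accuracy: {acc_trtr:.2f}, TSTR accuracy: {acc_tstr:.2f}")
# A small TRTR-TSTR gap suggests the synthetic data preserves the
# predictive structure of the real data.
```

Note that a high TSTR score alone says nothing about privacy; it is precisely the kind of utility-only evidence the review found to dominate the literature.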
Discussion
The lack of consensus on standardized metrics for evaluating the privacy and utility of synthetic data is a major barrier to its adoption in healthcare. The wide variety of metrics employed makes it difficult to compare findings across studies. The significant under-evaluation of privacy is particularly concerning and points to the need for a more balanced treatment of privacy alongside utility. The choice of evaluation methods is influenced by research objectives, implementation complexity, interpretability, data structure, and model generalizability. The study also notes the interchangeable use of "fidelity" and "utility" in the literature, highlighting the need for a clearer distinction between the two concepts.
Conclusion
This review highlights the need for standardized evaluation methods for both the privacy and utility of synthetic health data. The current lack of consensus hinders meaningful comparison and adoption of new techniques. Robust privacy evaluations are crucial, especially considering the sensitive nature of health data. Future research should focus on developing and validating uniform evaluation approaches that address both privacy and utility concerns, fostering trust and facilitating the responsible use of synthetic data in healthcare. International initiatives call for clear guidelines and reliable frameworks for synthetic data, emphasizing the need for this collaborative effort.
Limitations
This scoping review might not have captured all relevant studies or methods. Subjectivity in categorization of evaluation methods, particularly "domain-specific similarity," could exist. The interchangeable use of "fidelity" and "utility" in the literature was not explicitly addressed. The review also focuses on structured data, excluding studies involving images or text data.