Introduction
Access to high-quality medical data is essential for advancing medical research and practice, particularly with the rise of AI and machine learning in precision medicine. However, strict data protection laws such as HIPAA and the GDPR severely restrict access to such data due to privacy concerns. Data anonymization, the most common workaround, often reduces data utility so much that the result is of little use for research. Synthetic data generation offers a potential solution by creating artificial data that mimics the statistical properties of real patient data, and Generative Adversarial Networks (GANs) have emerged as a powerful technique for producing realistic synthetic records. While synthetic data is a promising avenue for privacy-preserving data sharing, recent studies warn against treating it as a simple drop-in solution. A key concern is the privacy-utility trade-off: a synthetic dataset that closely mimics the original may offer little privacy protection, while one that differs substantially may protect privacy well but retain limited utility. The black-box nature of GANs makes it difficult to predict either the loss of data utility or the potential leakage of sensitive information. Because synthetic data has so far received less thorough scrutiny than anonymized data, this scoping review focuses specifically on evaluation methodologies and the privacy-utility trade-off.
Literature Review
Several surveys have examined synthetic data generation and evaluation methods. Figueira et al. provided a comprehensive description of generation methods, while Hernandez et al. explored evaluation methods and compared their performance. The present review differs by focusing on the evaluation process and the privacy-utility trade-off, systematically examining how synthetic data is evaluated across 92 studies. Open-source solutions such as Synthetic Data Vault, Table Evaluator, and TAPAS offer evaluation metrics and methods, but inconsistent terminology makes it difficult to form a unified perspective, and contradictory claims in the literature further muddy the understanding of synthetic data in healthcare.
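To make the kind of column-level metric these toolkits expose concrete, here is a minimal sketch, not taken from any of the tools above, that scores the similarity of one numeric column's marginal distribution as one minus the two-sample Kolmogorov-Smirnov statistic; the column name and data are hypothetical.

    import numpy as np
    import pandas as pd
    from scipy.stats import ks_2samp

    def marginal_similarity(real: pd.Series, synthetic: pd.Series) -> float:
        """1.0 means the two marginal distributions are indistinguishable by the
        KS test; values near 0.0 mean they differ substantially."""
        statistic, _ = ks_2samp(real.dropna(), synthetic.dropna())
        return 1.0 - statistic

    # Hypothetical numeric column ("age") in a real and a synthetic table.
    rng = np.random.default_rng(0)
    real = pd.DataFrame({"age": rng.normal(55, 12, size=1000)})
    synthetic = pd.DataFrame({"age": rng.normal(54, 14, size=1000)})
    print(round(marginal_similarity(real["age"], synthetic["age"]), 3))

Scores of this kind are cheap to compute per column and can be averaged across a table, which is one reason univariate similarity metrics appear so frequently in the reviewed studies.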
Methodology
This scoping review followed the PRISMA-ScR methodology. A comprehensive search of the IEEE Xplore, ACM Digital Library, PubMed, and Embase databases was conducted for the period January 2018 to December 2022. Inclusion criteria covered publications describing synthetic data generation methods and their evaluation, focusing on structured data and published in English. Exclusion criteria covered surveys, reviews, unstructured data, studies lacking any assessment of utility or privacy, and poster abstracts. Discrepancies in study selection were resolved through discussion and consensus, and a data-charting form was used to extract specific variables from the selected publications. To standardize the analysis, a common ontology of methods and metrics for assessing synthetic data was created and organized into a taxonomy of evaluation methods with utility and privacy categories.
Key Findings
The review included 92 studies. After reconciling semantically similar methods, 48 utility and 9 privacy evaluation methods were identified. The number of publications increased markedly from 2018 to 2022, indicating growing interest in synthetic data. Most studies (70%) used cross-sectional data, while only 26% used longitudinal data. Approximately 35% of studies used GANs; the remaining 65% used other generation methods. Most studies (94%) evaluated data utility, and the most frequently used utility evaluation methods were ML classification performance (multivariate similarity, 33 instances), comparison of descriptive statistics (univariate similarity, 25 instances), and visual comparison of marginal distributions (univariate similarity, 17 instances). Privacy preservation was the main incentive for most studies (80%), with the remainder generating synthetic data for augmentation. Of the studies aiming at privacy preservation, only 38% included a privacy evaluation, and the majority of those that did (84%) relied on dataset-based evaluation, while a small number used model-based evaluation or both. Overall, there is no consensus on standardized metrics for evaluating the privacy and utility of synthetic data.
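ML classification performance, the most frequently used utility method above, is commonly operationalized as a train-on-synthetic, test-on-real comparison: a model fitted on synthetic records should perform about as well on held-out real records as a model fitted on real records. The sketch below illustrates that idea with scikit-learn; the classifier, metric, split, and assumption of a binary outcome are choices made for illustration, not a procedure prescribed by the reviewed studies.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def utility_gap(X_real, y_real, X_syn, y_syn, seed=0):
        """Train one classifier on real data and one on synthetic data, then score
        both (ROC AUC, binary labels assumed) on the same held-out real test split."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_real, y_real, test_size=0.3, random_state=seed, stratify=y_real)
        real_model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        syn_model = RandomForestClassifier(random_state=seed).fit(X_syn, y_syn)
        auc_real = roc_auc_score(y_te, real_model.predict_proba(X_te)[:, 1])
        auc_syn = roc_auc_score(y_te, syn_model.predict_proba(X_te)[:, 1])
        return auc_real, auc_syn

A small gap between the two scores suggests the synthetic data preserved the predictive signal of the original; a large gap indicates utility loss.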
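On the privacy side, dataset-based evaluation, the approach taken by 84% of the studies that evaluated privacy at all, compares synthetic records directly against the real training records rather than probing the generative model. One common example of such a check is a distance-to-closest-record test; the sketch below, assuming purely numeric data and standard scaling, flags synthetic rows that sit suspiciously close to a real row.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    def distance_to_closest_record(real, synthetic):
        """For each synthetic row, the Euclidean distance to its nearest real row
        after standard scaling; many near-zero distances hint at copied or
        memorized records."""
        scaler = StandardScaler().fit(real)
        nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
        distances, _ = nn.kneighbors(scaler.transform(synthetic))
        return distances.ravel()

    # Hypothetical numeric tables; the last synthetic row is a verbatim copy.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 5))
    synthetic = np.vstack([rng.normal(size=(499, 5)), real[:1]])
    print(distance_to_closest_record(real, synthetic).min())  # ~0 exposes the copy

Model-based evaluation, by contrast, interrogates the trained generator itself, for example via membership-inference attacks, rather than only its output.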
Discussion
The findings indicate a lack of consensus on standardized metrics for evaluating the privacy and utility of synthetic data, which hinders fair comparison of different generation methods. This is particularly apparent in privacy evaluations, which are also hampered by the absence of clear legal guidance on how the privacy risk of synthetic data should be assessed. The under-evaluation of privacy is a significant concern, with over half of the studies claiming privacy preservation failing to conduct any formal privacy evaluation. The choice of evaluation methods depends on research objectives, implementation complexity, interpretability, data structure, and model generalizability; simpler methods are often preferred for their ease of implementation, and visual comparison methods tend to be favored when presenting results to stakeholders.
Conclusion
This review highlights the need for standardized evaluation measures for synthetic health data to allow for fair comparisons between different generation methods. Robust privacy evaluations are crucial, especially in healthcare, to avoid data leaks and maintain trust. The complexity of methods like GANs can hinder adoption or lead to misinformed usage. International initiatives emphasize the urgent need for clear guidelines and reliable frameworks for synthetic data in healthcare. This review aims to stimulate a collaborative effort to establish best practices for safe and reliable synthetic data generation and evaluation.
Limitations
This scoping review may have missed some relevant studies or methods. The categorization of methods and the consolidation of terminology involved subjective judgment, which may have affected how methods were grouped. In particular, the 'domain-specific similarity' category encompasses diverse methods and may lack precision. The terms 'fidelity' and 'utility' were sometimes used interchangeably in the reviewed literature, despite arguments that they represent distinct metrics; this was only partially addressed in the taxonomy under the 'domain-specific similarity' category.