Introduction
Healthcare departments possess vast amounts of patient data, but privacy concerns hinder its full utilization in machine learning for improved decision-making and outcomes. Synthetic data offers a solution by mimicking the statistical properties of real data without revealing sensitive information. This research explores the reliability of synthetic data for training supervised machine learning models in healthcare. The study aims to compare the performance of models trained on synthetic versus real data, assess the variance in accuracy differences between these models, determine how often the best machine learning technique changes when using synthetic versus real data, and evaluate the impact of statistical disclosure control (SDC) methods on data utility. This research addresses a critical gap in understanding the efficacy and reliability of synthetic data in healthcare, potentially enabling wider data sharing and accelerating the development of improved healthcare solutions.
Literature Review
Traditional data perturbation methods like data swapping and adding noise often fail to eliminate disclosure risk and may reduce data utility, particularly when multivariate relationships are involved. Synthetic data generation, first proposed by Rubin and Little, offers a more secure approach. Raghunathan et al. pioneered multiple imputation for synthetic data generation, followed by Reiter's nonparametric tree-based technique using CART. More recent approaches utilize Bayesian networks. While several synthetic data generators exist, empirical evidence of their efficacy in healthcare is limited. This study builds upon preliminary work to assess whether synthetic data preserves the complex patterns that machine learning can uncover from real data and whether it can serve as a viable alternative for developing eHealth apps and informing healthcare policy.
Methodology
Nineteen open healthcare datasets from the UCI Machine Learning Repository were used, encompassing a range of sizes and data types. Missing values were handled by removing features with many missing values or observations with missing feature values. Synthetic datasets were generated for each real dataset using three generators: CART (Reiter's method), parametric (Raghunathan et al.'s method), and Bayesian networks (Ping et al.'s method), implemented with the R package Synthpop and the Python tool DataSynthesizer. Five supervised machine learning models were trained with Python's Scikit-Learn library: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were trained on both real and synthetic datasets separately; however, testing was always performed on real data to evaluate the real-world performance of models trained on synthetic data. ShuffleSplit cross-validation with 10 iterations and a 75/25 train/test split was employed, and categorical attributes were one-hot encoded. Statistical disclosure control (SDC) methods, including a minimum leaf size for CART, smoothing of numerical attributes, and removal of unique records, were applied to assess their impact on data utility. Pairwise mutual information scores were calculated to compare multivariate relationships in the real and synthetic datasets. Accuracy, precision, recall, and F1 score were used to evaluate model performance, and a chi-square test was used to assess the significance of differences in winning classifiers between models trained on real and synthetic data, with and without SDC.
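The train-on-synthetic, test-on-real evaluation loop can be reproduced with Scikit-Learn alone. The sketch below is illustrative rather than the authors' code: it assumes pandas DataFrames `real_df` and `synth_df` with identical columns and row counts and a label column named `target` (all hypothetical names), one-hot encodes the features, runs ShuffleSplit with 10 iterations and a 75/25 split, trains each of the five classifiers on the real and on the synthetic training rows, and always scores both on the held-out real rows.

```python
# Minimal sketch of the train-on-synthetic, test-on-real evaluation loop.
# Assumptions (not taken from the paper): `real_df` and `synth_df` share the
# same columns and row count, and the label column is called "target".
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

MODELS = {
    "sgd": SGDClassifier,
    "decision_tree": DecisionTreeClassifier,
    "knn": KNeighborsClassifier,
    "random_forest": RandomForestClassifier,
    "svm": SVC,
}

def evaluate(real_df: pd.DataFrame, synth_df: pd.DataFrame, target: str = "target"):
    """Return mean accuracy per model, trained on real vs. synthetic data."""
    # One-hot encode categorical attributes and align columns so both frames match.
    X_real = pd.get_dummies(real_df.drop(columns=[target]))
    X_synth = pd.get_dummies(synth_df.drop(columns=[target]))
    X_real, X_synth = X_real.align(X_synth, join="outer", axis=1, fill_value=0)
    y_real, y_synth = real_df[target].values, synth_df[target].values

    splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
    results = {name: {"real": [], "synthetic": []} for name in MODELS}
    for train_idx, test_idx in splitter.split(X_real):
        # Testing is always performed on held-out *real* data.
        X_test, y_test = X_real.iloc[test_idx], y_real[test_idx]
        for name, Model in MODELS.items():
            m_real = Model().fit(X_real.iloc[train_idx], y_real[train_idx])
            m_synth = Model().fit(X_synth.iloc[train_idx], y_synth[train_idx])
            results[name]["real"].append(accuracy_score(y_test, m_real.predict(X_test)))
            results[name]["synthetic"].append(accuracy_score(y_test, m_synth.predict(X_test)))
    return {name: {k: float(np.mean(v)) for k, v in scores.items()}
            for name, scores in results.items()}
```

The difference between the "real" and "synthetic" mean accuracies per model corresponds to the accuracy gap reported in the findings; hyperparameters are left at Scikit-Learn defaults here, since the study did not extensively tune them.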
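Pairwise mutual information can be computed with `sklearn.metrics.mutual_info_score` over all column pairs, and comparing the resulting matrices for the real and synthetic datasets indicates how well multivariate relationships are preserved. This is a minimal sketch under the assumption that all attributes are discrete; continuous attributes would first need binning (e.g. with `pd.cut`), and the exact procedure used in the study may differ.

```python
# Sketch of a pairwise mutual information comparison between a real and a
# synthetic dataset. Assumes discrete/categorical columns only.
import itertools
import pandas as pd
from sklearn.metrics import mutual_info_score

def pairwise_mutual_information(df: pd.DataFrame) -> pd.DataFrame:
    """Symmetric matrix of mutual information scores between all column pairs."""
    cols = df.columns
    mi = pd.DataFrame(0.0, index=cols, columns=cols)
    for a, b in itertools.combinations_with_replacement(cols, 2):
        score = mutual_info_score(df[a], df[b])
        mi.loc[a, b] = mi.loc[b, a] = score
    return mi

# Example comparison: mean absolute gap between the two matrices
# (real_df and synth_df are the hypothetical frames from the previous sketch).
# mi_gap = (pairwise_mutual_information(real_df) -
#           pairwise_mutual_information(synth_df)).abs().mean().mean()
```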
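The chi-square test on winning classifiers can be set up as a contingency table counting, for each classifier, how often it was the best performer across the datasets when trained on real versus synthetic data. The study does not spell out this construction here, so the table below uses hypothetical placeholder counts purely to show the mechanics.

```python
# Hedged sketch of a chi-square comparison of "winning" classifiers.
# The counts are hypothetical placeholders, not results from the study.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: training data (real, synthetic); columns: one count per classifier
# (sgd, decision tree, knn, random forest, svm), summing to 19 datasets each.
wins = np.array([
    [2, 3, 4, 6, 4],   # winners when trained on real data
    [3, 5, 3, 4, 4],   # winners when trained on synthetic data
])
chi2, p_value, dof, expected = chi2_contingency(wins)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```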
Key Findings
The study found that 92% (263/285) of models trained on synthetic data showed lower accuracy than their counterparts trained on real data when both were tested on real data. However, the mean absolute difference in accuracy was small, ranging from 0.058 (6%) for the support vector machine to 0.193 (19%) for the decision tree. Tree-based models (decision tree and random forest) exhibited larger deviations than the other models. Precision, recall, and F1 scores also decreased when synthetic data were used for training, with the largest decreases again in tree-based models and the greatest variance in models trained on Bayesian network-generated data. The winning classifier (best-performing model) when trained and tested on real data matched the winning classifier trained on synthetic data in only 26% (CART and parametric) and 21% (Bayesian network) of cases when all five models were considered; agreement rose to 74%, 53%, and 68% for the three generators when tree-based models were excluded from the comparison. Statistical disclosure control methods did not notably affect data utility: the average decrease in accuracy remained small across all machine learning models and SDC techniques, and tree-based models consistently showed the largest decreases in accuracy, both with and without SDC.
Discussion
The findings demonstrate that synthetic data can be a viable alternative to real data for training machine learning models in healthcare, despite some performance reduction. The small, consistent differences in accuracy suggest this decrease is manageable. The increased sensitivity of tree-based models to synthetic data warrants further investigation. The minimal impact of SDC suggests a balance can be achieved between privacy protection and data utility. The study highlights the potential of synthetic data to facilitate data sharing and accelerate research, particularly in situations where access to real data is restricted due to privacy concerns.
Conclusion
This study shows that synthetic data can effectively train supervised machine learning models for healthcare applications, with small, manageable decreases in accuracy. Tree-based models showed greater sensitivity. SDC methods did not significantly reduce data utility. Future research should explore a wider range of algorithms and datasets, including real healthcare data, while also rigorously measuring disclosure risk and the trade-off between data utility and privacy.
Limitations
The study used open healthcare datasets, which may not fully represent the complexity and heterogeneity of real-world healthcare data. The range of machine learning algorithms was limited, and hyperparameter optimization was not extensively explored. Disclosure risk was not directly quantified, and a more detailed exploration of the trade-off between data utility and disclosure risk is needed.