Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Computer Science

D. Rankin, M. Black, et al.

This study, conducted by Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, and Gorka Epelde, reveals insights about the performance of machine learning models trained on synthetic healthcare data. It shows that while synthetic data can be useful, real data still holds a significant edge in accuracy, particularly with tree-based models. The research underlines the balance between privacy and data utility.

Introduction
The study addresses whether fully synthetic data can serve as a reliable alternative to real, sensitive health care data for developing supervised machine learning models. Motivated by significant privacy and governance barriers that restrict access to health data, the authors aim to quantify the performance differential between models trained on synthetic versus real data, assess how often the best-performing (winning) classifier aligns across real and synthetic training conditions, and evaluate the impact of statistical disclosure control (SDC) measures on data utility. The work’s importance lies in enabling broader, faster, and privacy-preserving development of data-driven tools for clinicians and policy makers by using synthetic datasets that preserve key statistical properties while minimizing disclosure risks. The authors also present a conceptual pipeline for integrating synthetic data sharing into health care provider environments to accelerate analytics while validating final models on real data within secure infrastructures.
Literature Review
The background reviews the evolution of synthetic data as a privacy-preserving alternative to traditional deidentification and perturbation techniques (eg, swapping, masking, suppression, noise addition), which may inadequately protect privacy and degrade multivariate utility. Foundational approaches include Rubin and Little’s original proposals, multiple imputation–based parametric synthesis (Raghunathan et al.), and Reiter’s nonparametric CART-based synthesis; more recent work includes differentially private Bayesian network synthesis (Ping et al.). The paper highlights extensive prior deployment of synthetic data by the US Census Bureau (eg, SIPP Synthetic Beta, Synthetic Longitudinal Business Database, OnTheMap) and UK longitudinal studies to broaden access while maintaining privacy, with processes for validating analyses against secure “gold standard” data. Despite success in economics and business research, similar large-scale health care applications are limited. The authors argue for assessing synthetic data’s validity and disclosure risk in health settings, given increasing AI/ML capabilities and the difficulty of timely data access for research. Prior evidence of machine learning with synthetic data is noted but limited in breadth and systematic evaluation, motivating this multi-dataset assessment and consideration of SDC measures.
Methodology
Datasets: Nineteen open health datasets from the UCI Machine Learning Repository were selected, encompassing both categorical and numerical attributes with varying sizes and class structures. Missing values were handled by removing features with high missingness or observations containing missing values.

Synthetic data generation: Three public implementations were used, representing well-established synthesis paradigms (a minimal generation sketch appears after this section):
- Parametric synthesis (R synthpop): attributes are synthesized sequentially by type, using normal linear regression for numerical attributes, polytomous logistic regression for multi-level categorical attributes, and logistic regression for binary attributes. The first attribute in the sequence, having no predictors, is drawn as a random sample from the observed data.
- Nonparametric CART synthesis (R synthpop): attributes are synthesized sequentially using CART models fitted on previously synthesized predictors; synthetic values are sampled from the conditional distributions derived from the fitted trees.
- Bayesian network synthesis (DataSynthesizer, Python): a differentially private Bayesian network capturing inter-attribute dependencies is learned, and synthetic records are sampled from the fitted model.

Preservation of relationships: For each real dataset and its synthetic counterparts, normalized pairwise mutual information scores were computed for all attribute pairs to visualize and assess how well multivariate associations were preserved (see the pairwise mutual information sketch below).

Machine learning evaluation: For each dataset, five classifiers were trained: stochastic gradient descent (SGDClassifier, hinge loss), decision tree (DecisionTreeClassifier; Gini, max_depth=10, random_state=0), k-nearest neighbors (KNeighborsClassifier; n_neighbors=10, leaf_size=30, weights=uniform, p=2, metric=minkowski, n_jobs=2), random forest (RandomForestClassifier; Gini, max_depth=10, min_samples_split=2, n_estimators=10, random_state=1), and support vector machine (SVC; RBF kernel, C=1.0, degree=3, probability=True). Categorical variables were one-hot encoded. Evaluation used ShuffleSplit with 10 iterations and 75/25 train/test splits. Benchmark models were trained and tested on real data; synthetic-trained models (one per generator) were always tested on real data to simulate deployment. Metrics were accuracy, precision, recall, and F1 score, and mean absolute differences were computed relative to the real-trained benchmarks. The “winning classifier” for each dataset and training condition was the highest-accuracy model. (An evaluation-loop sketch appears below.)

Statistical disclosure control (SDC): To further mitigate disclosure risks, the following rules-based SDC measures were applied where applicable and their impact on utility assessed (see the SDC sketch below):
- Minimum leaf size (CART synthesis only): the final leaf node size was constrained to 10 to avoid small leaves that may reproduce near-real records.
- Smoothing (numeric attributes only): Gaussian kernel density smoothing was applied to continuous synthetic attributes to reduce the risk of replicating outliers.
- Unique record removal: synthetic records exactly matching unique real records were removed.
SDC was applied to CART- and parametric-synthesized datasets but not to Bayesian network data. Changes in performance metrics and winning-classifier alignment were analyzed with and without SDC. Chi-square tests assessed differences in the proportion of matching winning classifiers across classifier sets (all five, with DT removed, and with both DT and RF removed) at α=0.05.
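As a concrete illustration of the Bayesian network synthesis step, a minimal sketch using DataSynthesizer's correlated attribute mode is shown below. The file paths, privacy budget (epsilon), network degree (k), and category threshold are illustrative assumptions rather than values reported in the study, and the exact argument names may differ across DataSynthesizer versions.

```python
# Minimal sketch of Bayesian network synthesis with DataSynthesizer
# (correlated attribute mode). All file names and parameter values are
# illustrative assumptions, not settings reported in the paper.
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

real_csv = "real_dataset.csv"            # hypothetical input path
description_file = "description.json"    # learned model description
synthetic_csv = "synthetic_dataset.csv"  # hypothetical output path
num_rows = 1000                          # number of synthetic records to draw

# Learn a differentially private Bayesian network over the attributes.
describer = DataDescriber(category_threshold=20)
describer.describe_dataset_in_correlated_attribute_mode(
    dataset_file=real_csv,
    epsilon=1.0,   # differential privacy budget (illustrative)
    k=2,           # maximum number of parents per node (illustrative)
)
describer.save_dataset_description_to_file(description_file)

# Sample synthetic records from the fitted network.
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(num_rows, description_file)
generator.save_synthetic_data(synthetic_csv)
```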
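The relationship-preservation check can be approximated with pandas and scikit-learn. The sketch below is one plausible implementation, not the authors' code: numeric attributes are discretized into equal-width bins (an assumption, since the summary does not state a discretization scheme) before normalized mutual information is computed for every attribute pair.

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(df: pd.DataFrame, bins: int = 10) -> pd.DataFrame:
    """Normalized mutual information for every attribute pair.

    Numeric columns are discretized into equal-width bins so the discrete
    NMI estimator can be applied uniformly (an illustrative assumption).
    """
    encoded = df.copy()
    for col in encoded.columns:
        if pd.api.types.is_numeric_dtype(encoded[col]):
            encoded[col] = pd.cut(encoded[col], bins=bins, labels=False)
        else:
            encoded[col] = encoded[col].astype("category").cat.codes
    cols = encoded.columns
    nmi = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for a in cols:
        for b in cols:
            nmi.loc[a, b] = normalized_mutual_info_score(encoded[a], encoded[b])
    return nmi

# Comparing a real dataset with one synthetic counterpart (hypothetical frames):
# real_nmi = pairwise_nmi(real_df)
# synth_nmi = pairwise_nmi(synthetic_df)
# (real_nmi - synth_nmi) highlights attribute pairs whose association changed.
```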
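The evaluation protocol, training each classifier on either real or synthetic data and always scoring it on held-out real data over 10 ShuffleSplit iterations, can be sketched as follows. The classifier settings mirror those listed above; the weighted averaging for precision, recall, and F1 and the assumption of already one-hot-encoded feature matrices are illustrative choices not specified in the summary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Classifier configurations as reported in the methodology.
CLASSIFIERS = {
    "SGD": lambda: SGDClassifier(loss="hinge"),
    "DT": lambda: DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=0),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=10, leaf_size=30, weights="uniform",
                                        p=2, metric="minkowski", n_jobs=2),
    "RF": lambda: RandomForestClassifier(criterion="gini", max_depth=10,
                                         min_samples_split=2, n_estimators=10,
                                         random_state=1),
    "SVM": lambda: SVC(kernel="rbf", C=1.0, degree=3, probability=True),
}

def evaluate(X_real, y_real, synthetic=None, n_splits=10, test_size=0.25):
    """Benchmark or synthetic-trained evaluation over ShuffleSplit iterations.

    If `synthetic` is None, models are trained on the real training split
    (benchmark); otherwise they are trained on the (X_syn, y_syn) pair in
    `synthetic` and still tested on the real test split, mirroring the
    deployment scenario described in the methodology.
    """
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=0)
    results = {name: [] for name in CLASSIFIERS}
    for train_idx, test_idx in splitter.split(X_real):
        if synthetic is None:
            X_train, y_train = X_real[train_idx], y_real[train_idx]
        else:
            X_train, y_train = synthetic
        X_test, y_test = X_real[test_idx], y_real[test_idx]
        for name, make_model in CLASSIFIERS.items():
            pred = make_model().fit(X_train, y_train).predict(X_test)
            results[name].append({
                "accuracy": accuracy_score(y_test, pred),
                "precision": precision_score(y_test, pred, average="weighted"),
                "recall": recall_score(y_test, pred, average="weighted"),
                "f1": f1_score(y_test, pred, average="weighted"),
            })
    return results
```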
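Two of the SDC measures, unique record removal and smoothing of numeric attributes, are post-generation steps that might be implemented roughly as below (the minimum leaf size is instead a parameter of the CART synthesis itself). The Silverman-style bandwidth and the pandas-based exact matching are illustrative assumptions, not the procedure used by synthpop.

```python
import numpy as np
import pandas as pd

def remove_unique_matches(synthetic: pd.DataFrame, real: pd.DataFrame) -> pd.DataFrame:
    """Drop synthetic records that exactly reproduce a unique real record.

    Assumes both frames share the same columns and compatible dtypes.
    """
    unique_real = real.drop_duplicates(keep=False)  # records occurring once in the real data
    merged = synthetic.merge(unique_real, how="left", indicator=True)
    keep_mask = (merged["_merge"] == "left_only").to_numpy()
    return synthetic[keep_mask]

def smooth_numeric(synthetic: pd.DataFrame, columns, bandwidth_factor=1.0) -> pd.DataFrame:
    """Gaussian kernel smoothing of numeric synthetic attributes.

    Adds Gaussian noise scaled by a Silverman-style bandwidth, a simple
    stand-in for kernel density smoothing (an assumption, not the exact
    smoothing implemented in synthpop).
    """
    rng = np.random.default_rng(0)
    smoothed = synthetic.copy()
    for col in columns:
        x = smoothed[col].to_numpy(dtype=float)
        bw = bandwidth_factor * 1.06 * x.std() * len(x) ** (-1 / 5)  # Silverman's rule of thumb
        smoothed[col] = x + rng.normal(0.0, bw, size=len(x))
    return smoothed
```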
Key Findings
- Overall performance: Across 19 datasets and three synthesis methods (57 synthetic datasets), 92% (263/285) of synthetic-trained models had lower accuracy than real-trained models when both were tested on real data.
- Magnitude of accuracy differences (mean absolute difference across all synthetic methods): SVM 0.058 (5.8%), SGD 0.064 (6.4%), KNN 0.072 (7.2%), RF 0.177 (17.7%), DT 0.193 (19.3%). The pattern was consistent within each synthesizer, with larger deviations for DT and RF than for SGD, KNN, and SVM.
- Other metrics: Precision, recall, and F1 generally decreased for synthetic-trained models, with larger decreases and variance for DT and RF; Bayesian network–generated data exhibited more variance in these metrics than CART or parametric synthesis.
- Winning classifier alignment: With all five classifiers considered, the winning classifier trained on synthetic data matched the real-data winning classifier in 26% (5/19) of datasets for both CART and parametric synthesis, and 21% (4/19) for Bayesian network synthesis. Tree-based methods were the winning classifiers on real data in 95% (18/19) of cases, but less frequently when trained on synthetic data. Removing DT increased matches to 53% (10/19) on average; removing both DT and RF further increased matches to 74% (CART), 53% (parametric), and 68% (Bayesian network). Chi-square tests showed significant increases in matches when tree-based models were removed for the CART and Bayesian network synthesizers (p≈0.009–0.0094); a sketch of this test appears after this list.
- Relationship preservation: Pairwise mutual information heatmaps showed that CART and parametric synthesis slightly reduced correlations in several mainly numerical datasets (e.g., C–G, I–K, S), while Bayesian network synthesis sometimes increased correlations (e.g., E–G, I–L, N, P–S). Categorical-heavy datasets generally preserved relationships better.
- Impact of SDC: Applying SDC (smoothing, unique record removal, minimum leaf size) typically caused small additional decreases in accuracy, especially for DT and RF. Mean absolute accuracy differences with SDC applied (SGD, DT, KNN, RF, SVM): smoothing 0.059, 0.190, 0.094, 0.177, 0.060; unique record removal 0.052, 0.206, 0.072, 0.184, 0.056; minimum leaf size 0.061, 0.200, 0.068, 0.180, 0.053. Winning-classifier match rates with SDC applied were similar to the no-SDC case (overall 25%); removing tree-based models increased match percentages (e.g., up to 54% overall). Chi-square tests indicated a significant increase in matches when removing tree-based models for the unique record removal SDC condition (p=0.003).
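The significance tests cited above compare how often the synthetic-trained winning classifier matches the real-trained one before and after excluding tree-based models. A minimal sketch with scipy is given below; the 2x2 contingency layout and the counts (5/19 matches with all five classifiers versus 14/19 with DT and RF removed, loosely reflecting the CART condition) are illustrative assumptions rather than the paper's exact table.

```python
from scipy.stats import chi2_contingency

# Rows: classifier set considered (all five vs. tree-based models removed).
# Columns: winning classifier matched vs. did not match across the 19 datasets.
# Counts are illustrative only; chi2_contingency applies Yates' continuity
# correction by default for 2x2 tables.
contingency = [
    [5, 14],   # all five classifiers:  5 matches, 14 mismatches
    [14, 5],   # DT and RF removed:    14 matches,  5 mismatches
]
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")
```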
Discussion
The findings address the central question of whether synthetic data can reliably substitute for real data in training supervised models intended for deployment on real health data. Results indicate a small, manageable degradation in performance for most non-tree-based models (SVM, SGD, KNN), suggesting synthetic data can be a practical proxy for model development and method selection in many cases. However, tree-based models (DT, RF) show greater sensitivity to synthesis, with larger drops in accuracy and more variable precision/recall/F1; consequently, the best-performing classifier identified using synthetic training data often does not align with the best classifier identified using real data when tree-based algorithms dominate. This discrepancy diminishes substantially when tree-based methods are excluded, implying that synthesis methods may distort decision boundaries or dependency structures exploited by trees, especially in datasets with many numerical features. The minimal additional utility loss from SDC measures indicates these privacy reinforcements can be applied without substantial further harm to model performance beyond synthesis itself. Collectively, the results support the feasibility of a pipeline wherein synthetic data are shared externally for exploratory analysis and model development, with final validation and calibration on real data within secure environments. This approach can accelerate analytics, broaden participation, and respect privacy constraints, provided stakeholders account for expected small performance gaps and exercise caution with tree-based models.
Conclusion
Synthetic data can serve as a viable proxy for real health datasets in supervised machine learning, with small, consistent decreases in accuracy for most models when evaluated on real data. While tree-based classifiers exhibit greater sensitivity to synthesis, non-tree methods maintain relatively close performance to real-trained baselines. The study provides empirical baselines for expected performance differences and demonstrates that rules-based SDC measures do not notably diminish utility beyond synthesis. A practical data-sharing pipeline is proposed to enable external model development on synthetic data with secure validation on real data by health departments. Future work should improve synthesis robustness for tree-based learning, perform broader evaluations across additional algorithms (including hyperparameter optimization and unsupervised methods), incorporate more and larger real-world health datasets, and systematically quantify disclosure risk alongside utility trade-offs.
Limitations
The study is limited to five supervised learning algorithms with fixed hyperparameters, without extensive hyperparameter optimization. The evaluation focuses on classification tasks and open health datasets rather than proprietary, large-scale health department data. The analysis does not comprehensively quantify disclosure risk; only certain rules-based SDC techniques were examined, and Bayesian network–generated data did not undergo SDC. Broader method coverage (including unsupervised learning and additional model families), more diverse datasets, and explicit utility–risk trade-off analyses are needed.