This study compares the performance of supervised machine learning models trained on synthetic healthcare data with that of models trained on real data. Nineteen open health datasets were used, and synthetic data were generated using three methods: classification and regression trees (CART), parametric methods, and Bayesian networks. Five machine learning models were trained and tested. Results showed that 92% of models trained on synthetic data had lower accuracy than those trained on real data, although the differences were often small. Tree-based models were more sensitive to synthetic data than the other model types. Statistical disclosure control methods did not significantly affect data utility. The study highlights the potential of synthetic data while noting the need for further evaluation of its robustness and the importance of preserving both individual privacy and data utility.
Publisher
JMIR Medical Informatics
Published On
Jul 20, 2020
Authors
Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, Gorka Epelde
Tags
synthetic data
machine learning
healthcare
data accuracy
privacy
data utility