Abstract
This study investigates the performance of supervised machine learning models trained on synthetic healthcare data compared to those trained on real data. Nineteen open health datasets were used, and synthetic data was generated using three methods: classification and regression trees (CART), parametric, and Bayesian networks. Five machine learning models were trained and tested. Results showed that 92% of models trained on synthetic data had lower accuracy than those trained on real data, although the differences were often small. Tree-based models showed more sensitivity to synthetic data. Statistical disclosure control methods did not significantly impact data utility. The study highlights the potential of synthetic data while noting the need for further evaluation of its robustness and the importance of preserving both individual privacy and data utility.
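The comparison described in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline. It assumes a toy two-class dataset in place of a real health dataset, a simplified parametric generator (independent Gaussians fitted per class) in place of the CART, parametric, and Bayesian network methods, and a nearest-centroid classifier in place of the five models. It also assumes both models are scored on the same held-out real records. All function names are hypothetical.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy two-class data standing in for a real health dataset (assumption)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Two numeric features whose means shift with the class label.
        x1 = rng.gauss(2.0 * label, 1.0)
        x2 = rng.gauss(-1.5 * label, 1.0)
        rows.append(((x1, x2), label))
    return rows


def fit_parametric(rows):
    """Estimate per-class count, feature means, and feature standard deviations."""
    params = {}
    for label in {y for _, y in rows}:
        cols = list(zip(*[x for x, y in rows if y == label]))
        params[label] = {
            "n": len(cols[0]),
            "mean": [statistics.fmean(c) for c in cols],
            "std": [statistics.stdev(c) for c in cols],
        }
    return params


def sample_synthetic(params, n, rng):
    """Draw synthetic rows: pick a class by its frequency, then sample each
    feature from an independent Gaussian (a simplification, not the paper's method)."""
    labels = list(params)
    weights = [params[label]["n"] for label in labels]
    rows = []
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        p = params[label]
        x = tuple(rng.gauss(m, s) for m, s in zip(p["mean"], p["std"]))
        rows.append((x, label))
    return rows


def train_centroids(rows):
    """Nearest-centroid 'model': one mean vector per class."""
    centroids = {}
    for label in {y for _, y in rows}:
        cols = list(zip(*[x for x, y in rows if y == label]))
        centroids[label] = [statistics.fmean(c) for c in cols]
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def accuracy(centroids, rows):
    """Fraction of rows whose predicted class matches the true label."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
real = make_real_data(600, rng)
train_real, test_real = real[:400], real[400:]

# The generator only ever sees the real training split.
synthetic = sample_synthetic(fit_parametric(train_real), len(train_real), rng)

# Both models are scored on the same held-out real records.
acc_real = accuracy(train_centroids(train_real), test_real)
acc_synth = accuracy(train_centroids(synthetic), test_real)
print(f"trained on real: {acc_real:.3f}  trained on synthetic: {acc_synth:.3f}")
```

Scoring both models on the same real test split keeps the comparison about training data alone. A fuller version would repeat this across datasets, generators, and model types before drawing any conclusion about the size of the gap.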
Publisher
JMIR Medical Informatics
Published On
Jul 20, 2020
Authors
Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, Gorka Epelde
Tags
synthetic data
machine learning
healthcare
data accuracy
privacy
data utility