Abstract
Synthetic data offers a solution to the scarcity of high-quality data for training large language models (LLMs). While prior studies show that it can improve LLM performance, this paper analyzes inherent flaws in synthetic data, focusing on question-answer (Q-A) pairs. The uniform format of synthetic data can lead to pattern overfitting and output distribution shifts, hindering instruction-following capabilities. The authors propose a mitigation method based on unlearning techniques and demonstrate that it reverses these negative effects at low cost.
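The abstract names the mitigation only at a high level. For concreteness, below is a minimal sketch of one common unlearning recipe consistent with that description: gradient ascent on the synthetic Q-A loss, bounded so the model is not destabilized, plus a KL penalty that anchors the model to a frozen reference copy. Everything here (unlearning_step, forget_weight, kl_weight, loss_bound, the batch format) is an assumption for illustration, not code or settings from the paper.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, ref_model, synthetic_batch, retain_batch,
                    optimizer, forget_weight=0.1, kl_weight=1.0,
                    loss_bound=5.0):
    """One hypothetical unlearning update: push the model away from
    memorized synthetic Q-A patterns while a KL penalty keeps its output
    distribution close to a frozen reference model on ordinary data.

    Batches are assumed to be Hugging Face style causal-LM inputs
    (input_ids, attention_mask, labels); all hyperparameters are
    illustrative, not values from the paper.
    """
    # Gradient ascent on the synthetic data: the negated term raises the
    # loss on synthetic Q-A pairs. Clamping bounds the ascent, so once the
    # synthetic loss exceeds loss_bound its gradient vanishes and the model
    # is not pushed arbitrarily far (a bounded forgetting objective).
    forget_loss = -torch.clamp(model(**synthetic_batch).loss, max=loss_bound)

    # KL penalty on ordinary (retain) data anchors the post-unlearning
    # output distribution to the frozen reference model, limiting
    # collateral damage to general instruction-following ability.
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    cur_logits = model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(cur_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")

    loss = forget_weight * forget_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, the "low cost" the abstract claims would come from running only a small number of extra gradient steps on already-trained weights rather than retraining from scratch.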
Authors
Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
Tags
synthetic data
large language models
question-answer pairs
pattern overfitting
unlearning techniques
instruction-following
performance improvement