Synthetic data offers a way to address the scarcity of high-quality training data for large language models (LLMs). While prior studies show that it can improve LLM performance, this paper analyzes inherent flaws in synthetic data, focusing on question-answer (Q-A) pairs: their uniform format can lead to pattern overfitting and shifts in the model's output distribution, degrading instruction-following ability. The authors propose an unlearning-based mitigation method and demonstrate that it reverses these negative effects at low training cost.
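To make the mitigation idea concrete, below is a minimal sketch of one common unlearning recipe: gradient ascent on the data to be forgotten (here, synthetic Q-A text), balanced by a standard language-modeling loss on retained data so the model's general distribution stays anchored. This is an illustrative baseline, not the paper's exact objective (the authors use a more carefully designed forgetting loss); the model name, `forget_weight`, and helper functions are placeholders.

```python
# Hypothetical sketch of unlearning-based mitigation. Assumptions:
# "gpt2" stands in for the actual LLM, and the loss weighting is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper works with larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def lm_loss(texts):
    """Standard next-token cross-entropy loss over a batch of texts."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**batch, labels=batch["input_ids"])
    return outputs.loss

def unlearning_step(synthetic_qa_texts, retained_texts, forget_weight=0.1):
    """One step: ascend on synthetic Q-A data (forget the uniform pattern),
    descend on retained data (preserve general capability)."""
    optimizer.zero_grad()
    # Negating the loss on synthetic data turns descent into ascent,
    # pushing the model away from the overfitted Q-A format.
    forget = -forget_weight * lm_loss(synthetic_qa_texts)
    # A standard loss on retained data counteracts distribution drift.
    retain = lm_loss(retained_texts)
    (forget + retain).backward()
    optimizer.step()
```

Note that unweighted gradient ascent is unbounded and can destroy the model if run too long, which is why the forgetting term is down-weighted here and why published methods typically bound or regularize it.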
Authors
Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
Tags
synthetic data
large language models
question-answer pairs
pattern overfitting
unlearning techniques
instruction-following
performance improvement