Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models
Computer Science

J. Chen, Y. Zhang, et al.

This research by Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, and Weipeng Chen examines flaws in the synthetic question-answer data used to train large language models. It shows how these flaws can degrade instruction-following ability and proposes an unlearning technique that mitigates the degradation at relatively low cost.
Abstract
Synthetic data has been proposed to alleviate high-quality data scarcity for training large language models (LLMs). While prior work shows synthetic data can improve downstream benchmark performance, the authors identify inherent flaws: uniform formats in synthetic data induce pattern overfitting and significant output distribution shifts that degrade instruction-following ability. Focusing on synthetic question–answer (Q-A) pairs, the paper analyzes these flaws and introduces an unlearning-based mitigation strategy using a controllable lower-bounded forgetting loss. Experiments show the method reverses instruction-following degradation caused by pattern overfitting without sacrificing benchmark performance and at relatively low cost, offering guidance for robust and efficient LLM training with synthetic data.
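The abstract does not state the form of the lower-bounded forgetting loss, so the following is only an illustrative sketch of why a lower bound matters, not the authors' formulation. It assumes a hypothetical per-token loss of -log(1 - p), where p is the probability the model assigns to a token it should forget, and compares it with the negated likelihood loss that plain gradient-ascent unlearning would minimize. The function names and the `eps` clamp are assumptions introduced for this example.

```python
import math


def unbounded_forgetting_loss(token_probs):
    """Negated NLL (gradient-ascent style): mean of log(p).

    Minimizing this pushes p toward 0, and the loss has no lower bound:
    it tends to -infinity as p -> 0.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def bounded_forgetting_loss(token_probs, eps=1e-8):
    """Hypothetical lower-bounded alternative: mean of -log(1 - p).

    This is an assumed form for illustration only. It is never negative
    and reaches 0 once p reaches 0, so optimization has a natural floor.
    The eps clamp keeps the loss finite when p is exactly 1.
    """
    return sum(-math.log(max(1.0 - p, eps)) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    for probs in ([0.9, 0.5], [1e-3, 1e-6], [1e-12]):
        print(probs,
              "unbounded:", unbounded_forgetting_loss(probs),
              "bounded:", bounded_forgetting_loss(probs))
```

As the token probabilities shrink, the unbounded variant keeps decreasing without limit, while the bounded variant flattens out near zero. That floor is the kind of property that would let a forgetting objective be controlled rather than run away.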
Authors
Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
Tags
synthetic data, large language models, question-answer pairs, pattern overfitting, unlearning techniques, instruction-following, performance improvement