Abstract
Synthetic data offers a solution to the scarcity of high-quality data for training large language models (LLMs). While prior studies show that it can improve LLM performance, this paper analyzes inherent flaws in synthetic data, focusing on question-answer (Q-A) pairs. The uniform format of synthetic data can lead to pattern overfitting and output distribution shifts, hindering instruction-following capabilities. The authors propose a mitigation method based on unlearning techniques and demonstrate that it reverses these negative effects at low cost.
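The abstract names the mitigation only at a high level. For concreteness, below is a minimal sketch of one common unlearning recipe consistent with that description: gradient ascent on the synthetic Q-A loss, bounded so the model is not destabilized, plus a KL penalty that anchors the model to a frozen reference copy. Everything here (unlearning_step, forget_weight, kl_weight, loss_bound, the batch format) is an assumption for illustration, not code or settings from the paper.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, ref_model, synthetic_batch, retain_batch,
                    optimizer, forget_weight=0.1, kl_weight=1.0,
                    loss_bound=5.0):
    """One hypothetical unlearning update: push the model away from
    memorized synthetic Q-A patterns while a KL penalty keeps its output
    distribution close to a frozen reference model on ordinary data.

    Batches are assumed to be Hugging Face style causal-LM inputs
    (input_ids, attention_mask, labels); all hyperparameters are
    illustrative, not values from the paper.
    """
    # Gradient ascent on the synthetic data: the negated term raises the
    # loss on synthetic Q-A pairs. Clamping bounds the ascent, so once the
    # synthetic loss exceeds loss_bound its gradient vanishes and the model
    # is not pushed arbitrarily far (a bounded forgetting objective).
    forget_loss = -torch.clamp(model(**synthetic_batch).loss, max=loss_bound)

    # KL penalty on ordinary (retain) data anchors the post-unlearning
    # output distribution to the frozen reference model, limiting
    # collateral damage to general instruction-following ability.
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    cur_logits = model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(cur_logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")

    loss = forget_weight * forget_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, the "low cost" the abstract claims would come from running only a small number of extra gradient steps on already-trained weights rather than retraining from scratch.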
Authors
Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, Weipeng Chen
Tags
synthetic data
large language models
question-answer pairs
pattern overfitting
unlearning techniques
instruction-following
performance improvement