Introduction
Large Language Models (LLMs) have achieved remarkable success, largely due to the quality and diversity of their training datasets. However, acquiring large amounts of high-quality data is difficult because of scarcity, privacy concerns, and high costs. Synthetic data, generated algorithmically rather than collected from real-world sources, offers a promising solution: it can be produced at scale and can supplement areas where real-world data is lacking, such as mathematical or reasoning tasks. Many studies have demonstrated its efficacy in improving model performance, with common approaches including generating synthetic Q-A pairs and data that mimics downstream tasks, achieving impressive results on various benchmarks. Despite these positive findings, this research observes that models pre-trained on synthetic Q-A pairs and then put through supervised fine-tuning (SFT) show a decline in instruction-following capability. This prompts an investigation into the flaws of synthetic data, particularly data generated by advanced models such as GPT-4, an area largely unexplored in previous research. The authors hypothesize that the uniform structure of synthetic data leads to pattern overfitting and, consequently, to shifts in the model's output distribution, impairing its ability to follow instructions effectively.
Literature Review
Existing literature extensively covers the use of synthetic data to improve LLM performance on downstream benchmarks. Studies highlight successful applications, such as Microsoft's Phi-3 model and MagicoderS-CL-7B, that demonstrate significant performance gains. However, the literature also acknowledges limitations. One concern is model collapse, where training on model-generated data erodes the diversity of the original data distribution. The concept of Model Autophagy Disorder (MAD) highlights the risk of amplifying artifacts and biases through recursive use of synthetic data, and task-specific synthetic data often suffers from a lack of diversity and regional biases. In the area of LLM unlearning, research focuses on removing undesired targets while maintaining overall performance; strategies involve removing specific data points or higher-level concepts. Effective unlearning requires robustness, generalization, and manageable computational cost. Various fine-tuning techniques are employed, including gradient ascent, parameter-efficient fine-tuning, and KL-divergence-based methods. While unlearning has been used to handle harmful data, its application to mitigating the effects of synthetic data in LLMs remains relatively under-explored.
Methodology
The researchers employed a comprehensive experimental design involving diverse datasets, model configurations, and evaluation benchmarks. The datasets included:

* **NonSynth data:** A large non-synthetic dataset drawn from diverse sources (webpages, books, research papers, codebases).
* **SynthQA data:** Synthetic Q-A pairs generated by GPT-4, covering various topics.
* **MixedIns data:** Instructions on general knowledge, mathematics, and coding, generated by GPT-4 and human contributors.
* **U33B data:** A synthetic dataset of diverse reasoning paths generated from the GSM8K dataset.
* **OpenHermes-2.5 data:** A dataset consisting primarily of synthetically generated instruction and chat samples.

Two models were used:

* **BaseLM:** A Llama-like 2B-parameter model trained from scratch on one trillion tokens.
* **BaseLM-Chat:** Chat models obtained by performing SFT on BaseLM using MixedIns or OpenHermes-2.5 data.

The benchmarks evaluated bilingual capabilities (MMLU, CMMLU, C-Eval), coding proficiency (HumanEval, MBPP), mathematical reasoning (GSM8K), and instruction-following capabilities (FollowBench, MT-Bench).

The defect analysis examined data distribution differences and pattern overfitting. t-SNE visualization and kernel density estimation (KDE) were used to compare the distributions of NonSynth and SynthQA data, and perplexity was calculated to assess the effect of synthetic data on output distributions (a sketch of such a comparison appears after this section).

The proposed unlearning strategy aimed to mitigate the negative effects of synthetic data. It used a three-part loss function (an illustrative implementation follows below):

* **Lower-Bounded Forgetting Loss:** To forget the biased distribution of synthetic data.
* **Replay Loss:** To retain memory of historical knowledge.
* **Bias Mitigation Loss:** To prevent excessive changes in the output distribution on non-synthetic data.

Experiments involved continued pre-training with and without synthetic data, followed by application of the unlearning strategy to evaluate its effectiveness in restoring instruction-following capabilities while preserving benchmark performance. Ablation studies assessed the impact of the unlearning strategy and the role of synthetic data in model performance.
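The summary does not give implementation details for the distribution analysis, so the following is a minimal sketch of how such a t-SNE/KDE comparison might look. The sentence encoder (`all-MiniLM-L6-v2`), sample handling, and the use of `gaussian_kde` are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: comparing synthetic vs. non-synthetic text distributions with
# t-SNE and kernel density estimation (KDE). Encoder and sample sizes
# here are illustrative assumptions, not the paper's actual pipeline.
import numpy as np
from sklearn.manifold import TSNE
from scipy.stats import gaussian_kde
from sentence_transformers import SentenceTransformer

def compare_distributions(nonsynth_texts, synthqa_texts, seed=0):
    # Embed both corpora with an off-the-shelf sentence encoder (assumed).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(nonsynth_texts + synthqa_texts)

    # Project embeddings to 2D with t-SNE for visualization.
    xy = TSNE(n_components=2, random_state=seed).fit_transform(emb)
    n = len(nonsynth_texts)
    nonsynth_xy, synth_xy = xy[:n], xy[n:]

    # Fit a KDE per group; tight, concentrated density for the synthetic
    # group relative to the non-synthetic group would indicate the more
    # uniform patterns the paper describes.
    kde_non = gaussian_kde(nonsynth_xy.T)
    kde_syn = gaussian_kde(synth_xy.T)
    return nonsynth_xy, synth_xy, kde_non, kde_syn
```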
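The paper's exact loss formulation is not reproduced in this summary; below is a minimal PyTorch sketch of how the three terms could be combined. The clamping form of the lower bound, the KL-divergence bias-mitigation term, and the weights `w_forget`, `w_replay`, `w_bias` are assumptions for illustration, not the authors' definitive method.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(model, ref_model, synth_batch, replay_batch, nonsynth_batch,
                    lower_bound=1.0, w_forget=1.0, w_replay=1.0, w_bias=1.0):
    """Illustrative three-part unlearning objective (assumed form).

    Each *_batch is assumed to be a dict with input_ids/attention_mask/labels
    suitable for a Hugging Face causal LM; ref_model is a frozen copy of the
    model taken before unlearning.
    """
    # Forgetting loss on synthetic data: negated LM loss (gradient ascent),
    # clamped from below so it cannot decrease without bound. Once the LM
    # loss on synthetic data exceeds `lower_bound`, the gradient vanishes
    # and forgetting stops, avoiding runaway divergence.
    synth_loss = model(**synth_batch).loss
    forget = torch.clamp(-synth_loss, min=-lower_bound)

    # Replay loss: ordinary next-token prediction on historical
    # (non-synthetic) data to retain prior knowledge.
    replay = model(**replay_batch).loss

    # Bias-mitigation loss: KL divergence between current and reference
    # output distributions on non-synthetic data, discouraging drift.
    logits = model(**nonsynth_batch).logits
    with torch.no_grad():
        ref_logits = ref_model(**nonsynth_batch).logits
    bias = F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(ref_logits, dim=-1),
                    reduction="batchmean")

    return w_forget * forget + w_replay * replay + w_bias * bias
```

The clamp is one simple way to realize a "lower-bounded" forgetting term; it keeps the gradient-ascent pressure active only until the synthetic-data loss reaches the chosen bound.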
Key Findings
The analysis revealed significant differences in token distribution between synthetic and real-world data. Synthetic data exhibited more uniform patterns, leading to pattern overfitting in LLMs. This overfitting caused a decline in instruction-following capabilities after SFT, despite improved benchmark performance during pre-training. The authors also observed a shift and reduced variance in the perplexity distribution of models trained with synthetic data, a phenomenon resembling model collapse. The proposed unlearning strategy effectively mitigated these issues: the lower-bounded forgetting loss proved superior to traditional gradient ascent, which caused significant performance degradation through uncontrolled loss growth, and the unlearning process partially corrected the output distribution shift, improving instruction-following while preserving foundational abilities. Even with only 1 billion tokens used for unlearning, the resulting model (UnlearnLM-Chat) outperformed a model trained on 300 billion tokens of non-synthetic data (RefineLM-Chat) in instruction-following. Ablation studies confirmed the crucial role of synthetic data in improving base-model performance and validated the effectiveness of the unlearning strategy on other types of synthetic data (U33B).
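For concreteness, here is a small sketch of how per-sample perplexity distributions of the kind described above could be computed with Hugging Face transformers. The model name is a placeholder, not the study's 2B BaseLM; comparing the resulting histograms before and after synthetic-data training would surface the shift and variance reduction the authors report.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def sample_perplexities(texts, model_name="gpt2", device="cpu"):
    # Model choice is illustrative; the paper evaluates its own BaseLM.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

    ppls = []
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True).input_ids.to(device)
        # Mean cross-entropy over next-token predictions; exp() gives
        # the per-sample perplexity.
        loss = model(ids, labels=ids).loss
        ppls.append(loss.exp().item())
    # A shifted mean or shrunken variance of this distribution after
    # synthetic-data training is the collapse-like signal described above.
    return ppls
```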
Discussion
The findings highlight the critical need for a deeper understanding of the interplay between synthetic data and LLM training. While synthetic data can significantly boost performance on certain benchmarks, it can simultaneously impair other crucial aspects, such as instruction-following. The success of the proposed unlearning strategy demonstrates that the negative impacts of synthetic data are not insurmountable. The carefully designed loss function, particularly the lower-bounded forgetting loss, is key to effectively counteracting the overfitting caused by the simplified patterns in synthetic data. The results suggest a practical pathway for harnessing the benefits of synthetic data while mitigating its limitations, thereby improving the robustness and efficiency of LLM training.
Conclusion
This research identifies significant flaws in synthetic data, particularly in synthetic Q-A pairs, that negatively affect instruction-following capabilities in LLMs. A novel unlearning-based strategy, using a lower-bounded forgetting loss, effectively mitigates these issues, improving instruction-following while maintaining performance on various benchmarks. Future research should explore the scalability of this method to larger models and investigate its applicability to other forms of synthetic data. Further investigation into advanced unlearning techniques to minimize side effects is also warranted.
Limitations
While the unlearning strategy demonstrates effectiveness, some limitations remain. It may still degrade certain model capabilities, such as mathematical reasoning, and its scalability to very large models needs further testing. The study focuses primarily on Q-A pair synthetic data, so generalizability to other types of synthetic data requires further investigation. The GPT-4-generated synthetic data used here may not represent the full range of synthetic data quality, and variations in generation methods might affect the effectiveness of the proposed strategy. Improving computational efficiency and minimizing side effects are areas for future work.