Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models

This research by Jie Chen, Yupeng Zhang, Bingning Wang, Wayne Xin Zhao, Ji-Rong Wen, and Weipeng Chen explores the flaws in synthetic question-answer data used for training large language models. It shows how these flaws can degrade instruction-following performance and proposes an unlearning-based technique that mitigates these effects effectively and at low cost.

Introduction
The study investigates how synthetic data, especially GPT-4-generated question-answer (Q-A) pairs, affects LLM performance beyond its known benefits on benchmarks. The central observation is that continued pretraining on synthetic Q-A pairs followed by supervised fine-tuning (SFT) yields substantial declines in instruction-following capabilities, despite improvements in base benchmark scores. The authors hypothesize that distributional mismatch and uniform structural patterns in synthetic data cause pattern overfitting and output distribution shifts. This work analyzes these flaws, measures their impact, and proposes an unlearning-based mitigation strategy that restores instruction following while preserving gains on benchmarks. The study fills a gap in understanding how synthetic data generated by advanced models (e.g., GPT-4) can negatively affect instruction following, and offers a practical remedy.
Literature Review
Related work on synthetic data shows strong gains on benchmarks (e.g., Phi-3, Magicoder), while highlighting risks such as model collapse when training on model-generated data (Shumailov et al., Seddik et al.) and artifacts amplified through recursive use (Model Autophagy Disorder, Alemohammad et al.). Task-specific synthetic data can lack diversity and exhibit biases; effectiveness varies by task. LLM unlearning has targeted removal of specific data, harmful content, or concepts via gradient ascent, parameter-efficient methods, or KL-based approaches, balancing performance and compute. However, unlearning has been underexplored as a tool to mitigate flaws introduced by synthetic pretraining data. This work applies unlearning to counteract distributional bias and pattern overfitting from synthetic Q-A data while maintaining model capabilities.
Methodology
Models and training setup: The authors train a Llama-like 2B-parameter BaseLM from scratch on 1T tokens with a cosine learning-rate schedule (learning rate 1e-4). Architecture: RoPE positional embeddings; hidden size 2048; FFN size 5504; 32 attention heads; 32 layers; context length 4096. Chat models (BaseLM-Chat) are produced via SFT on MixedIns or OpenHermes-2.5 with LR 2e-5, 2 epochs, context length 4096, batch size 64.

Datasets: NonSynth (diverse real data from webpages, books, research papers, and codebases); SynthQA (GPT-4-generated Q-A pairs across math, coding, and general knowledge); MixedIns (instructions from GPT-4 and humans covering general knowledge, math, and coding); U33B (synthetic reasoning paths for GSM8K); OpenHermes-2.5 (synthetic instruction/chat data). Continued pretraining mixes 2% SynthQA into 300B tokens of data to obtain SynthLM.

Benchmarks: bilingual/general knowledge (MMLU, CMMLU, C-Eval), coding (HumanEval, MBPP), math (GSM8K), and instruction following (FollowBench with SSR/HSR, MT-Bench).

Defect analysis: To assess distribution mismatch, they sample 2,000 entries each from NonSynth and SynthQA, embed them using BaseLM's last hidden state, and visualize with t-SNE, showing considerable non-overlap (distributional differences). For simplified patterns, they compute token-ID frequency KDEs using BaseLM's tokenizer, finding distinct small peaks in SynthQA driven by structural tokens such as "question", "answer", and "summary", indicating repetitive patterns. (Sketches of both analyses appear after this section.)

Pattern overfitting and output shift: They compute perplexity distributions (KDE) of OpenHermes-2.5 and MixedIns under BaseLM vs. SynthLM. SynthLM exhibits a shifted distribution with reduced variance, consistent with overfitting/collapse-like behavior, which correlates with degraded instruction following after SFT (see the perplexity sketch below).

Unlearning strategy: They propose a total unlearning loss combining (1) a lower-bounded forgetting loss that reduces reliance on biased synthetic patterns by inverting prediction probabilities in the cross-entropy, preventing unbounded negative loss magnitudes; (2) a replay loss on non-synthetic training data to retain knowledge; and (3) a KL-based bias-mitigation loss between the current and original models on replay data to preserve output distributions. The total loss is a weighted sum with weights w_fgt, w_rpy, and w_mtn (a code sketch follows this section).

Unlearning experiments: Starting from SynthLM (continued pretraining on 300B tokens with 2% SynthQA, LR 5e-5), they apply unlearning using 1B tokens from SynthQA for forgetting with LR 5e-5 and weights w_fgt=0.01, w_rpy=1, w_mtn=1, alongside replay and KL mitigation on NonSynth data, producing UnlearnLM. They also evaluate a baseline remedy (RefineLM): further training SynthLM on 300B tokens of NonSynth (LR 5e-5) to recover instruction following.

Ablations: (a) use U33B (2% of 40B tokens) to test generality, comparing SynthLM*(U33B) vs. UnlearnLM*(U33B) after SFT on OpenHermes-2.5; (b) attribute gains to synthetic vs. non-synthetic data by training MixSynthLM (40B tokens with 2% SynthQA) vs. NonSynthLM (40B NonSynth only); (c) compare the lower-bounded forgetting loss with traditional gradient ascent for unlearning.

Evaluation and analysis: They report base-model benchmarks (C-Eval, CMMLU, MMLU, HumanEval, MBPP, GSM8K, and their average) and chat-model metrics (FollowBench SSR/HSR, MT-Bench, plus the base benchmarks) across BaseLM/BaseLM-Chat, SynthLM/SynthLM-Chat, RefineLM/RefineLM-Chat, and UnlearnLM/UnlearnLM-Chat. They also assess perplexity distribution shifts before and after unlearning.
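To make the distribution-mismatch analysis concrete, here is a minimal sketch of the t-SNE step, assuming a Hugging Face-style model and tokenizer. The pooling choice (mean over the last hidden state) and the handles base_lm, tok, non_synth_samples, and synth_qa_samples are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def embed(texts, model, tokenizer, device="cuda"):
    """One vector per document: mean-pooled last hidden state (pooling assumed)."""
    vecs = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=4096).to(device)
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
        vecs.append(hidden.mean(dim=1).squeeze(0).float().cpu().numpy())
    return np.stack(vecs)

# non_synth_samples / synth_qa_samples: 2,000 texts each (loaders not shown)
non_vecs = embed(non_synth_samples, base_lm, tok)
syn_vecs = embed(synth_qa_samples, base_lm, tok)

pts = TSNE(n_components=2).fit_transform(np.concatenate([non_vecs, syn_vecs]))
plt.scatter(pts[:len(non_vecs), 0], pts[:len(non_vecs), 1], s=2, label="NonSynth")
plt.scatter(pts[len(non_vecs):, 0], pts[len(non_vecs):, 1], s=2, label="SynthQA")
plt.legend(); plt.title("t-SNE of BaseLM last-hidden-state embeddings"); plt.show()
```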
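The token-frequency analysis can be sketched the same way, as a KDE over the token IDs each corpus produces under BaseLM's tokenizer. This is a plausible reconstruction using the same hypothetical handles as above, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def token_id_density(texts, tokenizer):
    """Kernel density estimate over all token IDs observed in a corpus."""
    ids = np.concatenate([tokenizer(t).input_ids for t in texts])
    return gaussian_kde(ids.astype(float))

grid = np.linspace(0, tok.vocab_size, 2000)
plt.plot(grid, token_id_density(non_synth_samples, tok)(grid), label="NonSynth")
plt.plot(grid, token_id_density(synth_qa_samples, tok)(grid), label="SynthQA")
plt.xlabel("token ID"); plt.ylabel("density"); plt.legend(); plt.show()
# Isolated small peaks in the SynthQA curve correspond to structural tokens
# such as "question", "answer", and "summary".
```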
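The pattern-overfitting diagnosis compares per-sample perplexity distributions under BaseLM and SynthLM. A minimal sketch, assuming per-sample perplexity is the exponential of the mean next-token negative log-likelihood (the standard definition; the exact aggregation used in the paper is not specified here):

```python
import torch
import torch.nn.functional as F
from scipy.stats import gaussian_kde

@torch.no_grad()
def perplexity(text, model, tokenizer, device="cuda"):
    """Perplexity of one sample: exp of the mean next-token NLL under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    logits = model(ids).logits[:, :-1]          # predict token t+1 from prefix t
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          ids[:, 1:].reshape(-1))
    return nll.exp().item()

# Score the same instruction data (e.g., OpenHermes-2.5) under both models.
base_ppl = [perplexity(t, base_lm, tok) for t in openhermes_samples]
synth_ppl = [perplexity(t, synth_lm, tok) for t in openhermes_samples]
base_kde, synth_kde = gaussian_kde(base_ppl), gaussian_kde(synth_ppl)
# A shifted, narrower synth_kde is the overfitting/collapse-like signature.
```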
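Finally, the total unlearning objective is described as a weighted sum, L_total = w_fgt * L_fgt + w_rpy * L_rpy + w_mtn * L_mtn. Below is a minimal PyTorch sketch of one training step's loss. The inverted-probability form -log(1 - p_target) is one natural reading of "inverting prediction probabilities in cross-entropy" that keeps the forgetting loss bounded below by zero; it, and all function and batch names, are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(model, ref_model, forget_batch, replay_batch,
                    w_fgt=0.01, w_rpy=1.0, w_mtn=1.0):
    """Total loss = w_fgt * forgetting + w_rpy * replay + w_mtn * KL mitigation."""
    # (1) Lower-bounded forgetting loss on synthetic (SynthQA) token batches.
    # Gradient ascent would minimize -CE, which is unbounded below; instead,
    # penalize the probability of each target token via -log(1 - p) >= 0.
    logits = model(forget_batch).logits[:, :-1]
    targets = forget_batch[:, 1:]
    p_true = torch.softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    l_fgt = -torch.log((1.0 - p_true).clamp_min(1e-8)).mean()

    # (2) Replay loss: standard LM cross-entropy on non-synthetic (NonSynth) data.
    r_logits = model(replay_batch).logits[:, :-1]
    l_rpy = F.cross_entropy(r_logits.reshape(-1, r_logits.size(-1)),
                            replay_batch[:, 1:].reshape(-1))

    # (3) KL bias mitigation: keep the current output distribution close to
    # the original (pre-unlearning) model on the same replay data.
    with torch.no_grad():
        ref_logits = ref_model(replay_batch).logits[:, :-1]
    l_mtn = F.kl_div(F.log_softmax(r_logits, dim=-1),
                     F.log_softmax(ref_logits, dim=-1),
                     log_target=True, reduction="batchmean")

    return w_fgt * l_fgt + w_rpy * l_rpy + w_mtn * l_mtn
```

With the paper's weights (w_fgt=0.01, w_rpy=1, w_mtn=1), the small forgetting term nudges the model away from SynthQA's structural patterns while the replay and KL terms anchor retained knowledge and the original output distribution.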
Key Findings
- Synthetic Q-A data improves base benchmark performance but degrades instruction following after SFT:
  • Base-model gains: BaseLM avg 25.47 → SynthLM avg 32.74 across benchmarks; e.g., C-Eval 39.05 → 47.71, MMLU 38.08 → 47.27, HumanEval 9.76 → 18.90.
  • After SFT, instruction following declines for SynthLM-Chat vs. BaseLM-Chat: FollowBench SSR 39.95 → 38.29; HSR 27.58 → 24.00; MT-Bench 5.45 → 5.39, despite higher base benchmarks (e.g., C-Eval 39.92 → 49.50, GSM8K 14.33 → 22.21).
- Evidence of distribution mismatch and pattern overfitting:
  • t-SNE embeddings (2,000 samples each) show considerable non-overlap between the NonSynth and SynthQA distributions.
  • Token-frequency KDEs reveal small peaks in SynthQA corresponding to structural tokens (e.g., "question", "answer", "summary"), indicating repetitive patterns.
  • Perplexity KDEs on OpenHermes-2.5 and MixedIns show that SynthLM has a shifted distribution with reduced variance relative to BaseLM, consistent with overfitting/collapse-like effects.
- Unlearning restores instruction following at low cost while preserving most gains:
  • The proposed unlearning (1B SynthQA tokens; w_fgt=0.01, w_rpy=1, w_mtn=1) yields UnlearnLM-Chat with better instruction following than BaseLM-Chat and SynthLM-Chat: SSR 42.00 (vs. 39.95 and 38.29), HSR 27.87 (vs. 27.58 and 24.00), MT-Bench 5.85 (vs. 5.45 and 5.39).
  • Base benchmarks remain close to SynthLM with minor trade-offs: UnlearnLM avg 32.28 vs. SynthLM 32.74; coding improves (HumanEval 20.73 vs. 18.90; MBPP 18.60 vs. 18.40); math drops (GSM8K 11.45 vs. 16.60).
  • The perplexity distribution shift is partially corrected after unlearning, reducing pattern overfitting.
- The alternative remedy (RefineLM) requires far more data and still underperforms unlearning on instruction following:
  • Training on 300B NonSynth tokens recovers instruction following (RefineLM-Chat SSR 39.60, HSR 25.22) but not as well as unlearning, and reduces base performance (RefineLM avg 31.42 vs. SynthLM 32.74).
- Synthetic data contributes critically to gains:
  • With 40B tokens, MixSynthLM (2% SynthQA) outperforms NonSynthLM (avg 31.07 vs. 28.66; MMLU 45.00 vs. 40.88), confirming synthetic data's role in boosting base performance.
- The lower-bounded forgetting loss outperforms traditional gradient ascent for unlearning:
  • Gradient ascent causes severe degradation (avg 19.82; C-Eval 26.58; GSM8K 6.82) versus the proposed method (avg 32.28), indicating bounded forgetting is more stable and effective.
Discussion
The findings substantiate the hypothesis that uniform, structurally biased synthetic Q-A data induces pattern overfitting and output distribution shifts, which hurt instruction-following skills after SFT. Despite synthetic data’s clear benefits for base benchmarks, the downstream instruction-following deficit is nontrivial. The proposed unlearning method directly targets the biased synthetic patterns while safeguarding learned knowledge via replay and KL-based stabilization. This approach efficiently restores instruction following (exceeding both Base and Refine models) with minimal additional data (1B tokens for unlearning) and preserves most benchmark gains. The broader implication is that synthetic data should be used judiciously, and post-hoc distribution correction via targeted unlearning can yield robust, instruction-capable models without large retraining costs. The correction of perplexity distributions suggests unlearning mitigates collapse-like behavior, improving generalization to real-world, diverse instructions.
Conclusion
The paper identifies inherent flaws in synthetic Q-A data—distribution mismatch and uniform patterns leading to overfitting—which shift output distributions and degrade instruction following after SFT. It introduces a practical, controllable unlearning strategy based on a lower-bounded forgetting loss plus replay and KL stabilization. Experiments show the method restores instruction-following capabilities beyond both the original base chat model and a heavy retraining baseline, while retaining most benchmark improvements from synthetic pretraining. The work provides guidance for effectively leveraging synthetic data and a low-cost corrective mechanism to enhance robustness and efficiency in LLM training. Future work includes scaling to larger models, extending to broader synthetic data modalities and generators, and refining unlearning objectives to further reduce trade-offs (e.g., in mathematical reasoning).
Limitations
The unlearning strategy can still impair specific capabilities (notably mathematical reasoning). Its scalability to much larger models is untested, and computational efficiency at larger scales needs validation. The study focuses mainly on GPT-4–generated Q-A synthetic data (with an additional test on U33B); other forms of synthetic data remain unexplored. The quality of GPT-4 synthetic data used may not represent the full range of synthetic data qualities and generation methods, potentially affecting generality. More advanced unlearning techniques may further reduce side effects, which warrants future investigation.