DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness


Y. Wang, Z. Wang, et al.

This work by Yuqi Wang and colleagues addresses the safety and reliability of natural language inference (NLI) for clinical trial report analysis. The approach uses generative models and biomedical knowledge graphs to create diverse synthetic training data, yielding measurable improvements in NLI robustness.

Introduction
The study addresses the challenge of building robust, trustworthy natural language inference (NLI) systems for clinical trial reports (CTRs). Despite strong performance of large language models on general NLI, they remain vulnerable to shortcut learning, hallucinations, and dataset biases, particularly for biomedical text and numerical reasoning. The research goal is to improve robustness and reliability of NLI for CTRs under controlled interventions (semantic-preserving and semantic-altering), enhancing faithfulness and consistency. The authors propose augmenting training data via generative models and biomedical knowledge to reduce bias, increase diversity, and strengthen numerical and domain reasoning, combined with multi-task learning using DeBERTa.
Literature Review
Prior work highlights the need for accurate, faithful reasoning in healthcare NLI. Data augmentation via conditional generation can expand diversity and improve generalization (Liu et al., 2020; Puri et al., 2020; Bayer et al., 2023). Multi-task learning with auxiliary objectives supports faithful reasoning (Li et al., 2022). Domain adaptation through training on biomedical corpora helps encode clinical knowledge (Singhal et al., 2023; Tian et al., 2024). The SemEval 2024 Task 2 (NLI4CT) provides a benchmark focused on both accuracy and robustness to controlled interventions in CTR-based NLI (Jullien et al., 2024; Jullien et al., 2023). Numerical reasoning remains a known weakness for many LMs (Geva et al., 2020). Biomedical vocabulary presents challenges for general-domain pre-trained models, motivating integration of knowledge graphs and domain embeddings (Wang et al., 2018; Zhang et al., 2019; Wang et al., 2023a).
Methodology
System overview: The approach augments entailed statements from the NLI dataset along three axes—numerical reasoning, semantic perturbation, and biomedical vocabulary alignment—then trains DeBERTa models with multi-task learning.

1) Numerical Question Answering (NQA) auxiliary task: Using GPT-3.5, each entailed statement paired with a CTR is converted into a multiple-choice question requiring numerical or quantitative reasoning, with three answer choices and one correct answer grounded in the original statement. A classifier atop DeBERTa predicts whether a candidate choice is correct. The system trains with a combined loss: the main NLI classification loss plus a weighted binary cross-entropy loss for NQA, with a tuning parameter controlling the contribution of the auxiliary task.

2) Semantic perturbation: GPT-3.5 generates two types of variants from each entailed statement: (a) semantic-preserving paraphrases labeled as entailment (guided by prompts such as "paraphrase"), and (b) minimally modified, contradiction-inducing statements labeled as contradiction (guided by prompts such as "contradicted" and "minor changes"). This yields controlled entailed and contradicted examples to enhance robustness against semantic interventions.

3) Vocabulary replacement in the biomedical domain: To align with domain terminology, the method selects the most important non-stopword in a statement using TF-IDF, then replaces this term with a synonym-like substitute from a biomedical vocabulary identified via nearest-neighbor search in a biomedical knowledge graph embedding space (MeSH-based BioWordVec), constrained to the same part of speech. This creates adversarial lexical variants that test vocabulary robustness.

Model and training: The backbone is DeBERTa (base and large variants). Input format for NLI: [CLS] CTR [SEP] claim [SEP]. Implementation uses Hugging Face models, PyTorch 2.1.1, the Adam optimizer, learning rate 5e-6, batch size 4, max sequence length 512, and up to 20 epochs with early stopping.
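The combined training objective described above—NLI cross-entropy plus a weighted binary cross-entropy for the auxiliary NQA task—can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the class name, head dimensions, and weight value are assumptions, and the shared DeBERTa encoder is omitted (the heads below take its [CLS] representations as input).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Two classification heads sharing one encoder's [CLS] representation:
    an NLI head (entailment vs. contradiction) and an NQA head scoring
    whether a candidate answer choice is correct. Hypothetical sketch."""

    def __init__(self, hidden_size=768, nqa_weight=0.5):
        super().__init__()
        self.nli_head = nn.Linear(hidden_size, 2)   # entailment / contradiction
        self.nqa_head = nn.Linear(hidden_size, 1)   # choice correct? (binary)
        self.nqa_weight = nqa_weight                # tuning parameter for NQA loss

    def forward(self, nli_cls, nli_labels, nqa_cls=None, nqa_labels=None):
        # nli_cls / nqa_cls: [batch, hidden] pooled representations from DeBERTa
        nli_logits = self.nli_head(nli_cls)
        loss = F.cross_entropy(nli_logits, nli_labels)          # main NLI loss
        if nqa_cls is not None:
            nqa_logits = self.nqa_head(nqa_cls).squeeze(-1)
            # weighted binary cross-entropy for the auxiliary NQA task
            loss = loss + self.nqa_weight * F.binary_cross_entropy_with_logits(
                nqa_logits, nqa_labels.float())
        return loss, nli_logits
```

In training, both heads backpropagate through the shared encoder, so the auxiliary numerical-reasoning signal shapes the representations used for the main NLI prediction.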
NLTK supports preprocessing (stopword removal and POS tagging). Prompts for NQA and semantic perturbation are specified to standardize generation.

Dataset and metrics: Experiments use NLI4CT 2024. The training data matches NLI4CT 2023; the validation and test sets include perturbed samples. Dataset statistics:
- Train: 1,700 (850 entailment, 850 contradiction)
- Validation: 2,142 (100 entailment, 100 contradiction, plus 1,606 semantic-altering and 336 semantic-preserving perturbations)
- Test: 5,500 (250 entailment, 250 contradiction, plus 4,136 altering and 864 preserving)
Metrics include F1/precision/recall on the control (unaltered) sets, faithfulness for semantic-altering interventions, and consistency for semantic-preserving interventions.
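The term-selection step of the vocabulary-replacement strategy—picking the highest-TF-IDF non-stopword in a statement—can be sketched as follows. This is a simplified, self-contained stand-in: the stopword list and tokenizer are illustrative, and the paper's subsequent steps (POS-constrained nearest-neighbor lookup in MeSH-based BioWordVec embeddings) are omitted.

```python
import math
import re
from collections import Counter

# Illustrative stopword list; the paper uses NLTK's stopword corpus.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "was", "were", "and", "with"}

def tfidf_top_term(statement, corpus):
    """Return the non-stopword in `statement` with the highest TF-IDF score,
    computed against a reference `corpus` of documents (e.g. CTR sections).
    Hypothetical sketch of the term-selection step only."""
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    tokens = [t for t in tokenize(statement) if t not in STOPWORDS]
    counts = Counter(tokens)
    if not counts:
        return None

    def score(term):
        tf = counts[term] / len(tokens)
        df = sum(term in d for d in docs)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        return tf * idf

    return max(counts, key=score)
```

The selected term would then be swapped for its nearest biomedical-embedding neighbor with the same part of speech, producing the adversarial lexical variant described above.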
Key Findings
- Augmentations substantially improved robustness: averaged across faithfulness and consistency, gains were 8.17% for DeBERTa-large and 2.37% for DeBERTa-base over their respective baselines.
- The best system ranked 12th in faithfulness and 8th in consistency among 32 participating teams on NLI4CT 2024.
- Semantic perturbation contributed the largest gains for both model sizes; vocabulary replacement had smaller, incremental effects.
- A trade-off was observed: incorporating all augmented data reduced control-set F1 by 3.16% for DeBERTa-large and 0.48% for DeBERTa-base, indicating some degradation on unaltered inputs alongside improved robustness to interventions.
- Qualitative analysis suggests occasional noise from generative augmentation (e.g., NQA questions not grounded in the CTR) and suboptimal term substitutions despite high embedding similarity, which may explain the control-set performance drop.
Discussion
The proposed augmentation pipeline directly targets known weaknesses of NLI models on CTRs: numerical reasoning gaps, sensitivity to semantic perturbations, and limited domain vocabulary handling. The observed improvements in faithfulness (for semantic-altering cases) and consistency (for semantic-preserving cases) demonstrate that training with targeted synthetic data reduces shortcut learning and increases robustness under controlled interventions. Semantic perturbation proved the most impactful, likely because it modifies entire statements to introduce diverse yet label-controlled training examples, whereas single-word lexical swaps provide more limited variation. The small decline in control-set F1 underscores a trade-off between robustness and baseline accuracy, attributable to occasional noise or mismatches in generated and substituted content. Overall, the results validate the multi-task, augmentation-centric strategy as an effective path to more reliable clinical NLI, with room to refine data quality to mitigate performance regressions on unperturbed data.
Conclusion
The paper presents a data-centric approach to enhance robustness of biomedical NLI for clinical trial reports by combining three augmentation strategies—numerical QA generation, semantic perturbation, and biomedical vocabulary replacement—with multi-task learning on DeBERTa. Experiments on NLI4CT 2024 show notable gains in faithfulness and consistency and competitive leaderboard rankings. Despite a modest decrease in control-set performance, the approach reduces sensitivity to interventions and improves reliability. Future work includes: (1) improving the quality of numerical QA generation to avoid ungrounded or irrelevant questions; (2) validating and filtering perturbed samples to remove noisy or illogical instances; (3) leveraging external structured knowledge through pre-training and integration of knowledge graphs beyond lexical substitution to provide richer contextual domain information.
Limitations
- Slight degradation on unaltered (control) data after augmentation indicates a robustness–accuracy trade-off.
- Generative augmentation (especially NQA) can produce noisy or irrelevant items not grounded in the CTR, potentially hurting performance.
- Vocabulary replacement via embedding similarity may yield substitutions that are not contextually appropriate despite high similarity, introducing label noise.
- The approach relies on GPT-3.5 and external biomedical embeddings, which may introduce biases or errors and entail computational and access constraints.