DKE-Research at SemEval-2024 Task 2: Incorporating Data Augmentation with Generative Models and Biomedical Knowledge to Enhance Inference Robustness


Y. Wang, Z. Wang, et al.

This work by Yuqi Wang and colleagues addresses the safety and reliability of natural language inference (NLI) for clinical trial report analysis. The approach uses generative models and biomedical knowledge graphs to create diverse synthetic training data, yielding measurable improvements in NLI robustness.

Introduction
The study addresses the challenge of building robust, trustworthy natural language inference (NLI) systems for clinical trial reports (CTRs). Despite strong performance of large language models on general NLI, they remain vulnerable to shortcut learning, hallucinations, and dataset biases, particularly for biomedical text and numerical reasoning. The research goal is to improve robustness and reliability of NLI for CTRs under controlled interventions (semantic-preserving and semantic-altering), enhancing faithfulness and consistency. The authors propose augmenting training data via generative models and biomedical knowledge to reduce bias, increase diversity, and strengthen numerical and domain reasoning, combined with multi-task learning using DeBERTa.
Literature Review
Prior work highlights the need for accurate, faithful reasoning in healthcare NLI. Data augmentation via conditional generation can expand diversity and improve generalization (Liu et al., 2020; Puri et al., 2020; Bayer et al., 2023). Multi-task learning with auxiliary objectives supports faithful reasoning (Li et al., 2022). Domain adaptation through training on biomedical corpora helps encode clinical knowledge (Singhal et al., 2023; Tian et al., 2024). The SemEval 2024 Task 2 (NLI4CT) provides a benchmark focused on both accuracy and robustness to controlled interventions in CTR-based NLI (Jullien et al., 2024; Jullien et al., 2023). Numerical reasoning remains a known weakness for many LMs (Geva et al., 2020). Biomedical vocabulary presents challenges for general-domain pre-trained models, motivating integration of knowledge graphs and domain embeddings (Wang et al., 2018; Zhang et al., 2019; Wang et al., 2023a).
Methodology
System overview: The approach augments entailed statements from the NLI dataset along three axes—numerical reasoning, semantic perturbation, and biomedical vocabulary alignment—then trains DeBERTa models with multi-task learning.

1) Numerical Question Answering (NQA) auxiliary task: Using GPT-3.5, each entailed statement paired with a CTR is converted into a multiple-choice question requiring numerical or quantitative reasoning, with three answer choices and one correct answer grounded in the original statement. A classifier atop DeBERTa predicts whether a candidate choice is correct. The system trains with a combined loss: the main NLI classification loss plus a weighted binary cross-entropy loss for NQA, with a tuning parameter controlling the contribution of the auxiliary task.

2) Semantic perturbation: GPT-3.5 generates two types of variants from each entailed statement: (a) semantic-preserving paraphrases labeled as entailment (guided by prompts such as "paraphrase"), and (b) minimally modified, contradiction-inducing statements labeled as contradiction (guided by prompts such as "contradicted" and "minor changes"). This yields controlled entailed and contradicted examples to enhance robustness against semantic interventions.

3) Vocabulary replacement in the biomedical domain: To align with domain terminology, the method selects the most important non-stopword in a statement using TF-IDF, then replaces this term with a synonym-like substitute from a biomedical vocabulary identified via nearest-neighbor search in a biomedical knowledge graph embedding space (MeSH-based BioWordVec), constrained to the same part of speech. This creates adversarial lexical variants that test vocabulary robustness.

Model and training: The backbone is DeBERTa (base and large variants). Input format for NLI: [CLS] CTR [SEP] claim [SEP]. Implementation uses Hugging Face models, PyTorch 2.1.1, the Adam optimizer, learning rate 5e-6, batch size 4, max sequence length 512, and up to 20 epochs with early stopping.
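The combined training objective described above—NLI cross-entropy plus a weighted binary cross-entropy for the auxiliary NQA task—can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the class name, head dimensions, and weight value are assumptions, and the shared DeBERTa encoder is omitted (the heads below take its [CLS] representations as input).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    """Two classification heads sharing one encoder's [CLS] representation:
    an NLI head (entailment vs. contradiction) and an NQA head scoring
    whether a candidate answer choice is correct. Hypothetical sketch."""

    def __init__(self, hidden_size=768, nqa_weight=0.5):
        super().__init__()
        self.nli_head = nn.Linear(hidden_size, 2)   # entailment / contradiction
        self.nqa_head = nn.Linear(hidden_size, 1)   # choice correct? (binary)
        self.nqa_weight = nqa_weight                # tuning parameter for NQA loss

    def forward(self, nli_cls, nli_labels, nqa_cls=None, nqa_labels=None):
        # nli_cls / nqa_cls: [batch, hidden] pooled representations from DeBERTa
        nli_logits = self.nli_head(nli_cls)
        loss = F.cross_entropy(nli_logits, nli_labels)          # main NLI loss
        if nqa_cls is not None:
            nqa_logits = self.nqa_head(nqa_cls).squeeze(-1)
            # weighted binary cross-entropy for the auxiliary NQA task
            loss = loss + self.nqa_weight * F.binary_cross_entropy_with_logits(
                nqa_logits, nqa_labels.float())
        return loss, nli_logits
```

In training, both heads backpropagate through the shared encoder, so the auxiliary numerical-reasoning signal shapes the representations used for the main NLI prediction.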
NLTK supports preprocessing (stopword removal and POS tagging). Prompts for NQA and semantic perturbation are specified to standardize generation.

Dataset and metrics: Experiments use NLI4CT 2024. The training data matches NLI4CT 2023; the validation and test sets include perturbed samples. Dataset statistics:
- Train: 1,700 (850 entailment, 850 contradiction)
- Validation: 2,142 (100 entailment, 100 contradiction, plus 1,606 semantic-altering and 336 semantic-preserving perturbations)
- Test: 5,500 (250 entailment, 250 contradiction, plus 4,136 altering and 864 preserving)
Metrics include F1/precision/recall on the control (unaltered) sets, faithfulness for semantic-altering interventions, and consistency for semantic-preserving interventions.
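The term-selection step of the vocabulary-replacement strategy—picking the highest-TF-IDF non-stopword in a statement—can be sketched as follows. This is a simplified, self-contained stand-in: the stopword list and tokenizer are illustrative, and the paper's subsequent steps (POS-constrained nearest-neighbor lookup in MeSH-based BioWordVec embeddings) are omitted.

```python
import math
import re
from collections import Counter

# Illustrative stopword list; the paper uses NLTK's stopword corpus.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "was", "were", "and", "with"}

def tfidf_top_term(statement, corpus):
    """Return the non-stopword in `statement` with the highest TF-IDF score,
    computed against a reference `corpus` of documents (e.g. CTR sections).
    Hypothetical sketch of the term-selection step only."""
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    tokens = [t for t in tokenize(statement) if t not in STOPWORDS]
    counts = Counter(tokens)
    if not counts:
        return None

    def score(term):
        tf = counts[term] / len(tokens)
        df = sum(term in d for d in docs)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        return tf * idf

    return max(counts, key=score)
```

The selected term would then be swapped for its nearest biomedical-embedding neighbor with the same part of speech, producing the adversarial lexical variant described above.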
Key Findings
- Augmentations substantially improved robustness: averaged across faithfulness and consistency, gains were 8.17% for DeBERTa-large and 2.37% for DeBERTa-base over their respective baselines.
- The best system ranked 12th in faithfulness and 8th in consistency among 32 participating teams on NLI4CT 2024.
- Semantic perturbation contributed the largest gains for both model sizes; vocabulary replacement had smaller, incremental effects.
- A trade-off was observed: incorporating all augmented data reduced control-set F1 by 3.16% for DeBERTa-large and 0.48% for DeBERTa-base, indicating some degradation on unaltered inputs alongside improved robustness to interventions.
- Qualitative analysis suggests occasional noise from generative augmentation (e.g., NQA questions not grounded in the CTR) and suboptimal term substitutions despite high embedding similarity, which may explain the control-set performance drop.
Discussion
The proposed augmentation pipeline directly targets known weaknesses of NLI models on CTRs: numerical reasoning gaps, sensitivity to semantic perturbations, and limited domain vocabulary handling. The observed improvements in faithfulness (for semantic-altering cases) and consistency (for semantic-preserving cases) demonstrate that training with targeted synthetic data reduces shortcut learning and increases robustness under controlled interventions. Semantic perturbation proved the most impactful, likely because it modifies entire statements to introduce diverse yet label-controlled training examples, whereas single-word lexical swaps provide more limited variation. The small decline in control-set F1 underscores a trade-off between robustness and baseline accuracy, attributable to occasional noise or mismatches in generated and substituted content. Overall, the results validate the multi-task, augmentation-centric strategy as an effective path to more reliable clinical NLI, with room to refine data quality to mitigate performance regressions on unperturbed data.
Conclusion
The paper presents a data-centric approach to enhance robustness of biomedical NLI for clinical trial reports by combining three augmentation strategies—numerical QA generation, semantic perturbation, and biomedical vocabulary replacement—with multi-task learning on DeBERTa. Experiments on NLI4CT 2024 show notable gains in faithfulness and consistency and competitive leaderboard rankings. Despite a modest decrease in control-set performance, the approach reduces sensitivity to interventions and improves reliability. Future work includes: (1) improving the quality of numerical QA generation to avoid ungrounded or irrelevant questions; (2) validating and filtering perturbed samples to remove noisy or illogical instances; (3) leveraging external structured knowledge through pre-training and integration of knowledge graphs beyond lexical substitution to provide richer contextual domain information.
Limitations
- Slight degradation on unaltered (control) data after augmentation indicates a robustness–accuracy trade-off.
- Generative augmentation (especially NQA) can produce noisy or irrelevant items not grounded in the CTR, potentially hurting performance.
- Vocabulary replacement via embedding similarity may yield substitutions that are not contextually appropriate despite high similarity, introducing label noise.
- The approach relies on GPT-3.5 and external biomedical embeddings, which may introduce biases or errors and entail computational and access constraints.