Analyzing the growing corpus of clinical trial reports (CTRs) is a significant challenge: the sheer volume of data demands scalable approaches for evaluation and interpretation. Natural Language Processing (NLP) offers promising solutions, with applications in medical evidence understanding, information retrieval, causal relationship identification, and inference of trial outcomes. Integrating Natural Language Inference (NLI) with CTRs holds immense potential for large-scale analysis of experimental medicine. However, large language models (LLMs) face challenges such as shortcut learning, hallucination, and bias when applied to this task. This research addresses these limitations by proposing a novel method that leverages generative language models and biomedical domain knowledge to increase data diversity and improve model robustness.
Literature Review
Existing literature highlights the need for robust NLI models in healthcare due to the potential severity of misinterpretations. SemEval 2024 Task 2 focuses on predicting the logical relationship between CTRs and statements, emphasizing accuracy and robustness. Data augmentation techniques, including synthetic data generation and multi-task learning, have been explored to enhance model generalization and faithful reasoning. Training LLMs on domain-specific medical datasets can also improve performance. The paper builds on this prior work by combining these techniques.
Methodology
The proposed system uses three data augmentation techniques:
1. **Numerical Question Answering (NQA):** From entailed statements, GPT-3.5 generates multiple-choice questions that require numerical reasoning. The DeBERTa model learns to answer these questions, strengthening its numerical reasoning capabilities. A binary cross-entropy loss is used for this auxiliary task and is combined with the main NLI task loss.
2. **Semantic Perturbation:** GPT-3.5 generates semantically altered versions of the original statements, both contradictory and entailed. This diversifies the dataset by introducing varied phrasing while preserving either semantic equivalence or contrast (a prompt sketch follows this list).
3. **Vocabulary Replacement:** A combination of biomedical knowledge-graph embeddings and TF-IDF identifies keywords in statements, which are then replaced with synonyms from the biomedical domain (sketched after this list). This helps align the model's vocabulary with the clinical domain.
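For the semantic-perturbation step, a minimal sketch of how statements might be rewritten with the OpenAI Python SDK is shown below. The prompt text, the `gpt-3.5-turbo` model choice, and the `perturb_statement` helper are illustrative assumptions; the paper's actual prompts are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the paper's actual instructions to GPT-3.5 may differ.
PROMPT = (
    "Rewrite the following clinical trial statement twice: first as a "
    "paraphrase that preserves its meaning (entailed), then as a version "
    "that contradicts it. Label the outputs ENTAILED: and CONTRADICTED:."
)

def perturb_statement(statement: str) -> str:
    """Request a semantics-preserving and a semantics-reversing rewrite."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content
```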
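For the vocabulary-replacement step, the sketch below uses scikit-learn's TF-IDF scoring to pick each statement's most salient terms. The `biomedical_synonyms` table is a hypothetical stand-in for the knowledge-graph-derived synonym source; the paper's embedding-based selection is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synonym table standing in for the biomedical knowledge graph.
biomedical_synonyms = {
    "tumor": "neoplasm",
    "stroke": "cerebrovascular accident",
    "rash": "exanthema",
}

def replace_keywords(statements, top_k=3):
    """Replace each statement's top-k TF-IDF keywords with biomedical
    synonyms, when one is available. Case handling is simplified:
    input statements are assumed to be lowercase."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(statements)
    vocab = vectorizer.get_feature_names_out()

    augmented = []
    for row, statement in zip(tfidf, statements):
        # Rank this statement's terms by TF-IDF weight.
        scores = row.toarray().ravel()
        keywords = [vocab[i] for i in scores.argsort()[::-1][:top_k]]
        new_statement = statement
        for kw in keywords:
            if kw in biomedical_synonyms:
                new_statement = new_statement.replace(kw, biomedical_synonyms[kw])
        augmented.append(new_statement)
    return augmented
```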
The augmented data, together with the original data and CTRs, is used to train a DeBERTa model with multi-task learning, where the loss combines the main NLI objective with the auxiliary NQA objective (sketched below). Experiments were conducted with DeBERTa models of different sizes. Metrics included F1 score, precision, recall, faithfulness, and consistency, evaluated on a control set and on a contrast set with interventions.
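A minimal sketch of the multi-task objective in PyTorch, assuming a shared encoder with two heads: cross-entropy over the two NLI labels (entailment vs. contradiction) and binary cross-entropy over the NQA answer options. The weighting factor `lambda_nqa` and all shapes are illustrative assumptions; the paper's exact head architecture and weighting are not specified here.

```python
import torch
import torch.nn as nn

# Loss terms for the two tasks described above.
nli_loss_fn = nn.CrossEntropyLoss()    # main NLI labels
nqa_loss_fn = nn.BCEWithLogitsLoss()   # one logit per NQA answer option

lambda_nqa = 0.5  # hypothetical weight for the auxiliary task

def multitask_loss(nli_logits, nli_labels, nqa_logits, nqa_labels):
    """Combine the main NLI loss with the auxiliary NQA loss."""
    loss_nli = nli_loss_fn(nli_logits, nli_labels)          # (batch, 2) vs (batch,)
    loss_nqa = nqa_loss_fn(nqa_logits, nqa_labels.float())  # (batch, options)
    return loss_nli + lambda_nqa * loss_nqa

# Example shapes: 2-way NLI, 4 answer options per NQA question.
nli_logits = torch.randn(8, 2)
nli_labels = torch.randint(0, 2, (8,))
nqa_logits = torch.randn(8, 4)
nqa_labels = torch.zeros(8, 4).scatter_(1, torch.randint(0, 4, (8, 1)), 1.0)
loss = multitask_loss(nli_logits, nli_labels, nqa_logits, nqa_labels)
```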
Key Findings
Experiments on the NLI4CT 2024 dataset showed that incorporating all three augmentation methods significantly improved average faithfulness and consistency scores. The larger DeBERTa model benefited more from the augmented data: faithfulness improved by 8.17% for DeBERTa-large and 2.37% for DeBERTa-base. Semantic perturbation contributed most to the performance gains, while vocabulary replacement had a smaller effect. However, a slight performance drop was observed on the control set (unaltered data), suggesting a trade-off between robustness and performance on the original data. This drop may stem from noisy or irrelevant examples produced by the generative models.
Discussion
The findings demonstrate the effectiveness of the proposed data augmentation approach in enhancing the robustness of NLI models for CTR analysis. The use of generative models and biomedical knowledge proved crucial in creating a more diverse and representative dataset. The trade-off between robustness and performance on original data highlights the need for careful generation and validation of synthetic data. The results contribute to the development of more reliable and trustworthy NLI systems for clinical applications.
Conclusion
This paper presents a data augmentation strategy to improve the robustness of NLI models for clinical trial reports. The approach combines generative AI, biomedical knowledge, and multi-task learning to achieve significant improvements in faithfulness and consistency. Future work will focus on generating higher-quality augmented examples, validating perturbed samples, and incorporating external structured knowledge through pre-training on knowledge graphs.
Limitations
The study observed a slight performance decrease on the unaltered test set after adding augmented data. This indicates a potential trade-off between improving robustness and maintaining performance on original data. The generation of noisy or irrelevant augmented examples by the generative AI models may be a contributing factor. Future research will address this limitation by focusing on generating higher-quality augmented examples and filtering out irrelevant data.