Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Medicine and Health


L. Wang, X. Chen, et al.

Discover how prompt engineering can enhance the reliability of Large Language Models in answering medical queries. Learn from a study conducted by Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li, highlighting the crucial role of effective prompting strategies in medical accuracy.

~3 min • Beginner • English
Introduction
Large language models (LLMs) have demonstrated strong capabilities across natural language processing tasks and are increasingly explored in medicine for diagnosis, examinations, and alignment with clinical guidelines. Despite this promise, performance in the medical domain remains imperfect: GPT-4 achieved 39% diagnostic concordance in complex cases and roughly 60% guideline consistency for digestive diseases; 18% of Med-PaLM answers contained inappropriate or incorrect content; and the self-consistency of LLM outputs is a known concern. Prompt engineering, particularly techniques such as chain-of-thought (COT) and tree-of-thoughts (TOT), can substantially affect LLM performance on reasoning-heavy tasks. Early clinical applications have tried COT, few-shot, and self-consistency prompting, but few studies compare prompt types on medical questions or assess prompt designs tailored to medicine. Given the high prevalence and clinical complexity of osteoarthritis (OA), and the reliance of patients and clinicians on online information, the authors used OA guideline questions to test whether prompt engineering improves accuracy and reliability. They hypothesized that different prompt types would yield different consistency and reliability, influenced by model architecture, parameters, training data, and fine-tuning.
Literature Review
Prior studies report mixed medical performance for LLMs: GPT-4 showed 39% concordance on complex diagnoses and roughly 60% guideline alignment in digestive diseases, and 18% of Med-PaLM responses contained inappropriate or incorrect content. Prompt engineering methods such as COT and TOT have improved reasoning in mathematical domains, suggesting potential benefits for clinical tasks. Limited clinical work directly compares prompts; for example, Karan et al. evaluated COT, few-shot, and self-consistency prompting with mixed results across datasets (self-consistency improved MedQA performance but reduced PubMedQA performance). Summaries aimed at medical professionals highlight prompt design principles, but few studies systematically assess prompt effects on medical guideline adherence or develop prompts tailored to medical questions. Parameterization (e.g., temperature), architectural differences, training data, and fine-tuning further modulate outcomes, underscoring the need to evaluate these factors alongside prompt design.
Methodology
Study design: Pilot evaluation of how prompt engineering affects LLM performance on osteoarthritis (OA) guideline adherence and answer reliability.

Guidelines and items: The AAOS evidence-based clinical practice guidelines for OA served as ground truth. AAOS recommendations carry one of four evidence levels: strong, moderate, limited, or consensus. The study evaluated 34 recommendation items (8 strong, 8 moderate, 16 limited, 2 consensus), and each question was posed 5 times per prompt-model combination.

Prompts: Four prompt types were designed. (1) Input-Output (IO): a direct instruction with a rating request (integer 1-4). (2) Zero-shot Chain-of-Thought (0-COT): adds "think step by step" reasoning to IO. (3) Performed Chain-of-Thought (P-COT): specifies a multi-step reasoning structure. (4) Reflection of Thoughts (ROT): multi-step reasoning with simulated multi-expert independent reasoning, followed by discussion and backtracking to reach consensus. All prompts instructed the LLMs to output a rating score. An illustrative sketch of these templates and the query loop appears after this section.

Models and settings: Nine LLM configurations were tested. GPT-4: gpt-4-Web (web UI, release notes July 20, 2023), gpt-4-API (gpt-4-0613, temperature=1), and gpt-4-API-0 (gpt-4-0613, temperature=0). Bard: web UI, Oct 30, 2023. GPT-3.5: gpt-3.5-Web (web UI, Oct 17, 2023), gpt-3.5-API (gpt-3.5-turbo-0613, default settings), gpt-3.5-API-0 (gpt-3.5-turbo-0613, temperature=0), gpt-3.5-ft (fine-tuned gpt-3.5-turbo-0613, temperature=1), and gpt-3.5-ft-0 (the same fine-tuned model, temperature=0). Fine-tuning used IO-style prompts with rationales derived from the AAOS recommendations; dataset details are provided in the supplementary materials, and a hypothetical example of the training-record format also appears below.

Outcomes: The primary outcome was consistency with the strength of the AAOS guideline recommendation (percentage of responses matching the guideline level), with subgroup analyses by evidence level (strong, moderate, limited, consensus). The secondary outcome was reliability, measured by Fleiss' kappa across the five repetitions per item.

Invalid data handling: Category A responses (no final rating) were treated as an independent classification in the kappa analyses; Category B responses (non-integer ratings) were recorded at their numeric value (e.g., 2.5). In scatter plots, Category A was flagged as missing. Invalid rates were high in certain settings (e.g., 81.1% missing for 0-COT with gpt-3.5-API-0), whereas gpt-4-Web produced few invalid entries (14/680 across prompts).

Analysis: Overall and per-level consistencies were computed for each prompt-model combination; multiple comparisons were conducted within models (e.g., gpt-4-Web), with Bonferroni corrections applied where appropriate. Reliability is reported as kappa with 95% CIs.
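To make the prompt designs and query settings concrete, here is a minimal sketch assuming the pre-1.0 openai Python client (current for the gpt-4-0613 and gpt-3.5-turbo-0613 snapshots used in the study). The prompt strings are paraphrases of the four designs, not the study's exact supplementary wording, and the helper functions ask and collect are our own; only the model names, temperatures, and five-repetition design follow the Methodology.

# Illustrative sketch: approximate prompt templates and repeated querying.
# Prompt wordings are paraphrases, not the study's exact supplementary text.
import openai  # assumes the pre-1.0 client (openai<1.0)

openai.api_key = "YOUR_API_KEY"

PROMPTS = {
    "IO":    ("Rate the evidence supporting the following statement, answering only "
              "with an integer from 1 to 4: {item}"),
    "0-COT": ("Think step by step, then rate the evidence supporting the following "
              "statement with an integer from 1 to 4: {item}"),
    "P-COT": ("Step 1: restate the claim. Step 2: recall relevant evidence. "
              "Step 3: weigh its quality. Step 4: give a final integer rating (1-4). "
              "Statement: {item}"),
    "ROT":   ("Simulate several independent experts who each reason step by step about "
              "the statement, then discuss, backtrack on disagreements, and report one "
              "consensus integer rating from 1 to 4. Statement: {item}"),
}
# Note: the mapping of the 1-4 scale onto AAOS evidence levels follows the paper's
# supplement and is not reproduced here.

# API-accessed configurations from the Methodology (web UIs were queried manually).
MODELS = {
    "gpt-4-API":     {"model": "gpt-4-0613", "temperature": 1},
    "gpt-4-API-0":   {"model": "gpt-4-0613", "temperature": 0},
    "gpt-3.5-API":   {"model": "gpt-3.5-turbo-0613"},               # default temperature
    "gpt-3.5-API-0": {"model": "gpt-3.5-turbo-0613", "temperature": 0},
}

def ask(model_cfg: dict, prompt: str) -> str:
    """Send one chat-completion request and return the raw answer text."""
    resp = openai.ChatCompletion.create(
        messages=[{"role": "user", "content": prompt}], **model_cfg
    )
    return resp["choices"][0]["message"]["content"]

def collect(items: list[str], n_repeats: int = 5) -> dict:
    """Ask every item n_repeats times for every prompt-model combination."""
    answers = {}
    for item in items:
        for p_name, template in PROMPTS.items():
            for m_name, cfg in MODELS.items():
                answers[(item, p_name, m_name)] = [
                    ask(cfg, template.format(item=item)) for _ in range(n_repeats)
                ]
    return answers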
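The fine-tuned gpt-3.5 variants were trained on IO-style prompts paired with rationales derived from the AAOS recommendations. Since the actual records are given only in the supplementary materials, the snippet below merely illustrates the JSONL chat format expected by the gpt-3.5-turbo fine-tuning endpoint, with placeholder content standing in for the real items and rationales.

# Hypothetical fine-tuning record builder: the real item wording and rationale text
# come from the study's supplementary materials and are not reproduced here.
import json

def make_record(io_prompt: str, rating: int, rationale: str) -> dict:
    """One chat-format training example for the gpt-3.5-turbo fine-tuning endpoint."""
    return {
        "messages": [
            {"role": "user", "content": io_prompt},
            {"role": "assistant", "content": f"{rationale} Final rating: {rating}"},
        ]
    }

examples = [
    make_record(
        io_prompt="Rate the evidence for: <AAOS recommendation text> (integer 1-4).",
        rating=4,
        rationale="<rationale derived from the AAOS recommendation>",
    ),
]

with open("oa_finetune.jsonl", "w", encoding="utf-8") as f:
    for rec in examples:
        f.write(json.dumps(rec) + "\n")

# The JSONL file would then be uploaded and a fine-tuning job created for
# gpt-3.5-turbo-0613 (e.g., via openai.File.create and openai.FineTuningJob.create
# in the pre-1.0 client), yielding the gpt-3.5-ft / gpt-3.5-ft-0 configurations.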
Key Findings
- Overall performance: gpt-4-Web outperformed the other models. Across its four prompts, consistencies ranged from ~50.6% to ~63%. The best overall combination was gpt-4-Web with ROT prompting, at 62.9% consistency.
- Top combinations (examples): gpt-4-Web-ROT (62.9%); gpt-4-Web-IO (55.9%); gpt-3.5-ft-0-IO (55.3%); gpt-4-Web-0-COT (≈52.9%); gpt-4-API-0-ROT (51.2%); gpt-4-Web-P-COT (50.6%); Bard-0-COT (44.1%); gpt-3.5-Web-P-COT (43.5%).
- Model families: Among GPT-4 variants, ROT was consistently the best-performing prompt (≈35.3%–63%). Among GPT-3.5 variants, IO was generally best (≈27.1%–55.3%), except for gpt-3.5-Web, where P-COT led (43.5%). Bard's best prompt was 0-COT (44.1%). Across prompts, GPT-4 models ranged 8.8%–62.9%, GPT-3.5 models 4.7%–55.3%, and Bard 19.4%–44.1%.
- Subgroup by evidence level: For gpt-4-Web on strong-level items, ROT (77.5%) and P-COT (75%) were significantly higher than IO (30%). For limited-level items in gpt-4-Web, P-COT (50%) was significantly lower than ROT (75%) and IO (82.5%) after Bonferroni correction. Moderate-level items in gpt-4-Web showed no significant differences among prompts (30%–40%). Consensus-level items (n=2) were reported descriptively, without statistical testing.
- Reliability (Fleiss' kappa): Across all prompts and models, kappa ranged from -0.002 to 0.984. IO prompting in gpt-3.5-API-0 and gpt-3.5-ft-0 achieved near-perfect reliability (0.984 and 0.982, respectively), indicating near-identical repeated answers. gpt-4-API-0 with P-COT showed substantial reliability (0.660). gpt-4-Web prompts had fair-to-moderate reliability (0.334–0.525). Many other settings were at or below moderate (<0.6), highlighting variability in repeated responses (a sketch of how consistency and kappa are computed follows this list).
- Invalid data: High rates occurred in some settings (e.g., 81.1% missing final ratings for 0-COT in gpt-3.5-API-0), while gpt-4-Web had relatively few invalid outputs (14/680 across prompts).
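As a minimal, self-contained illustration of the two outcome metrics, guideline consistency and Fleiss' kappa across five repetitions, the sketch below applies statsmodels' inter-rater tools to an invented 34 x 5 ratings matrix. The array shape follows the Methodology, but the data, noise level, and variable names are ours.

# Minimal sketch of the two outcome metrics on invented data:
# consistency with the AAOS level and Fleiss' kappa across five repetitions.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

n_items, n_repeats = 34, 5
guideline_level = rng.integers(1, 5, size=n_items)           # toy ground-truth rating per item
ratings = np.repeat(guideline_level[:, None], n_repeats, 1)  # start from perfect agreement
noise = rng.random((n_items, n_repeats)) < 0.2               # corrupt ~20% of answers (toy)
ratings[noise] = rng.integers(1, 5, size=noise.sum())

# Primary outcome: percentage of the 34*5 responses matching the guideline level.
consistency = (ratings == guideline_level[:, None]).mean() * 100
print(f"consistency with guideline: {consistency:.1f}%")

# Secondary outcome: Fleiss' kappa, treating the five repetitions as five "raters".
counts, _ = aggregate_raters(ratings)   # items x categories count table
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.3f}")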
Discussion
The study shows that prompt engineering materially affects both LLM adherence to clinical guidelines and the repeatability of responses. GPT-4 accessed via the web interface generally outperformed API-accessed GPT-4 and the other models, suggesting that deployment environment and internal parameters (e.g., temperature) influence outcomes. Differences between gpt-4-Web, gpt-4-API (temperature=1), and gpt-4-API-0 (temperature=0) underscore the role of parameter tuning in clinical question answering. Fine-tuning improved performance only under certain prompting conditions (e.g., IO for gpt-3.5-ft-0) and degraded it under others, indicating that alignment between the structure of the fine-tuning data and the inference-time prompt matters.

ROT prompting, a structured, reflective, multi-expert simulation, performed best on gpt-4-Web overall and especially on strong-evidence items, likely because it encourages error checking and consensus formation. However, ROT was not universally optimal across models (GPT-3.5 variants favored IO and Bard favored 0-COT), highlighting model-specific prompt effectiveness. Reliability analyses showed that repeated answers can vary substantially; near-deterministic behavior occurred mainly with IO prompting at temperature=0 for GPT-3.5. This supports the practical recommendations to ask LLMs the same question multiple times to capture variability and to use low-temperature settings when deterministic behavior is needed. Collectively, the results indicate that maximizing LLM clinical performance requires integrating prompt engineering with appropriate model selection, parameterization, and fine-tuning strategies. The work also emphasizes the need to refine prompts (e.g., ROT) for specific clinical tasks and to develop robust evaluation frameworks.
Conclusion
This study systematically evaluated how different prompting strategies and model configurations affect LLM consistency with AAOS OA guideline recommendations and the reliability of repeated responses. gpt-4-Web with ROT prompting achieved the highest overall consistency (62.9%) and excelled on strong-evidence items (77.5%). Prompt effectiveness was model-dependent, and reliability varied widely, with only certain temperature=0 GPT-3.5 setups showing near-perfect repeatability. The findings suggest that carefully designed prompts can improve the quality of medical answers but should be paired with suitable model settings and fine-tuning. Future work should optimize prompts such as ROT for specific clinical scenarios, establish objective benchmarks and human-in-the-loop evaluation frameworks, and develop guidance for patients and clinicians on robust, model-appropriate prompting in medical contexts.
Limitations
- No real-time interaction with, or validation by, healthcare professionals or patients; evaluation relied on guideline-based expected answers without prospective clinical validation.
- Potential prompt-robustness issues: minor wording changes within the same framework may alter outputs, and effectiveness is model-dependent.
- Limited generalizability beyond OA and the AAOS guidelines; the consensus-level subgroup was small (n=2) and received no inferential testing.
- High invalid-output rates in certain model-prompt combinations (e.g., 0-COT with gpt-3.5-API-0), which may bias reliability and consistency estimates.
- The fine-tuning dataset, designed around IO prompting, may have favored IO at inference; fine-tuning may not ensure correct reasoning chains, leading to plausible but incorrect rationales.
- Differences between web and API deployments, and unspecified internal settings, could confound comparisons across models.
- The study did not construct or apply a standardized external benchmark or human expert rating framework beyond the guideline-concordance metric.