The increasing use of Large Language Models (LLMs) in clinical medicine necessitates effective knowledge transfer from computer science to medical applications. Prompt engineering, the practice of designing and optimizing prompts to elicit desired responses from LLMs, offers potential in this area. While LLMs show promise in various NLP tasks, their application in medicine faces challenges such as inconsistent answers to the same question and variable agreement with established guidelines. Studies have reported inconsistencies in LLM diagnoses and guideline adherence, highlighting the need for improvement. Prompt engineering has demonstrated the ability to improve LLM performance in computer science, with techniques like chain of thought (COT) and tree of thoughts (TOT) prompting showing improved results in mathematical problem-solving. However, research on prompt engineering's impact on LLMs answering medical questions is limited. This study focuses on osteoarthritis (OA), a prevalent and complex disease, to investigate whether prompt engineering can enhance the quality of LLM responses to medical queries. The study hypothesizes that different prompts will lead to varying consistency and reliability, and that prompt effectiveness will be influenced by factors such as the specific LLM used.
Literature Review
Existing literature highlights both the potential and the limitations of LLMs in medical applications. Several studies have explored LLMs' use in diagnosis, medical examinations, and guideline consistency assessments, but inconsistencies remain. Some research demonstrates the application of prompt engineering techniques such as COT, few-shot, and self-consistency prompting in medical contexts. However, little research has directly compared the performance of different prompts in medical question answering or examined whether specialized medical prompts need to be developed. Most studies focus on the results of using LLMs rather than on optimizing their application in clinical medicine. This study addresses this gap by examining the impact of various prompt engineering techniques on the accuracy and reliability of LLM responses related to OA management.
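For illustration, the prompt styles discussed above differ mainly in the instructions wrapped around the same clinical question. The snippet below is a minimal, hypothetical sketch; the question text and wording are placeholders, not prompts drawn from the cited studies.

```python
# Hypothetical prompt wrappers illustrating how IO, zero-shot COT, and few-shot
# styles differ around the same clinical question (placeholder wording only).
QUESTION = "Should topical NSAIDs be recommended for knee osteoarthritis? Rate 1-5."

# Input-Output: the bare question, with no added reasoning instructions.
io_prompt = QUESTION

# Zero-shot chain of thought: ask the model to reason before answering.
zero_shot_cot_prompt = QUESTION + "\nLet's think step by step before giving the final rating."

# Few-shot: one or more worked examples precede the target question.
few_shot_prompt = (
    "Q: Should exercise be recommended for knee osteoarthritis? Rate 1-5.\n"
    "A: 5\n\n"
    f"Q: {QUESTION}\nA:"
)
```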
Methodology
This study used the 2019 AAOS evidence-based guidelines for osteoarthritis (OA) as the gold standard for evaluating LLM responses. Four types of prompts were designed: Input-Output (IO), 0-shot Chain of Thought (0-COT), Performed Chain of Thought (P-COT), and Reflection of Thoughts (ROT). Each of the nine LLMs (three versions of GPT-4, five versions of GPT-3.5 including fine-tuned versions, and Bard) was asked the same set of OA-related questions five times using each prompt. The consistency of the LLM responses with the AAOS guidelines was assessed for each prompt and model. Consistency was calculated as the percentage of responses that matched the guideline recommendations for each evidence level (strong, moderate, limited, consensus). The reliability of each LLM and prompt combination was assessed using Fleiss' kappa statistic, measuring the agreement between the five responses to the same question. Subgroup analyses were performed to examine the performance of different prompts across different evidence strength levels. Invalid data were categorized and handled according to a pre-defined procedure, considering missing data and non-integer ratings.
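As a rough illustration of this evaluation, the sketch below computes an overall consistency rate and Fleiss' kappa across repeated runs. It assumes responses have already been mapped to integer ratings and to the label implied by each AAOS recommendation; function and variable names are illustrative rather than taken from the study's code, and the study reported consistency per evidence level rather than overall.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def consistency_rate(ratings, guideline):
    """Fraction of responses (n_questions x n_runs) matching the guideline label per question."""
    ratings = np.asarray(ratings)
    guideline = np.asarray(guideline).reshape(-1, 1)
    return float((ratings == guideline).mean())

def repeat_reliability(ratings):
    """Fleiss' kappa across repeated runs, treating each run as a 'rater'."""
    table, _ = aggregate_raters(np.asarray(ratings))  # questions x categories count table
    return fleiss_kappa(table, method="fleiss")

# Hypothetical example: 4 questions, 5 repeated responses each, ratings on a 1-5 scale.
ratings = [[5, 5, 4, 5, 5],
           [3, 3, 3, 3, 3],
           [2, 4, 2, 2, 1],
           [5, 5, 5, 4, 5]]
guideline = [5, 3, 2, 5]  # label implied by each AAOS recommendation
print(consistency_rate(ratings, guideline), repeat_reliability(ratings))
```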
Key Findings
The results showed significant variation in consistency and reliability across LLMs and prompts. GPT-4-Web outperformed the other models, with consistency rates ranging from 50.6% to 62.9% across the four prompts, and the combination of GPT-4-Web and ROT prompting achieved the highest overall consistency (62.9%). The other models showed considerably lower consistency rates (4.7% to 55.3%). Subgroup analysis revealed differences in performance across evidence strength levels: for strong recommendations, GPT-4-Web with ROT prompting showed significantly higher consistency (77.5%) than the other prompts. Fleiss' kappa values, indicating reliability, ranged from -0.002 to 0.984. IO prompting with GPT-3.5-ft-0 and GPT-3.5-API-0 showed near-perfect reliability (kappa values close to 1), while other combinations exhibited lower reliability. A substantial proportion of invalid data (especially under 0-COT prompting for GPT-3.5-API-0) was observed and may have influenced the results. Comparison with other studies indicated that GPT-4 generally outperformed GPT-3.5 and Bard in clinical medicine applications. Differences were also observed between GPT-4 accessed via the web interface and via the API, suggesting that parameter settings matter. Fine-tuning GPT-3.5 did not consistently improve performance, and results varied with the prompt used during fine-tuning.
Discussion
The findings support the hypothesis that prompt engineering can significantly impact the accuracy and reliability of LLMs in answering medical questions. The superior performance of GPT-4-Web with ROT prompting highlights the potential of carefully designed prompts to improve LLM responses. The variation in reliability across models and prompts underscores the importance of considering these factors when using LLMs for medical applications. The ROT prompt's success may be attributed to its iterative nature, which encourages the LLM to reconsider previous steps and refine its answer. The study's limitations, including the reliance on guidelines as the gold standard and the lack of human evaluation, should be considered when interpreting the results. The significant amount of invalid data also necessitates further refinement of data handling procedures. Future research should focus on developing more robust prompts, considering model-specific optimizations, and incorporating human evaluation to enhance the reliability and clinical utility of LLMs.
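To make the iterative-reconsideration point concrete, the following sketch shows one way a reflection-style prompting loop could be structured. It is a minimal sketch assuming a generic chat-completion helper; `ask_llm` is hypothetical, and this is not the ROT prompt used in the study.

```python
def reflective_answer(ask_llm, question, rounds=2):
    """Ask once, then repeatedly prompt the model to re-examine and revise its answer."""
    answer = ask_llm(f"{question}\nGive your rating and reasoning.")
    for _ in range(rounds):
        answer = ask_llm(
            f"{question}\nYour previous answer was:\n{answer}\n"
            "Re-examine each reasoning step for errors, then give a revised final rating."
        )
    return answer
```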
Conclusion
This study demonstrates that prompt engineering significantly impacts the accuracy and reliability of LLMs in responding to medical questions. The ROT prompt, particularly when used with GPT-4-Web, showed the most promising results. However, the reliability of LLMs varies across models and prompts, highlighting the need for careful consideration of these factors in clinical applications. Future research should focus on optimizing prompts for specific models and clinical scenarios, incorporating human evaluation, and addressing the challenges of data validity to enhance the reliability and clinical utility of LLMs in healthcare.
Limitations
The study's limitations include the reliance on a single set of guidelines (the AAOS OA guidelines) and the lack of real-time interaction with healthcare professionals and patients. The evaluation was based on subjective scoring without human raters, potentially overlooking nuances in the quality of responses. The substantial proportion of invalid data in some datasets might have biased the results. Future research should include human evaluation, prospective validation, and a broader range of medical guidelines and conditions to address these limitations.