Medicine and Health, npj Digital Medicine

Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

L. Wang, X. Chen, et al.

Discover how prompt engineering can enhance the reliability of large language models in answering medical queries. Learn from a study conducted by Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li, highlighting the crucial role of effective prompting strategies in medical accuracy.

Abstract

The use of large language models (LLMs) in clinical medicine is growing rapidly. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial, and prompt engineering has shown potential as an effective method for doing so. To explore the application of prompt engineering in LLMs and to examine their reliability, different styles of prompts were designed and used to ask different LLMs whether they agreed with the American Academy of Orthopaedic Surgeons (AAOS) evidence-based guidelines on osteoarthritis (OA). Each question was asked 5 times. We compared the consistency of the responses with the guidelines across evidence levels for each prompt, and assessed the reliability of each prompt from the agreement among the 5 repeated answers to the same question. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and performed notably well on strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs under different prompts was not stable (Fleiss kappa ranged from -0.002 to 0.984). This study revealed that different prompts had variable effects across models, and that gpt-4-Web with ROT prompting was the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
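The two measures named in the abstract can be illustrated with a small sketch. It is not the authors' code. It assumes that each question yields 5 categorical answers (one per repeated ask), that "consistency" is the fraction of all answers matching the guideline's label, and that reliability is Fleiss' kappa with the 5 repeats treated as raters. The function names, label strings, and toy data are hypothetical.

```python
from collections import Counter


def consistency_rate(answers, guideline):
    """Fraction of all repeated answers that match the guideline label.

    answers:   list of lists, one inner list of labels per question
    guideline: list of reference labels, one per question
    (Assumed definition; the paper may aggregate differently.)
    """
    total = sum(len(row) for row in answers)
    hits = sum(a == g for row, g in zip(answers, guideline) for a in row)
    return hits / total


def fleiss_kappa(answers, categories):
    """Fleiss' kappa, treating each repeated ask as one rater.

    Every question must have the same number of answers (at least 2).
    Returns NaN when all answers fall in a single category, because
    chance agreement is then 1 and kappa is undefined.
    """
    n_items = len(answers)
    n_raters = len(answers[0])
    # counts[i][j] = how many of the repeats for question i chose category j
    counts = [[Counter(row)[c] for c in categories] for row in answers]
    # Observed agreement per question, then averaged over questions
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall share of each category
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (hypothetical labels, not taken from the study)
toy_answers = [
    ["strong"] * 5,
    ["limited"] * 4 + ["moderate"],
]
toy_guideline = ["strong", "limited"]
toy_categories = ["strong", "moderate", "limited"]

print(consistency_rate(toy_answers, toy_guideline))
print(fleiss_kappa(toy_answers, toy_categories))
```

A kappa near 1 means the model gave nearly the same answer on every repeat, while a value near 0 (or slightly below) means the repeats agreed no more than chance would predict, which is the spread the abstract reports.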

Publisher

npj Digital Medicine

Published On

Jan 01, 2024

Authors

Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, Jian Li

DOI

https://doi.org/10.1038/s41746-024-01029-4
