Abstract
This study explores the impact of prompt engineering on the consistency and reliability of Large Language Models (LLMs) in answering medical questions. Several prompt styles were used to query different LLMs about their agreement with recommendations from osteoarthritis (OA) guidelines. GPT-4-Web with ROT prompting showed the highest overall consistency (62.9%), particularly for strong recommendations (77.5%). Reliability varied significantly across models and prompts (Fleiss' kappa ranged from -0.002 to 0.984). The study suggests that appropriate prompt engineering can improve the accuracy of LLM responses to medical questions.
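The reliability figures reported above are Fleiss' kappa values computed over repeated queries of the same model. As a minimal sketch (not taken from the paper), assuming each repeated answer is coded into a categorical agreement level and that statsmodels is available, the statistic can be computed as follows; the response data here is invented for illustration.

```python
# Hypothetical sketch: Fleiss' kappa over repeated LLM answers to guideline items.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = guideline recommendations (items), columns = repeated queries ("raters").
# Each cell is the model's coded answer: 0 = disagree, 1 = neutral, 2 = agree.
responses = np.array([
    [2, 2, 2, 2, 2],   # consistent agreement with a strong recommendation
    [2, 1, 2, 2, 2],
    [0, 1, 0, 2, 1],   # unstable answers across repeats lower kappa
    [2, 2, 2, 2, 2],
])

# Convert raw ratings into an item-by-category count table, then compute kappa.
table, _ = aggregate_raters(responses)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```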
Publisher
npj Digital Medicine
Published On
Jan 01, 2024
Authors
Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, Jian Li
Tags
prompt engineering
Large Language Models
osteoarthritis
consistency
reliability
medical questions
accuracy