Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Medicine and Health


L. Wang, X. Chen, et al.

Discover how prompt engineering can enhance the reliability of Large Language Models in answering medical queries. Learn from a study conducted by Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li, highlighting the crucial role of effective prompting strategies in medical accuracy.

~3 min • Beginner • English
Introduction
Large language models (LLMs) have demonstrated strong capabilities across natural language processing tasks and are increasingly explored in medicine for diagnosis, examinations, and alignment with clinical guidelines. Despite this promise, performance in the medical domain remains imperfect: GPT-4 achieved 39% diagnostic concordance in complex cases and roughly 60% guideline consistency for digestive diseases; 18% of Med-PaLM answers contained inappropriate or incorrect content; and the self-consistency of LLM outputs is a known concern. Prompt engineering, particularly techniques such as chain-of-thought (COT) and tree-of-thoughts (TOT), can substantially affect LLM performance on reasoning-heavy tasks. Early clinical applications have tried COT, few-shot, and self-consistency prompting, but few studies compare prompt types on medical questions or assess prompt designs tailored to medicine. Given the high prevalence and clinical complexity of osteoarthritis (OA), and the reliance of patients and clinicians on online information, the authors used OA guideline questions to test whether prompt engineering improves accuracy and reliability. They hypothesized that different prompt types would yield different consistency and reliability, influenced by model architecture, parameters, training data, and fine-tuning.
Literature Review
Prior studies report mixed medical performance for LLMs: GPT-4 showed 39% concordance on complex diagnoses and roughly 60% guideline alignment in digestive diseases, and 18% of Med-PaLM responses contained inappropriate or incorrect content. Prompt engineering methods such as COT and TOT have improved reasoning in mathematical domains, suggesting potential benefits for clinical tasks. Limited clinical work directly compares prompts; for example, Karan et al. evaluated COT, few-shot, and self-consistency prompting with mixed results across datasets (self-consistency improved MedQA performance but reduced PubMedQA performance). Summaries aimed at medical professionals highlight prompt design principles, but few studies systematically assess prompt effects on medical guideline adherence or develop prompts tailored to medical questions. Parameterization (e.g., temperature), architectural differences, training data, and fine-tuning further modulate outcomes, underscoring the need to evaluate these factors alongside prompt design.
Methodology
Study design: Pilot evaluation of how prompt engineering affects LLM performance on osteoarthritis (OA) guideline adherence and answer reliability.

Guidelines and items: The AAOS evidence-based clinical practice guidelines for OA served as ground truth. AAOS recommendations carry one of four evidence levels: strong, moderate, limited, or consensus. The study evaluated 34 recommendation items (8 strong, 8 moderate, 16 limited, 2 consensus), and each question was posed 5 times per prompt-model combination.

Prompts: Four prompt types were designed. (1) Input-Output (IO): a direct instruction with a rating request (integer 1-4). (2) Zero-shot Chain-of-Thought (0-COT): adds "think step by step" reasoning to IO. (3) Performed Chain-of-Thought (P-COT): specifies a multi-step reasoning structure. (4) Reflection of Thoughts (ROT): multi-step reasoning with simulated multi-expert independent reasoning, followed by discussion and backtracking to reach consensus. All prompts instructed the LLMs to output a rating score. An illustrative sketch of these templates and the query loop appears after this section.

Models and settings: Nine LLM configurations were tested. GPT-4: gpt-4-Web (web UI, release notes July 20, 2023), gpt-4-API (gpt-4-0613, temperature=1), and gpt-4-API-0 (gpt-4-0613, temperature=0). Bard: web UI, Oct 30, 2023. GPT-3.5: gpt-3.5-Web (web UI, Oct 17, 2023), gpt-3.5-API (gpt-3.5-turbo-0613, default settings), gpt-3.5-API-0 (gpt-3.5-turbo-0613, temperature=0), gpt-3.5-ft (fine-tuned gpt-3.5-turbo-0613, temperature=1), and gpt-3.5-ft-0 (the same fine-tuned model, temperature=0). Fine-tuning used IO-style prompts with rationales derived from the AAOS recommendations; dataset details are provided in the supplementary materials, and a hypothetical example of the training-record format also appears below.

Outcomes: The primary outcome was consistency with the strength of the AAOS guideline recommendation (percentage of responses matching the guideline level), with subgroup analyses by evidence level (strong, moderate, limited, consensus). The secondary outcome was reliability, measured by Fleiss' kappa across the five repetitions per item.

Invalid data handling: Category A responses (no final rating) were treated as an independent classification in the kappa analyses; Category B responses (non-integer ratings) were recorded at their numeric value (e.g., 2.5). In scatter plots, Category A was flagged as missing. Invalid rates were high in certain settings (e.g., 81.1% missing for 0-COT with gpt-3.5-API-0), whereas gpt-4-Web produced few invalid entries (14/680 across prompts).

Analysis: Overall and per-level consistencies were computed for each prompt-model combination; multiple comparisons were conducted within models (e.g., gpt-4-Web), with Bonferroni corrections applied where appropriate. Reliability is reported as kappa with 95% CIs.
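To make the prompt designs and query settings concrete, here is a minimal sketch assuming the pre-1.0 openai Python client (current for the gpt-4-0613 and gpt-3.5-turbo-0613 snapshots used in the study). The prompt strings are paraphrases of the four designs, not the study's exact supplementary wording, and the helper functions ask and collect are our own; only the model names, temperatures, and five-repetition design follow the Methodology.

# Illustrative sketch: approximate prompt templates and repeated querying.
# Prompt wordings are paraphrases, not the study's exact supplementary text.
import openai  # assumes the pre-1.0 client (openai<1.0)

openai.api_key = "YOUR_API_KEY"

PROMPTS = {
    "IO":    ("Rate the evidence supporting the following statement, answering only "
              "with an integer from 1 to 4: {item}"),
    "0-COT": ("Think step by step, then rate the evidence supporting the following "
              "statement with an integer from 1 to 4: {item}"),
    "P-COT": ("Step 1: restate the claim. Step 2: recall relevant evidence. "
              "Step 3: weigh its quality. Step 4: give a final integer rating (1-4). "
              "Statement: {item}"),
    "ROT":   ("Simulate several independent experts who each reason step by step about "
              "the statement, then discuss, backtrack on disagreements, and report one "
              "consensus integer rating from 1 to 4. Statement: {item}"),
}
# Note: the mapping of the 1-4 scale onto AAOS evidence levels follows the paper's
# supplement and is not reproduced here.

# API-accessed configurations from the Methodology (web UIs were queried manually).
MODELS = {
    "gpt-4-API":     {"model": "gpt-4-0613", "temperature": 1},
    "gpt-4-API-0":   {"model": "gpt-4-0613", "temperature": 0},
    "gpt-3.5-API":   {"model": "gpt-3.5-turbo-0613"},               # default temperature
    "gpt-3.5-API-0": {"model": "gpt-3.5-turbo-0613", "temperature": 0},
}

def ask(model_cfg: dict, prompt: str) -> str:
    """Send one chat-completion request and return the raw answer text."""
    resp = openai.ChatCompletion.create(
        messages=[{"role": "user", "content": prompt}], **model_cfg
    )
    return resp["choices"][0]["message"]["content"]

def collect(items: list[str], n_repeats: int = 5) -> dict:
    """Ask every item n_repeats times for every prompt-model combination."""
    answers = {}
    for item in items:
        for p_name, template in PROMPTS.items():
            for m_name, cfg in MODELS.items():
                answers[(item, p_name, m_name)] = [
                    ask(cfg, template.format(item=item)) for _ in range(n_repeats)
                ]
    return answers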
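The fine-tuned gpt-3.5 variants were trained on IO-style prompts paired with rationales derived from the AAOS recommendations. Since the actual records are given only in the supplementary materials, the snippet below merely illustrates the JSONL chat format expected by the gpt-3.5-turbo fine-tuning endpoint, with placeholder content standing in for the real items and rationales.

# Hypothetical fine-tuning record builder: the real item wording and rationale text
# come from the study's supplementary materials and are not reproduced here.
import json

def make_record(io_prompt: str, rating: int, rationale: str) -> dict:
    """One chat-format training example for the gpt-3.5-turbo fine-tuning endpoint."""
    return {
        "messages": [
            {"role": "user", "content": io_prompt},
            {"role": "assistant", "content": f"{rationale} Final rating: {rating}"},
        ]
    }

examples = [
    make_record(
        io_prompt="Rate the evidence for: <AAOS recommendation text> (integer 1-4).",
        rating=4,
        rationale="<rationale derived from the AAOS recommendation>",
    ),
]

with open("oa_finetune.jsonl", "w", encoding="utf-8") as f:
    for rec in examples:
        f.write(json.dumps(rec) + "\n")

# The JSONL file would then be uploaded and a fine-tuning job created for
# gpt-3.5-turbo-0613 (e.g., via openai.File.create and openai.FineTuningJob.create
# in the pre-1.0 client), yielding the gpt-3.5-ft / gpt-3.5-ft-0 configurations.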
Key Findings
- Overall performance: gpt-4-Web outperformed the other models. Across its four prompts, consistencies ranged from ~50.6% to ~63%. The best overall combination was gpt-4-Web with ROT prompting, at 62.9% consistency.
- Top combinations (examples): gpt-4-Web-ROT (62.9%); gpt-4-Web-IO (55.9%); gpt-3.5-ft-0-IO (55.3%); gpt-4-Web-0-COT (≈52.9%); gpt-4-API-0-ROT (51.2%); gpt-4-Web-P-COT (50.6%); Bard-0-COT (44.1%); gpt-3.5-Web-P-COT (43.5%).
- Model families: Among GPT-4 variants, ROT was consistently the best-performing prompt (≈35.3%–63%). Among GPT-3.5 variants, IO was generally best (≈27.1%–55.3%), except for gpt-3.5-Web, where P-COT led (43.5%). Bard's best prompt was 0-COT (44.1%). Across prompts, GPT-4 models ranged 8.8%–62.9%, GPT-3.5 models 4.7%–55.3%, and Bard 19.4%–44.1%.
- Subgroup by evidence level: For gpt-4-Web on strong-level items, ROT (77.5%) and P-COT (75%) were significantly higher than IO (30%). For limited-level items in gpt-4-Web, P-COT (50%) was significantly lower than ROT (75%) and IO (82.5%) after Bonferroni correction. Moderate-level items in gpt-4-Web showed no significant differences among prompts (30%–40%). Consensus-level items (n=2) were reported descriptively, without statistical testing.
- Reliability (Fleiss' kappa): Across all prompts and models, kappa ranged from -0.002 to 0.984. IO prompting in gpt-3.5-API-0 and gpt-3.5-ft-0 achieved near-perfect reliability (0.984 and 0.982, respectively), indicating near-identical repeated answers. gpt-4-API-0 with P-COT showed substantial reliability (0.660). gpt-4-Web prompts had fair-to-moderate reliability (0.334–0.525). Many other settings were at or below moderate (<0.6), highlighting variability in repeated responses (a sketch of how consistency and kappa are computed follows this list).
- Invalid data: High rates occurred in some settings (e.g., 81.1% missing final ratings for 0-COT in gpt-3.5-API-0), while gpt-4-Web had relatively few invalid outputs (14/680 across prompts).
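As a minimal, self-contained illustration of the two outcome metrics, guideline consistency and Fleiss' kappa across five repetitions, the sketch below applies statsmodels' inter-rater tools to an invented 34 x 5 ratings matrix. The array shape follows the Methodology, but the data, noise level, and variable names are ours.

# Minimal sketch of the two outcome metrics on invented data:
# consistency with the AAOS level and Fleiss' kappa across five repetitions.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

n_items, n_repeats = 34, 5
guideline_level = rng.integers(1, 5, size=n_items)           # toy ground-truth rating per item
ratings = np.repeat(guideline_level[:, None], n_repeats, 1)  # start from perfect agreement
noise = rng.random((n_items, n_repeats)) < 0.2               # corrupt ~20% of answers (toy)
ratings[noise] = rng.integers(1, 5, size=noise.sum())

# Primary outcome: percentage of the 34*5 responses matching the guideline level.
consistency = (ratings == guideline_level[:, None]).mean() * 100
print(f"consistency with guideline: {consistency:.1f}%")

# Secondary outcome: Fleiss' kappa, treating the five repetitions as five "raters".
counts, _ = aggregate_raters(ratings)   # items x categories count table
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa: {kappa:.3f}")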
Discussion
The study shows that prompt engineering materially affects both LLM adherence to clinical guidelines and the repeatability of responses. GPT-4 accessed via the web interface generally outperformed API-accessed GPT-4 and the other models, suggesting that deployment environment and internal parameters (e.g., temperature) influence outcomes. Differences between gpt-4-Web, gpt-4-API (temperature=1), and gpt-4-API-0 (temperature=0) underscore the role of parameter tuning in clinical question answering. Fine-tuning improved performance only under certain prompting conditions (e.g., IO for gpt-3.5-ft-0) and degraded it under others, indicating that alignment between the structure of the fine-tuning data and the inference-time prompt matters.

ROT prompting, a structured, reflective, multi-expert simulation, performed best on gpt-4-Web overall and especially on strong-evidence items, likely because it encourages error checking and consensus formation. However, ROT was not universally optimal across models (GPT-3.5 variants favored IO and Bard favored 0-COT), highlighting model-specific prompt effectiveness. Reliability analyses showed that repeated answers can vary substantially; near-deterministic behavior occurred mainly with IO prompting at temperature=0 for GPT-3.5. This supports the practical recommendations to ask LLMs the same question multiple times to capture variability and to use low-temperature settings when deterministic behavior is needed. Collectively, the results indicate that maximizing LLM clinical performance requires integrating prompt engineering with appropriate model selection, parameterization, and fine-tuning strategies. The work also emphasizes the need to refine prompts (e.g., ROT) for specific clinical tasks and to develop robust evaluation frameworks.
Conclusion
This study systematically evaluated how different prompting strategies and model configurations affect LLM consistency with AAOS OA guideline recommendations and the reliability of repeated responses. gpt-4-Web with ROT prompting achieved the highest overall consistency (62.9%) and excelled on strong-evidence items (77.5%). Prompt effectiveness was model-dependent, and reliability varied widely, with only certain temperature=0 GPT-3.5 setups showing near-perfect repeatability. The findings suggest that carefully designed prompts can improve the quality of medical answers but should be paired with suitable model settings and fine-tuning. Future work should optimize prompts such as ROT for specific clinical scenarios, establish objective benchmarks and human-in-the-loop evaluation frameworks, and develop guidance for patients and clinicians on robust, model-appropriate prompting in medical contexts.
Limitations
- No real-time interaction with, or validation by, healthcare professionals or patients; evaluation relied on guideline-based expected answers without prospective clinical validation.
- Potential prompt-robustness issues: minor wording changes within the same framework may alter outputs, and effectiveness is model-dependent.
- Limited generalizability beyond OA and the AAOS guidelines; the consensus-level subgroup was small (n=2) and received no inferential testing.
- High invalid-output rates in certain model-prompt combinations (e.g., 0-COT with gpt-3.5-API-0), which may bias reliability and consistency estimates.
- The fine-tuning dataset, designed around IO prompting, may have favored IO at inference; fine-tuning may not ensure correct reasoning chains, leading to plausible but incorrect rationales.
- Differences between web and API deployments, and unspecified internal settings, could confound comparisons across models.
- The study did not construct or apply a standardized external benchmark or human expert rating framework beyond the guideline-concordance metric.