A framework for human evaluation of large language models in healthcare derived from literature review



T. Y. C. Tam, S. Sivarajkumar, et al.

This study by Thomas Yu Chow Tam, Sonish Sivarajkumar, and colleagues delves into the critical evaluation of Large Language Models in healthcare. It highlights the need for robust human evaluation methodologies and proposes the innovative QUEST framework to bridge existing gaps, ensuring safer and more effective AI applications in health.

Introduction
The paper addresses the need for robust, human-centered evaluation of large language models (LLMs) in healthcare, where traditional automatic metrics (e.g., BLEU, ROUGE, accuracy, F1, AUROC) fail to capture generative quality, clinical utility, and safety. As LLMs are increasingly used for clinical decision support, patient and medical education, and patient-provider question answering, the study highlights risks posed by reliance on automated metrics or LLM-based evaluators, and the absence of healthcare-specific human evaluation standards. The authors set objectives to systematically review human evaluation methodologies in healthcare LLM research, characterize evaluation dimensions and practices, and synthesize these insights into actionable guidelines culminating in the QUEST framework for standardized, rigorous human evaluation.
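To make the limitation of surface-overlap metrics concrete, here is a minimal sketch (not from the paper; the example answers are invented) comparing two candidate answers against a reference with NLTK's sentence-level BLEU. A clinically wrong answer that shares more surface wording can outscore a safe paraphrase, which is exactly the failure mode human evaluation is meant to catch.

```python
# Illustrative only: shows why n-gram overlap metrics such as BLEU can
# miss clinical correctness. Example texts are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "do not give aspirin to children with viral illness".split()
safe_answer = "aspirin should be avoided in children with viral illness".split()
unsafe_answer = "do give aspirin to children with viral illness".split()  # clinically wrong

smooth = SmoothingFunction().method1
for label, hypothesis in [("safe paraphrase", safe_answer), ("unsafe answer", unsafe_answer)]:
    score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
    print(f"{label}: BLEU = {score:.3f}")

# The unsafe answer overlaps with the reference in nearly every n-gram and
# scores higher, even though it reverses the clinical recommendation.
```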
Literature Review
Prior work has largely emphasized automated evaluation with limited and inconsistent human evaluations. Reviews by Wei et al. (60 ChatGPT medical QA articles) and Park et al. (55 medical LLM articles; 36 with human evaluation) reported high-level summaries but lacked systematic synthesis of dimensions and metrics. Yuan et al. surveyed LLMs as healthcare assistants with limited coverage of expert evaluation. Awasthi et al. proposed a human evaluation approach not specific to healthcare. Existing AI reporting guidelines (CLAIM, STARD-AI, CONSORT-AI, MI-CLAIM) ensure transparency for AI studies but do not cover standards for human evaluation of LLMs in healthcare. This gap underscores the need for a specialized, standardized framework.
Methodology
Scoping review adhering to the PRISMA-ScR guidelines. Data source: PubMed. Timeframe: January 1, 2018 to February 22, 2024, covering the emergence of modern LLMs (e.g., GPT-1 through GPT-4, LLaMA-2). Inclusion criteria: peer-reviewed English-language journal and conference publications focused on healthcare applications of LLMs that included human evaluation with an explicit methodology. Exclusions: studies of non-generative pretrained language models (e.g., BERT, RoBERTa), multimodal (image-to-text or text-to-image) generative studies, and article types without experiments (comments, preprints, editorials, letters, reviews, systematic reviews, news). The search strategy combined terms for generative LLMs, human evaluation, and healthcare, refined iteratively; certain ambiguous model names (e.g., Llama, Bard, Gemini, PaLM) were excluded because they produced many false positives. Screening occurred in two stages: title/abstract screening followed by full-text review emphasizing methodological detail on human evaluation. Flow: initial query combinations yielded 1191 results; applying language, year, and article-type filters left 688; after the two screening stages and the exclusion of 42 articles that could not be accessed, 142 articles remained for the final review. From each article, the review extracted application areas, medical specialties, evaluator recruitment and blinding, sample sizes, evaluation dimensions and tools (e.g., Likert scales, specialized frameworks), and statistical analyses for inter-rater reliability and performance comparisons.
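As a rough illustration of the retrieval step only, the sketch below runs a date-restricted PubMed query through Biopython's Entrez E-utilities wrapper. The query string, email, and retmax value are hypothetical stand-ins and do not reproduce the authors' actual search strategy.

```python
# Hypothetical sketch of a date-restricted PubMed search; the query string
# below is illustrative, not the authors' actual search strategy.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact email

query = (
    '("large language model" OR "generative pre-trained transformer" OR ChatGPT) '
    'AND ("human evaluation" OR "expert evaluation") '
    'AND (healthcare OR clinical OR medical)'
)

handle = Entrez.esearch(
    db="pubmed",
    term=query,
    datetype="pdat",
    mindate="2018/01/01",
    maxdate="2024/02/22",
    retmax=2000,
)
record = Entrez.read(handle)
handle.close()

print(f"Hits: {record['Count']}")
print(record["IdList"][:10])  # first few PMIDs for title/abstract screening
```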
Key Findings
- Application domains: Clinical Decision Support (CDS) 28.1%; medical education and examinations 24.8%; patient education 19.6%; patient-provider question answering 15.0%; all other domains each below 11.8%. Studies span bench to bedside, with illustrative CDS examples in differential diagnosis and management suggestions.
- Specialties: Radiology led (n=12), followed by Urology (n=9), General Surgery (n=8), and Plastic Surgery, Otolaryngology, Ophthalmology, and Orthopedic Surgery (n=7 each); Psychiatry (n=6); all others n<5, showing wide but uneven adoption.
- Sample sizes and variability: Most studies evaluated ≤100 LLM outputs; one outlier (Moramarco et al.) evaluated 2995 sentences via Amazon Mechanical Turk, collecting seven ratings per sentence to address annotator variability. Some studies introduced patient-specific variability (e.g., clinical notes or vignettes) to assess subgroup performance.
- Evaluators: Recruitment was task-dependent. CDS studies, despite higher clinical risk, had the lowest median number of evaluators and the second-lowest median sample counts, likely reflecting the difficulty of recruiting qualified experts; patient-facing applications tended to include more evaluators and samples. An inverse relationship was observed between sample size and number of evaluators (Figure 5).
- Blinding: Only 29% (41/142) reported blinded assessments; 14% (20/142) were unblinded; 56% (80/142) did not report blinding, indicating inconsistent methodology reporting and potential bias.
- Evaluation tools: Likert scales predominated (from binary to 5- or 7-point), assessing qualities such as accuracy, empathy, clarity, and harmfulness, alongside error categorization (logical, informational, statistical). Specialized instruments included SERVQUAL, PEMAT-P, SOLO, Wang & Strong, METRICS, CLEAR, and DISCERN; none comprehensively covered all dimensions needed for healthcare LLMs.
- Comparative analyses: Many studies compared LLM outputs to human experts, guidelines, or other LLMs; 20% (n=29) compared multiple LLMs. Frequently used statistical tests included t-tests (17 studies), Mann-Whitney U (11), chi-square (11), ANOVA (8), Shapiro-Wilk (6), Fisher's exact (5), Kruskal-Wallis (5), Cohen's kappa (5), Wilcoxon signed-rank (3), and intraclass correlation (3).
- Proposed framework: The authors synthesize 17 evaluation dimensions into five QUEST principles: Quality of Information (accuracy, relevance, currency, comprehensiveness, consistency, agreement, usefulness), Understanding and Reasoning (understanding, reasoning), Expression Style and Persona (clarity, empathy), Safety and Harm (bias, harm, self-awareness; fabrication, falsification, plagiarism), and Trust and Confidence (trust, satisfaction). They propose a three-phase QUEST workflow covering planning, implementation and adjudication (including cyclical consensus and inter-rater reliability checks), and scoring and review, with example emergency department case studies; a sketch of how these dimensions could be encoded as a rating checklist follows this list.
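The sketch below is an illustration, not the paper's instrument: it encodes the QUEST principles and their dimensions as a simple 1-5 Likert rating checklist, one of the scale types the review reports. Class and field names are hypothetical.

```python
# Illustrative encoding of the QUEST principles as a Likert-style checklist.
# Dimension names follow the review; the data structure itself is hypothetical.
from dataclasses import dataclass, field

QUEST_DIMENSIONS = {
    "Quality of Information": ["accuracy", "relevance", "currency",
                               "comprehensiveness", "consistency",
                               "agreement", "usefulness"],
    "Understanding and Reasoning": ["understanding", "reasoning"],
    "Expression Style and Persona": ["clarity", "empathy"],
    "Safety and Harm": ["bias", "harm", "self-awareness",
                        "fabrication, falsification, plagiarism"],
    "Trust and Confidence": ["trust", "satisfaction"],
}

@dataclass
class Rating:
    """One evaluator's 1-5 Likert scores for a single LLM output."""
    evaluator_id: str
    output_id: str
    scores: dict = field(default_factory=dict)  # dimension -> 1..5

    def add(self, dimension: str, score: int) -> None:
        if not any(dimension in dims for dims in QUEST_DIMENSIONS.values()):
            raise ValueError(f"Unknown QUEST dimension: {dimension}")
        if not 1 <= score <= 5:
            raise ValueError("Likert score must be between 1 and 5")
        self.scores[dimension] = score

# Example: a clinician rates one model answer on two dimensions.
r = Rating(evaluator_id="MD_01", output_id="case_007")
r.add("accuracy", 4)
r.add("harm", 5)
print(r.scores)
```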
Discussion
The review reveals significant gaps in current human evaluation of healthcare LLMs: small and variable sample sizes, limited evaluator recruitment (especially for high-risk CDS tasks), inconsistent use and reporting of blinding, mixed evaluation tools, and heterogeneous statistical practices. These deficiencies undermine reliability, generalizability, and clinical applicability, potentially affecting patient safety. By systematizing evaluation dimensions into the QUEST principles and outlining a practical workflow that includes evaluator training, standardized checklists, cyclical adjudication for consensus (e.g., targeting Cohen’s kappa ≥0.7), and rigorous statistical analysis, the framework directly addresses the shortcomings. The approach encourages alignment with clinical context, stakeholder involvement, and task-specific criteria, and recommends comparing human assessment with automatic metrics for holistic evaluation. Adoption of QUEST can foster more transparent, reliable, and ethically grounded deployment of LLMs in healthcare.
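The adjudication step's reliability target can be checked with standard tooling. The minimal sketch below uses invented ratings, not study data: it computes Cohen's kappa between two evaluators with scikit-learn and flags whether the illustrative 0.7 threshold mentioned above is met.

```python
# Minimal sketch: inter-rater reliability check for two evaluators' labels.
# Ratings are invented for illustration; the 0.7 threshold follows the example above.
from sklearn.metrics import cohen_kappa_score

# Binary "acceptable / not acceptable" judgments on the same 10 LLM outputs.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")

if kappa < 0.7:
    print("Agreement below target -> run another adjudication round on disagreements")
else:
    print("Agreement target met -> proceed to scoring and review")
```

For ordinal Likert ratings, a weighted kappa (e.g., cohen_kappa_score with weights="quadratic") or the intraclass correlation coefficient noted in the review may be a better fit than the unweighted statistic shown here.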
Conclusion
This study provides a comprehensive scoping review of 142 studies on human evaluation of LLMs in healthcare, identifies systemic methodological gaps, and introduces the QUEST Human Evaluation Framework. QUEST consolidates 17 dimensions into five principles and operationalizes them across three phases—Planning; Implementation and Adjudication; and Scoring and Review—with actionable guidance on evaluator selection and training, checklist design, consensus processes, and statistical testing. Case studies in emergency medicine illustrate adaptability to clinical use cases. The framework aims to standardize and elevate human evaluation practices to improve safety, reliability, and effectiveness of healthcare LLMs. Future work should refine QUEST across specialties and institutions, integrate complementary automatic metrics, and develop tools to facilitate broader adoption and reporting standardization.
Limitations
Review limitations: studies published after February 22, 2024 may have been missed; only English-language publications were included; the search terms and databases may not be exhaustive; excluding non-healthcare domains may omit relevant methods; and 42 articles were excluded due to access issues. Framework limitations: implementation may vary by specialty and institution owing to policy, infrastructure, resource, and personnel constraints; the QUEST dimensions may not capture every use-case nuance and should be adapted as needed; the framework addresses human evaluation and does not replace automatic quantitative evaluation; and broader challenges, including the opacity of proprietary models and limited computational resources in healthcare settings, constrain thorough study and validation.