Introduction
Generative AI, particularly large language models (LLMs) such as GPT-4 and LLaMA, is rapidly gaining popularity in healthcare because of its ability to generate coherent responses. LLMs hold potential for revolutionizing healthcare data interpretation, processing, and application across a range of tasks, including medical question answering, clinical decision support, patient monitoring, risk assessment, and health education. However, current quantitative evaluation metrics are insufficient for assessing the generative nature and clinical utility of LLMs; human evaluation is essential for ensuring reliability, accuracy, safety, and ethical use in healthcare. The existing literature lacks a standardized framework for this crucial human evaluation process, leading to inconsistent and unreliable assessments. This study addresses the gap by conducting a systematic review that identifies and analyzes human evaluation approaches for LLMs across diverse medical domains and tasks, synthesizes best practices for designing and conducting rigorous, reliable, valid, and ethical human evaluations, and provides actionable guidelines for developing standardized human evaluation frameworks for healthcare uses of LLMs. The study aims to establish a foundation for catalyzing further research into human evaluation of LLMs in healthcare.
Literature Review
The study reviewed 142 studies of human evaluation methods for LLMs in healthcare. Previous reviews focused on automated metrics or lacked systematic summaries of evaluation dimensions and metrics, and while some prior works proposed human evaluation methods, none were designed specifically for healthcare. Existing reporting guidelines for AI in healthcare (CLAIM, STARD-AI, CONSORT-AI, MI-CLAIM) likewise do not specifically address human evaluation of LLMs. The current study aims to fill this critical gap.
Methodology
This scoping review adhered to the PRISMA-ScR guidelines. The literature search covered publications from January 1, 2018, to February 22, 2024, focusing on peer-reviewed journal articles and conference proceedings published in English and indexed in PubMed. The search combined terms related to "Generative Large Language Models," "Human Evaluation," and "Healthcare." Inclusion criteria targeted peer-reviewed English-language articles discussing human evaluation of LLMs in healthcare with explicit details on the evaluation methodology. Exclusion criteria comprised non-English articles, irrelevant article types (e.g., comments, editorials, preprints), and articles not directly focused on LLMs. A two-stage screening process (title/abstract followed by full text) yielded 142 articles for review. The study examined multiple aspects of human evaluation methodology, including evaluation dimensions, sample types and sizes, evaluator selection and recruitment, frameworks and metrics, the evaluation process, and statistical analysis. The authors categorized evaluation methods into 17 dimensions grouped under five principles (QUEST): Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence. The study also analyzed the use of checklists, evaluation samples (size and variability), evaluator selection, blinding techniques, statistical analysis, and specialized evaluation frameworks.
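To make the QUEST structure concrete, the following is a minimal sketch of how a single evaluation record could be organized around the five principles. The dataclass fields, the 1-5 Likert scale, and the example identifiers are illustrative assumptions, not the instrument used in the reviewed studies, and the paper's 17 underlying dimensions are not reproduced here.

```python
# A minimal sketch (not the authors' instrument) of an evaluation record
# organized around the five QUEST principles. The Likert scale and IDs are
# illustrative assumptions.
from dataclasses import dataclass, field

QUEST_PRINCIPLES = [
    "Quality of Information",
    "Understanding and Reasoning",
    "Expression Style and Persona",
    "Safety and Harm",
    "Trust and Confidence",
]

@dataclass
class EvaluationRecord:
    sample_id: str                                # identifier of the LLM output being rated
    evaluator_id: str                             # identifier of the (blinded) human evaluator
    ratings: dict = field(default_factory=dict)   # principle -> 1-5 Likert rating

    def is_complete(self) -> bool:
        """True once every QUEST principle has been rated."""
        return all(p in self.ratings for p in QUEST_PRINCIPLES)

# Example: one evaluator rating one model response
record = EvaluationRecord(sample_id="case_017", evaluator_id="rater_A")
for principle in QUEST_PRINCIPLES:
    record.ratings[principle] = 4                 # placeholder score from a rating form
print(record.is_complete())                       # -> True
```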
Key Findings
The review revealed a wide range of healthcare applications for LLMs, with clinical decision support (CDS) being the most prevalent (28.1%), followed by medical education and examination (24.8%), patient education (19.6%), and medical question answering (15%). Radiology was the leading medical specialty using LLMs (n=12), followed by Urology (n=9), General Surgery (n=8), and others. Evaluation methodologies varied widely and often blended quantitative and qualitative measures. The review found a lack of standardization in evaluation design, evaluation sample sizes (mostly ≤100 samples), and evaluator selection, and many studies did not report blinding procedures. A variety of statistical methods were used (t-tests, Mann-Whitney U tests, Chi-Square tests, ANOVA, etc.), typically chosen according to the data distribution and the evaluation objective. The review also identified several specialized frameworks (SERVQUAL, PEMAT-P, SOLO) used in the reviewed evaluations but emphasized their limited scope compared with the comprehensive QUEST framework proposed in this study. A significant inverse relationship was found between evaluation sample size and the number of human evaluators, highlighting the challenge of recruiting enough qualified evaluators to assess large numbers of LLM outputs. The review underscores the importance of considering both controlled and real-world scenarios when evaluating LLM performance. Blinded assessments were infrequent and inconsistently reported, underscoring the need for standardized blinding procedures.
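The statistical comparisons mentioned above typically contrast blinded evaluator ratings of LLM outputs against a comparator such as clinician-written answers. The sketch below, using fabricated placeholder data rather than results from any reviewed study, illustrates how a non-parametric Mann-Whitney U test and a Chi-Square test of independence could be applied to ordinal ratings and categorical safety judgments, respectively.

```python
# A hedged illustration of the kinds of tests the reviewed studies report.
# All rating data below are fabricated placeholders, not results from the review.
from scipy import stats

# Likert ratings (1-5) of LLM vs. physician answers from blinded evaluators
llm_ratings       = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]
physician_ratings = [5, 4, 4, 5, 3, 4, 5, 5, 4, 4]

# Ordinal, not necessarily normal -> non-parametric Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(llm_ratings, physician_ratings,
                                     alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")

# Categorical outcomes (e.g., counts of "safe" vs. "potentially harmful"
# responses per source) -> Chi-Square test of independence
contingency = [[46, 4],    # LLM:       safe, potentially harmful
               [49, 1]]    # physician: safe, potentially harmful
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```

Which test is appropriate depends on the data distribution and the evaluation objective, which is precisely the source of the methodological heterogeneity the review documents.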
Discussion
The findings highlight the critical need for standardized human evaluation frameworks for LLMs in healthcare to ensure reliability, validity, and ethical soundness. The lack of standardization observed in the reviewed studies raises concerns about the generalizability and reproducibility of their findings. The QUEST framework addresses these limitations by providing a structured approach that guides the evaluation process through planning, implementation, adjudication, scoring, and review phases. Its five principles provide a comprehensive set of dimensions for assessing LLM performance. The proposed framework is designed to be adaptable and implementable across healthcare settings and specialties, but its effectiveness depends on accounting for variations in resources and expertise among institutions. Integrating human and automated evaluation methods is crucial for a balanced and comprehensive assessment of LLMs. The study's limitations include potential bias from the restriction to English-language publications and from the search strategy, as well as the rapidly evolving nature of the field, which may have led to the omission of recent publications. Future work should focus on validating and refining the QUEST framework and investigating the generalizability of these findings to other populations and healthcare settings.
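As one illustration of the adjudication and scoring phases mentioned above, the following hedged sketch assumes a two-rater setup: inter-rater agreement is quantified with a weighted Cohen's kappa, and items where the two ratings diverge by more than one point are flagged for a third adjudicator. The data, disagreement threshold, and choice of quadratic weighting are assumptions for demonstration only, not prescriptions from the QUEST framework.

```python
# A minimal sketch, assuming a two-rater setup, of the adjudication/scoring idea:
# measure inter-rater agreement and flag divergent items for a third adjudicator.
from sklearn.metrics import cohen_kappa_score

rater_1 = [4, 5, 3, 2, 4, 5, 3, 4]   # Likert scores from evaluator 1
rater_2 = [4, 4, 3, 4, 4, 5, 2, 4]   # Likert scores from evaluator 2

# Weighted kappa is a common choice for ordinal Likert data
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted Cohen's kappa = {kappa:.2f}")

# Flag items with a rating gap above 1 point for review by a senior adjudicator
needs_adjudication = [i for i, (a, b) in enumerate(zip(rater_1, rater_2))
                      if abs(a - b) > 1]
print("Items needing adjudication:", needs_adjudication)
```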
Conclusion
This study reveals significant heterogeneity in existing human evaluation methods for LLMs in healthcare. The proposed QUEST framework provides a structured and comprehensive approach to address these limitations, promoting consistency and reliability in evaluating LLMs for clinical applications. Future research should focus on validating the framework's effectiveness and addressing its potential limitations regarding its broad applicability across diverse healthcare settings and specialties.
Limitations
This review may have missed articles published after February 22, 2024, and is limited to English-language articles. The search strings and databases used might not have been fully comprehensive. The review did not consider articles using LLMs outside healthcare, which could provide valuable insights. The QUEST framework, while comprehensive, may not fully capture the nuances of every use case, and its implementation may vary across different clinical settings. The framework focuses primarily on human evaluation and does not explicitly address the integration of automatic evaluation methods.