Large Language Models and Empathy: Systematic Review

Medicine and Health


V. Sorin, D. Brin, et al.

This review synthesizes 12 studies on whether large language models can demonstrate empathy. It reports that ChatGPT-3.5 and other LLMs can recognize emotions and provide emotional support across medical and general contexts, sometimes outperforming humans, while noting limitations such as repetitive phrasing, prompt sensitivity, and variable evaluation methods.
Introduction

The study investigates whether large language models can demonstrate empathy, a core human capability crucial for effective health care communication and outcomes. Empathy encompasses cognitive components (understanding another's feelings) and affective components (experiencing emotions in response to others). In medicine, empathy improves patient satisfaction and adherence and fosters trust. As LLMs are increasingly integrated into health care workflows and educational tools, concerns persist about their errors, biases, and security risks, and they have been suggested to lack empathy. Recent work has evaluated LLMs on emotional intelligence, theory of mind, and empathy, with early evidence suggesting elements of cognitive empathy such as emotion recognition and supportive responses. This review aims to systematically assess the literature on LLMs' capacity to demonstrate empathy.

Literature Review

Prior literature indicates that LLMs excel at tasks such as summarization, question answering, and text generation, and that they are being integrated into clinical settings (e.g., electronic health record applications). Studies evaluating emotional intelligence and theory of mind report mixed performance across tasks and models, but suggest LLMs may display aspects of cognitive empathy, including emotion recognition and supportive communication. Early clinical applications and commentaries highlight both opportunities and concerns, including model bias, susceptibility to adversarial attacks, and challenges in evaluating soft skills. Commercial LLM-based tools are emerging to provide emotional support. Collectively, the literature motivates a focused synthesis of empathy-related capabilities, differentiating cognitive from affective empathy and considering implications for health care communication.

Methodology

A systematic search was conducted across MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv for studies published between December 2022 and February 2024, using the query: ((("large language models") OR (llms) OR (gpt) OR (chatgpt)) AND ((empathy) OR ("emotional awareness") OR ("emotional intelligence") OR (emotion)) OR (("social robots") OR ("artificial emotional intelligence") OR ("emotional artificial intelligence") OR ("emotional chatbots") OR ("affective computing") OR (HRI) OR ("Human robot interaction"))). Reference lists of relevant papers were also searched. Inclusion criteria were English-language, full-length original articles evaluating empathy in LLM outputs; nonoriginal articles and studies addressing only emotion recognition or theory of mind without an explicit empathy evaluation were excluded, as were algorithms other than LLMs. Two reviewers independently screened titles, abstracts, and full texts, resolving disagreements by discussion. PRISMA guidelines were followed, yielding 12 included studies, all published in 2023. Extracted data covered the models used, empathy task performance, evaluation methods (human ratings and automatic metrics such as ROUGE and BLEU, sketched below), blinding of reviewers, sample sizes, and reported limitations, summarized in tables and a narrative synthesis.
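To illustrate the automatic metrics mentioned above, the following is a minimal scoring sketch, not any included study's actual pipeline. It assumes Python with the rouge-score and nltk packages; the reference and candidate texts are invented examples.

# Minimal sketch of the automatic metrics (ROUGE, BLEU) named in the review.
# Assumes the rouge-score and nltk packages; texts below are invented examples.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "I'm sorry you are going through this; it sounds very difficult."
candidate = "I'm sorry to hear that. It sounds like a really difficult time for you."

# ROUGE measures unigram (rouge1) and longest-common-subsequence (rougeL) overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(name, round(score.fmeasure, 3))

# BLEU measures n-gram precision; smoothing prevents zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)
print("bleu", round(bleu, 3))

As the Discussion notes, such overlap metrics cannot capture perceived empathy directly, which is why the included studies paired them with human ratings.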

Key Findings

• Twelve studies (all published in 2023) met inclusion criteria; ChatGPT-3.5 was evaluated in all of them, with six studies comparing it to other LLMs (GPT-4, LLaMA variants, and fine-tuned chatbots). Seven studies focused on medical contexts.
• LLMs exhibited elements of empathy, particularly cognitive empathy: emotion recognition, context-specific supportive responses, and empathetic dialogue generation.
• Comparative performance: in Ayers et al., ChatGPT responses to 195 patient social media questions were preferred over physicians' responses in 78.6% of cases and rated significantly higher in quality and empathy; physician responses were 41% less empathetic.
• Fine-tuned and alternative models: SoulChat achieved mean empathy scores of 1.84–1.90 (on a 0–2 scale) versus 1.62–1.65 for ChatGPT; ChatGPT-powered psychiatric chatbots received mean human ratings of up to 3.43/4 for empathetic responses.
• Emotional awareness: ChatGPT-3.5 scored significantly higher than human population norms on the Levels of Emotional Awareness Scale (LEAS) across all scales, with performance improving over a one-month interval toward near-maximal LEAS scores.
• Soft-skills examination: GPT-4 correctly answered 90% of USMLE soft-skill questions, outperforming ChatGPT and past human performance.
• Compared with supervised models for emotion recognition, the supervised models surpassed ChatGPT, but ChatGPT produced longer, more diverse, and context-specific responses; human preference varied by comparator (e.g., ChatGPT was preferred over EmpSOA in 54.33% of cases but over MISC in only 16%).
• In primary care patient messaging, GPT-4 and ChatGPT-3.5 outperformed a fine-tuned LLaMA-65B model on empathy; all LLMs significantly outperformed physician-generated responses.
• Prompting and enhancements (Chain-of-Empathy prompting, prompt engineering, external emotion classifiers, and in-context learning with interactive generation and knowledge bases) improved empathetic expression and emotion-labeling accuracy; a minimal prompt sketch follows this list.
• Evaluation methods included human subjective ratings, correctness on standardized questions, and automatic metrics (ROUGE, BLEU). Blinding of reviewers varied across studies.
• Reported limitations across models included repetitive empathetic phrasing, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and the subjectivity of evaluations, which was influenced by evaluator background.
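As an illustration of the prompting enhancements above, here is a minimal Chain-of-Empathy-style sketch. The prompt wording and patient message are hypothetical, not any included study's exact setup; it assumes the openai Python client with an API key in the environment.

# Minimal sketch of Chain-of-Empathy-style prompting as described in the findings.
# Hypothetical prompt wording; assumes the openai package and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

# The prompt asks the model to reason about the emotion and its cause
# before responding, the core idea behind Chain-of-Empathy prompting.
COE_SYSTEM_PROMPT = (
    "Before replying, reason step by step: "
    "(1) identify the emotion the writer expresses, "
    "(2) infer its likely cause, "
    "(3) decide how to acknowledge that emotion, "
    "then write a brief, supportive reply."
)

patient_message = "I just got my biopsy results and I can't stop worrying."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # ChatGPT-3.5, the model evaluated in all included studies
    messages=[
        {"role": "system", "content": COE_SYSTEM_PROMPT},
        {"role": "user", "content": patient_message},
    ],
)
print(response.choices[0].message.content)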

Discussion

Findings indicate that LLMs can emulate aspects of cognitive empathy by recognizing emotions and producing supportive, context-aware responses, addressing the core research question of whether LLMs demonstrate empathy. This aligns with broader evidence of emerging social intelligence capabilities in AI. The results emphasize the distinction between cognitive and affective empathy: LLMs more consistently match the cognitive dimension, while their capacity for genuine affective empathy remains uncertain given their lack of emotional experience. Evaluation challenges persist because of nonstandardized metrics, the potential conflation of response length with perceived empathy, and cultural and contextual influences on how empathy is perceived, including whether evaluators know a response comes from an AI. In health care, integrating empathetic communication via LLMs could enhance patient satisfaction, adherence, and outcomes, particularly in tasks prioritizing cognitive empathy and professional communication. However, rigorous, standardized assessment frameworks are needed to compare models and validate empathetic capabilities across diverse contexts and populations.

Conclusion

LLMs exhibit elements of cognitive empathy, including emotion recognition and generation of emotionally supportive responses across clinical and nonclinical scenarios. Advancing empathetic AI could improve patient experiences and support health care communication, potentially enhancing adherence and outcomes. Future research should refine models to strengthen empathy-related performance, develop standardized and objective evaluation strategies for soft skills, examine cultural and contextual generalizability, and clarify the role and sufficiency of cognitive versus affective empathy in human-AI interactions.

Limitations

Review-level limitations include reliance on predominantly subjective assessments, which precluded meta-analysis; a focus on studies explicitly evaluating empathy (excluding related theory-of-mind work); and a concentration on ChatGPT-3.5 with limited evaluation of other models (notably few assessments of LLaMA variants and GPT-4), constraining generalizability to alternative or newer LLMs. Model-level limitations reported across the included studies include repetitive empathetic phrases, occasional failure to follow initial instructions, a tendency toward overly lengthy responses, prompt sensitivity, and human evaluator subjectivity influenced by background and expectations.
