Large language models are proficient in solving and creating emotional intelligence tests

Psychology

K. Schlegel, N. R. Sommer, et al.

Large language models such as ChatGPT-4 matched or surpassed human performance on five standard emotional intelligence tests and even generated new test items of comparable difficulty. Research by Katja Schlegel, Nils R. Sommer, and Marcello Mortillaro shows that LLMs produce responses consistent with accurate knowledge of human emotions and their regulation.

Introduction
The study investigates whether large language models (LLMs) can (a) solve performance-based emotional intelligence (ability EI) tests and (b) generate new items with psychometric properties comparable to those of established tests. Emotions are central to social functioning, and ability EI—skills in recognizing, understanding, and managing emotions—is linked to positive outcomes across domains such as workplace performance. Affective computing has achieved progress in emotion recognition and its applications, but current systems remain narrow. The rise of LLMs capable of human-like dialogue suggests potential for more general affective AI. Prior work showed ChatGPT outperforming humans on the Levels of Emotional Awareness Scale and performing well on theory-of-mind tasks. The authors hypothesized that multiple LLMs would outperform average human validation samples on five EI tests and that ChatGPT-4-generated test versions would exhibit psychometric properties broadly comparable to the original tests, with at most small differences in difficulty, internal consistency, clarity, realism, item diversity, and correlations with vocabulary and other EI measures.
Literature Review
The paper situates its work within affective computing, noting advances in automatic emotion recognition across modalities and applications in healthcare, education, and workplace contexts. It highlights debates about LLMs’ capacity for empathy and psychological constructs (personality, theory of mind, emotions), citing evidence that LLM outputs align with psychological knowledge and can be rated higher than human professionals for empathy and quality in certain contexts. Prior studies include ChatGPT scoring above humans on the Levels of Emotional Awareness Scale and LLM performance on false-belief theory-of-mind tasks. The authors argue that cognitive empathy (accurate reasoning about emotions and regulation) is a prerequisite for effective socio-emotional AI. They also review psychometric test development approaches and item response theory validations of brief EI tests (STEU-B, STEM-B), informing expectations for GPT-generated item quality.
Methodology
The research comprised two parts. Part 1 assessed LLMs' accuracy on five ability EI tests: the Situational Test of Emotion Management (STEM; 44 items), the Situational Test of Emotion Understanding (STEU; 42 items), the Geneva EMOtion Knowledge Test—Blends (GEMOK-Blends; 20 items), the GECo Emotion Regulation subtest (28 vignettes; two adaptive vs. two maladaptive options, scored 0/0.5/1), and the GECo Emotion Management subtest (20 vignettes; five conflict strategies). The LLMs tested were ChatGPT-4, ChatGPT-o1, Copilot 365, Claude 3.5 Haiku, Gemini 1.5 Flash, and DeepSeek V3. Each LLM was prompted with the original instructions and items (chunked where needed due to character limits) and solved each test 10 times in separate conversations (December 2024–January 2025). Mean scores and SDs across trials were computed and compared with human validation sample means via independent-samples t tests (the ChatGPT-4 single-run comparison was preregistered; the additional LLMs and repeated trials were not).

Part 2 evaluated ChatGPT-4's ability to generate new items and assessed their psychometric properties. Using the first of the 10 ChatGPT-4 trials as the basis, the model was given the correct answers for any incorrectly solved items and instructed to generate new items matching the formal criteria of each test and to define the correct answer for each item (target numbers: STEM-B 18 items, STEU-B 19, GEMOK-Blends 20, GECo Regulation 28, GECo Management 20). Final prompts and generated items are provided in the Supplementary Material.

Similarity rating study (not preregistered): N = 434 native English-speaking Prolific participants (UK/US) rated the similarity (1–7) between all pairings of original and GPT-created scenarios (without response options), covering 3174 scenario pairs grouped into 64 parcels (~50 pairs each) with 3 attention checks (paraphrased pairs with expected ratings of 6–7). Participants who failed attention checks or provided nonsensical task descriptions were excluded. The distribution of the highest similarity rating per GPT scenario was calculated.

Psychometric validation studies: Five separate online studies on Prolific each administered both the original and the GPT-generated version of one EI test, plus (a) clarity and realism ratings for each item (0–100 sliders), (b) a card-sorting task to assess item content diversity (average number of piles; up to 12, except for an error in the GECo Regulation original version that allowed 24; participants creating more than 12 piles were excluded for that metric), (c) a 20-item StuVoc vocabulary test (two attention checks; Cronbach's alpha 0.70–0.84, mean 0.80), and (d) an additional EI test assessing the same dimension for construct validity (pairings: STEM-B with GECo Management; STEU-B with GEMOK-Blends; GEMOK-Blends with STEU-B; GECo Regulation with STEM-B; GECo Management with STEM-B). The order of the original vs. GPT versions and their associated ratings/card sorting was randomized; the vocabulary test and the other EI test followed in fixed order. Participants were native English speakers (US/UK) and were prevented from taking part in more than one study; total N = 467 after exclusions (per-study Ns and demographics in Table 3). Exclusion criteria: survey completion time under 15 minutes, any test score 3 or more SDs below the mean, or failing both StuVoc attention checks. Compensation averaged £9/hour. Ethics approval: Faculty of Human Sciences, University of Bern (ID 20230803); preregistrations: https://osf.io/mgqre/registrations.
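To make the Part 1 comparison logic concrete, the sketch below (not the authors' code; all numbers are placeholders) shows how an LLM's mean accuracy across 10 repeated trials could be compared against a published human validation sample using an independent-samples t test computed from summary statistics, together with Cohen's d.

```python
# Minimal sketch of the LLM-vs-human comparison described above.
# All values are hypothetical placeholders, not data from the paper.
import numpy as np
from scipy.stats import ttest_ind_from_stats

# Hypothetical proportion-correct scores from 10 separate LLM conversations
llm_scores = np.array([0.82, 0.80, 0.84, 0.81, 0.83, 0.79, 0.85, 0.82, 0.80, 0.83])
llm_mean, llm_sd, llm_n = llm_scores.mean(), llm_scores.std(ddof=1), len(llm_scores)

# Summary statistics of a human validation sample (placeholder values)
human_mean, human_sd, human_n = 0.56, 0.12, 300

# Independent-samples t test from summary statistics; Welch correction is used
# here, though the published analyses may have assumed equal variances.
t_stat, p_val = ttest_ind_from_stats(llm_mean, llm_sd, llm_n,
                                     human_mean, human_sd, human_n,
                                     equal_var=False)

# Cohen's d based on the pooled SD of the two groups
pooled_sd = np.sqrt(((llm_n - 1) * llm_sd**2 + (human_n - 1) * human_sd**2)
                    / (llm_n + human_n - 2))
d = (llm_mean - human_mean) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_val:.4f}, d = {d:.2f}")
```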
Analytic approach: Pooled t tests (SPSS v27) compared test difficulty, clarity, realism, and item content diversity between original and GPT versions; fixed-effects multilevel meta-analyses (R v4.4.1, metafor, REML) compared internal consistency (average item-total correlations, Fisher z-transformed) and construct validity (correlations with StuVoc and with another EI test), with test type (original vs. GPT) as moderator. Equivalence was assessed with two one-sided tests (TOST; TOSTER package) using predefined smallest effect sizes of interest: d ±0.20 for mean differences and r ±0.15 for correlations. FDR corrections were applied to test-level p values. Additional analyses included the ICC of item accuracy across LLMs and the item-level human–LLM difficulty correlation.
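The following sketch illustrates the TOST equivalence logic with a smallest effect size of interest of d = ±0.20, assuming a within-subject comparison of original and GPT versions (consistent with the reported df of N − 1). It is a minimal Python analogue of the TOSTER analyses, not the authors' code, and all data are simulated placeholders.

```python
# Minimal paired-samples TOST sketch with equivalence bounds of d = +/-0.20.
# Simulated placeholder data; assumes each participant took both test versions.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
scores_original = rng.normal(0.60, 0.15, 467)            # hypothetical original-version scores
scores_gpt = scores_original + rng.normal(0.00, 0.10, 467)  # hypothetical GPT-version scores

diff = scores_original - scores_gpt
sd_diff = diff.std(ddof=1)
sesoi_d = 0.20                                            # equivalence bound in Cohen's d units
lower, upper = -sesoi_d * sd_diff, sesoi_d * sd_diff      # bounds converted to raw score units

# Two one-sided tests: the mean difference lies above the lower bound
# AND below the upper bound.
t_low, p_low = ttest_1samp(diff, lower, alternative='greater')
t_up, p_up = ttest_1samp(diff, upper, alternative='less')

p_tost = max(p_low, p_up)                                 # TOST equivalence p value
d_z = diff.mean() / sd_diff                               # standardized paired difference (d_z)
print(f"d_z = {d_z:.3f}, TOST p = {p_tost:.4f}",
      "-> equivalent within +/-0.20" if p_tost < 0.05 else "-> equivalence not established")
```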
Key Findings
LLM solution performance: Across the five EI tests, mean LLM accuracy was 81% versus 56% for the human validation samples; all LLMs performed more than 1 SD above the human means, with ChatGPT-o1 and DeepSeek V3 more than 2 SDs above. Item-level agreement among LLMs was high (ICC = 0.88), and human and LLM item difficulties correlated at r = 0.46 across 105 items, indicating similar cue utilization. Per-test LLM vs. human differences (Table 5) showed large effects: STEM d = 4.077, STEU d = 1.543, GEMOK-Blends d = 1.139, GECo Regulation d = 3.100, GECo Management d = 1.898; all p < 0.001 (FDR-corrected).

Similarity of GPT-generated scenarios: Among the 105 GPT-created scenarios, 12% had a highest similarity rating of 5 or above with any original scenario; 88% did not reach this threshold, indicating that GPT did not merely paraphrase the originals.

Psychometric comparisons (pooled across the five studies, N = 467):
- Test difficulty: No significant mean difference, t(466) = -1.223, p = 0.222; equivalence confirmed within d ±0.20, t(466) = 19.61, p < 0.001; d = -0.057 [-0.147; 0.034]. At the test level, GPT-created versions were easier for STEM-B and GECo Regulation, originals were easier for GEMOK-Blends and GECo Management, and STEU-B showed no significant difference.
- Clarity: No significant pooled difference, t(466) = -1.799, p = 0.073; equivalence not established; d = -0.083 [-0.174; 0.008]. Clarity was rated higher for the GPT versions of all tests except GEMOK-Blends.
- Realism: GPT versions were rated slightly higher, t(466) = -2.746, p = 0.006; d = -0.127 [-0.218; -0.036]. At the test level, GPT versions were rated higher for STEU-B, GECo Regulation, and GECo Management; the original was higher for GEMOK-Blends; STEM-B showed no difference.
- Item content diversity (card sorting): Originals were perceived as more diverse, t(456) = 4.446, p < 0.001; d = 0.208 [0.115; 0.301], with significant test-level differences favoring the originals for GEMOK-Blends, GECo Regulation, and GECo Management.
- Internal consistency: The moderator test was not significant (QM = 0.635, df = 1, p = 0.426), and equivalence within r ±0.15 was not established (z = -0.071, p = 0.139). Average item-total correlations were r = 0.183 for the originals and r = 0.259 for the GPT versions (a small difference whose CI exceeded the equivalence bounds). Cronbach's alpha was higher for the original STEU-B than for its GPT version, and higher for the GPT GECo Regulation version than for its original (χ² = 76.224, p = 0.005).
- Correlations between original and GPT test versions: Weighted mean r = 0.46, p < 0.001.
- Construct validity: For correlations with StuVoc, the weighted mean r was 0.244 for the originals (p < 0.01) versus 0.137 for the GPT versions (ns); the moderator was not significant (QM = 2.651, p = 0.104), equivalence was not established (z = -0.047, p = 0.236), and the effect was small, d = 0.217 [-0.044; 0.492]. For correlations with another EI test, the weighted mean r was 0.323 for the originals (p < 0.001) versus 0.236 for the GPT versions (p < 0.001); the moderator was not significant (QM = 2.189, p = 0.149), equivalence was not established (z = -0.064, p = 0.164), and d = 0.197 [-0.064; 0.471]. Individual test-level differences in these correlations were not significant but underpowered.

Overall, GPT-generated tests were largely comparable to the originals: equivalence was confirmed only for test difficulty, and the remaining differences were small, none exceeding a medium effect size.
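To make the pooling and correction steps behind these numbers concrete, the sketch below shows a Fisher z-weighted mean correlation and a Benjamini-Hochberg FDR adjustment. It is illustrative only; the correlations, sample sizes, and p values are hypothetical, not those reported in the study.

```python
# Minimal sketch of pooling correlations across studies via Fisher's z and
# applying an FDR correction to a set of test-level p values.
# All values are hypothetical placeholders.
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical original-vs-GPT version correlations and per-study sample sizes
rs = np.array([0.40, 0.52, 0.45, 0.48, 0.43])
ns = np.array([90, 95, 92, 98, 92])

z = np.arctanh(rs)                        # Fisher r-to-z transform
weights = ns - 3                          # inverse-variance weights for Fisher z
weighted_mean_r = np.tanh((weights * z).sum() / weights.sum())
print(f"weighted mean r = {weighted_mean_r:.2f}")

# Hypothetical per-test p values, corrected with Benjamini-Hochberg FDR
p_values = np.array([0.0001, 0.0004, 0.012, 0.0002, 0.0009])
reject, p_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print("FDR-corrected p values:", np.round(p_fdr, 4))
```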
Discussion
Findings show that multiple contemporary LLMs substantially outperform average human validation samples on diverse ability EI tests, suggesting that LLMs’ responses are aligned with accurate emotion knowledge and regulation strategies. The moderate correlation between human and LLM item accuracies indicates similar reliance on item cues, hinting at convergent reasoning patterns. ChatGPT-4 was able to generate entire situational judgment tests with generally acceptable psychometric properties using few prompts. While equivalence was affirmed for test difficulty, GPT tests tended to have slightly higher realism and clarity, somewhat lower item content diversity, and small differences in internal consistency and construct validity indices relative to originals. The strong correlations between original and GPT versions imply they assess similar constructs. Implications include the potential for LLMs to serve as emotionally intelligent agents in applied domains (healthcare, education, customer service), delivering consistent, high-level cognitive empathy through reliable emotion understanding and management. For assessment development, LLMs can accelerate item generation, even for complex formats (e.g., conflict management strategies; blended emotions with multi-component cues), though human-led piloting and validation remain essential to refine item pools and optimize construct validity. The study does not address affective empathy (feeling with others), but many applications may not require it to achieve intended outcomes. The results support integrating LLMs into socio-emotional tools and bolster arguments that LLMs could be components of broader AGI systems.
Conclusion
The study demonstrates that six LLMs markedly exceed human averages on five established ability EI tests, and that ChatGPT-4 can generate new EI test items producing tests with psychometric properties largely similar to originals, with confirmed equivalence for test difficulty and small differences elsewhere. These findings suggest LLMs can reliably produce responses consistent with accurate emotion knowledge and regulation. Future research should examine performance in complex, less-structured real-world emotional interactions; evaluate cross-cultural applicability and adaptations; improve transparency and explainability; and continue validating GPT-assisted test development workflows to optimize item diversity and construct validity.
Limitations
- The tasks used were standardized, structured performance tests; real-world emotional interactions are more ambiguous and context-rich and may challenge LLMs, which can be sensitive to prompt changes.
- Context integration over longer dialogues (e.g., user history, personality, preferences) was not assessed; many studies rely on single prompts.
- Cultural bias: The tests were developed in Western contexts, and LLM training data are largely Western-centric; emotion expressions, display rules, and regulation strategies vary across cultures, potentially limiting generalizability.
- The black-box nature of LLMs limits insight into their reasoning processes; model updates or data changes could alter performance or reduce creativity/diversity in item generation.
- While GPT-generated tests were broadly comparable, some showed lower item content diversity and slightly weaker associations with vocabulary and other EI tests; individual test-level analyses of the correlations were underpowered.
- The item generation process relied on ChatGPT-4's outputs and iterative prompt engineering; despite attention to formal criteria, human expert review and pilot testing remain necessary to optimize psychometric quality.