Artificial intelligence and social intelligence: preliminary comparison study between AI models and psychologists

Psychology

N. S. Sufyan, F. H. Fadhel, et al.

This study by Nabil Saleh Sufyan, Fahmi H. Fadhel, Saleh Safeer Alkhathami, and Jubran Y. A. Mukhadi compares the social intelligence of the advanced AI models ChatGPT-4, Google Bard, and Bing with that of human psychologists. The results show that ChatGPT-4 outperforms the psychologists on a standard social intelligence measure, signalling rapid advances in AI's understanding of social behavior and its potential role in psychotherapy.

~3 min • Beginner • English
Introduction
The study investigates whether artificial intelligence, specifically large language models (LLMs) such as ChatGPT-4, Google Bard, and Bing, can reach or exceed human levels of social intelligence (SI), an ability vital for effective counseling and psychotherapy. Against the backdrop of rapid AI advances during the Fourth Industrial Revolution and the emergence of powerful LLMs, the authors frame SI as the capacity to understand others' feelings, emotions, and needs, and to act wisely in social situations. The purpose is to compare the SI performance of LLMs with that of psychologists (bachelor's and doctoral students) using a human SI measurement tool, and to assess the implications for psychotherapy practice and potential AI integration.
Literature Review
Prior research shows AI can support psychotherapy through diagnosis, measuring mental/emotional disorders, suicide risk reduction, and empirical analysis (de Mello & de Souza, 2019; Lei et al., 2023; Morales et al., 2017; Horn & Weisz, 2020; Gottschalk, 1999; Zhai et al., 2005). AI has been shown to assess personality traits accurately and relate EEG to mental abilities and personality, supporting AI-based assessments (Kachur et al., 2020; Maksimenko et al., 2018). Conversational models (GPT-2, DialoGPT) can be adapted for psychotherapy dialogue quality (Das et al., 2022), and ChatGPT has demonstrated empathetic responses sometimes preferred over physicians’ answers (Ayers et al., 2023; Eshghie & Eshghie, 2023). However, concerns include underestimation of suicide risk and other diagnostic errors, hallucinations, inconsistent responses, and ethical issues; AI should be viewed as a tool rather than a replacement for therapists (Elyoseph & Levkovich, 2023; Woodnutt et al., 2023; Sedlakova & Trachsel, 2023; Grodniewicz & Hohol, 2023; Chang et al., 2023). OpenAI reports suggest GPT-4 performs at human-like or superior levels on various standardized exams, indicating potential for advanced cognitive and social capacities. The literature thus presents both promise and caution regarding AI’s role in mental health and SI evaluation.
Methodology
Design: Comparative study evaluating the SI of AI LLMs versus human psychologists using a human SI scale.
Participants: 180 male counseling psychology students from King Khalid University (Saudi Arabia) in 2023–2024; 72 bachelor's students (age 20–28; mean 22.68±7.895) and 108 doctoral students (age 33–46; mean 40.55±6.288).
Sampling: Stratified random sampling by education level.
AI models: Registered ChatGPT-4 (OpenAI, 2023), free Google Bard, and free Bing.
Procedure: One evaluation per AI model on August 1, 2023. All human participants and each AI model completed the Social Intelligence (SI) Scale (Sufyan, 1998; Arabic), consisting of 64 items across two dimensions: (1) soundness of judgment of human behavior (41 true/false items) and (2) the ability to act in social situations (23 multiple-choice items with one correct answer). AI models were prompted to select answers for each of the 64 standard SI scenarios; responses were collected individually. Human participants completed the scale via an emailed link.
Instrument and psychometrics: The SI Scale was adapted from the George Washington University Brief Scale of SI. Content validity for the current study was reviewed by 10 psychology professors (minor grammatical modifications to items 1, 7, 12, 23). Construct validity: item-total point-biserial correlations ranged 0.39–0.48 (p<0.05); dimension-total Pearson correlations were 0.82 (dimension 1) and 0.73 (dimension 2). Test-retest reliability (n=20 undergraduates; 1-month interval; Spearman correction): 0.67 (dimension 1), 0.69 (dimension 2), 0.77 (overall).
Scoring: Each item scored 0/1; dimension ranges 0–41 and 0–23; total 0–64 (higher indicates higher SI).
Statistical analysis: IBM SPSS v28. Point-biserial correlations for item-total relations; Pearson correlations for dimension-total relations; an independent samples test for reliability; one-sample t-tests comparing each AI model's single total score to the human distributions (bachelor's and doctoral groups). Descriptives (means, SDs, percentiles) were used for rankings.
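The one-sample comparison described above reduces to a simple formula: the difference between an AI model's single total score and a human group's mean, divided by that group's standard error. A minimal sketch in Python (the function name and rounding are illustrative choices; the summary statistics are the study's):

```python
import math

def one_sample_t(group_mean, group_sd, n, ai_score):
    """t-statistic for a one-sample t-test comparing a human group's
    mean SI score against a single AI model's total score."""
    se = group_sd / math.sqrt(n)          # standard error of the group mean
    return (ai_score - group_mean) / se   # positive when the AI scores higher

# Bachelor's group (M=39.19, SD=7.927, n=72) vs. ChatGPT-4 (59):
t = one_sample_t(group_mean=39.19, group_sd=7.927, n=72, ai_score=59)
print(round(t, 1))  # 21.2 -- close to the reported T=21.201; the small gap
                    # comes from rounding of the published means
```

Running the same formula with the doctoral group's statistics (M=46.73, SD=5.974, n=108) against ChatGPT-4's score gives roughly 21.3, again matching the reported 21.341 to rounding.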
Key Findings
- Human SI means: bachelor's M=39.19, SD=7.927 (n=72); doctoral M=46.73, SD=5.974 (n=108).
- AI SI raw scores: ChatGPT-4=59; Bing=48; Google Bard=40.
- One-sample t-tests: ChatGPT-4 scored significantly higher than both the bachelor's and doctoral groups (T=21.201 and 21.341; p<0.001). Bing scored significantly higher than both groups (bachelor's: T=9.426, p<0.001; doctoral: T=2.207, p<0.05). Google Bard did not differ significantly from the bachelor's group (reported narratively, without a test statistic) but scored significantly lower than the doctoral group (T=11.709, p<0.001).
- Percentile comparisons: ChatGPT-4 exceeded 100% of psychologists at both education levels. Bing outperformed 90% of bachelor's and 50% of doctoral participants. Google Bard outperformed only 10% of doctoral participants and sat at roughly the 50th percentile of bachelor's participants.
- Ranking of AI models by SI performance: ChatGPT-4 > Bing > Google Bard.
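The percentile comparisons can be cross-checked from the reported means and SDs by treating each group's scores as roughly normal. This is only an illustrative approximation under that normality assumption (the paper reports empirical percentiles, and the function name here is a made-up label):

```python
import math

def normal_percentile(ai_score, group_mean, group_sd):
    """Approximate share of a human group scoring below the AI model,
    assuming roughly normal SI scores (an assumption; the paper
    reports empirical percentiles)."""
    z = (ai_score - group_mean) / group_sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

# Bing (48) vs. the bachelor's group (M=39.19, SD=7.927):
print(round(normal_percentile(48, 39.19, 7.927), 2))  # 0.87 -- in line with
                                                      # the reported ~90%
```

Under the same approximation, Bard (40) against the doctoral group (M=46.73, SD=5.974) lands near the 13th percentile, consistent with the reported ~10%.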
Discussion
Applying a human SI measure to AI LLMs, the study found that AI, especially ChatGPT-4, can meet or exceed human performance on SI tasks. Bing performed comparably to or slightly above doctoral-level participants and clearly above bachelor-level participants, while Google Bard performed comparably to bachelor-level but below doctoral-level participants. These findings suggest rapid advances in AI's capacity to understand emotions and social behavior, with implications for psychotherapy, where SI underpins rapport, empathy, and decision-making. The authors attribute Bard's lower performance at the time of the study to its earlier stage of development and algorithmic maturity. They also discuss ethical and professional implications: while AI shows promise as an assistive tool, the risks of diagnostic errors, hallucinations, and ethical concerns (confidentiality, authenticity of empathy, client rights) necessitate guidelines, evaluation, and clear role definition. Overall, the results indicate that AI can meaningfully assist psychotherapists, and the profession should adapt to integrate AI safely and effectively.
Conclusion
ChatGPT-4 and Bing demonstrated higher SI than psychologists at both bachelor's and doctoral levels, whereas Bard performed on par with bachelor-level psychologists and was outperformed by doctoral-level psychologists. The AI models ranked: ChatGPT-4 highest, followed by Bing, then Google Bard. These results support the feasibility of evaluating AI systems with human SI measures and indicate rapid AI progress in socially relevant capabilities. The findings can inform psychotherapists' use of AI in assessment and care planning. The authors call for further research evaluating SI and related personality aspects in AI models, including longitudinal tracking as models evolve and broader applications in assessment, diagnosis, and psychotherapy.
Limitations
- The psychometric re-validation sample for the SI scale was small and homogeneous; broader validity and reliability assessments are needed.
- AI models were evaluated only once; rapid model updates precluded test-retest of AI performance, limiting assessment of temporal stability.
- Different access levels (paid ChatGPT-4 versus free Bing and Bard) may have influenced comparative performance.
- The human sample comprised male counseling psychology students from a single Saudi university (bachelor's and doctoral programs only), limiting generalizability; it does not represent practicing psychotherapists in Saudi Arabia or broader populations.
- A larger, more representative psychologist sample was difficult to obtain. Future studies should include diverse, larger samples and longitudinal designs.