Question-based computational language approach outperforms rating scales in quantifying emotional states

Psychology

S. Sikström, I. Valavičiūtė, et al.

Discover how innovative natural language processing (NLP) techniques outshine traditional rating scales in accurately categorizing emotional states. This research by Sverker Sikström, Ieva Valavičiūtė, Inari Kuusela, and Nicole Evors shows that word-based responses can significantly enhance our understanding of emotional states such as depression, anxiety, satisfaction, and harmony.

Introduction
The quantification of psychological constructs relies heavily on closed-ended rating scales. However, these scales suffer from limitations such as one-dimensionality, central tendency bias, halo effects, and constraints on self-observation. Open-ended language responses offer a more holistic representation of mental states, potentially capturing diverse symptom presentations not fully encompassed by diagnostic manuals such as the DSM-5. Manual analysis of open-ended responses, however, is time-consuming and prone to bias. Advances in natural language processing (NLP) and machine learning (ML) provide efficient tools for analyzing language data, offering a potential alternative to rating scales. This study investigates whether a question-based computational language assessment (QCLA), which uses NLP to analyze descriptive word responses to open-ended questions, categorizes emotional states more accurately than traditional rating scales.
Literature Review
The dominance of closed-ended rating scales in behavioral science research (87% of a sample of Journal of Personality and Social Psychology articles) is questioned due to their inherent limitations. Open-ended language responses, analyzed with NLP, are proposed as a superior alternative. Previous research comparing language-based and rating-scale methods has primarily used rating scales as the outcome measure, limiting the ability to determine which approach has higher validity. Recent advances in NLP, particularly transformer-based models such as BERT, have improved the accuracy and efficiency of analyzing language data for predicting mental health outcomes. These models have shown promise in applications including sentiment analysis, text classification, and mental health diagnostics. Studies have reported correlations between computationally analyzed text responses and rating scales such as PANAS, Ryff's scales, the SWLS, and others. The QCLA approach poses open-ended questions whose text responses are transformed into quantifiable vectors using NLP, capturing both the severity and the description of mental states. Because past language-based methods were validated against rating scales as the outcome measure, however, they could not demonstrate accuracy beyond that of the scales themselves.
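To make the QCLA text-to-vector step concrete, here is a minimal sketch of embedding five descriptive words with BERT, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the study's exact model and pooling strategy are not specified here, so treat this as illustrative:

```python
# Minimal sketch: turning five descriptive words into one quantifiable vector.
# Assumes Hugging Face `transformers` and the `bert-base-uncased` checkpoint;
# the study's exact model, tokenization, and pooling may differ.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed_words(words):
    """Return a single 768-dimensional vector for a list of descriptive words."""
    text = " ".join(words)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into one fixed-length vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = embed_words(["hopeless", "tired", "empty", "numb", "isolated"])
print(vec.shape)  # torch.Size([768])
```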
Methodology
This two-phase study involved 731 participants (297 in Phase 1, 434 in Phase 2). Phase 1 participants wrote an autobiographical narrative about a self-experienced episode of depression, anxiety, satisfaction, or harmony, summarized it with five descriptive words, and completed the relevant rating scales (PHQ-9, GAD-7, SWLS, HILS). Phase 2 participants read Phase 1 narratives, described the emotional state in five words, and completed the same rating scales. Descriptive words were quantified using BERT embeddings. Multinomial logistic regression categorized responses into emotional states based on word embeddings, rating scale scores, or both, and a 10-fold cross-validation procedure assessed classification accuracy. Inter-rater reliability was compared between rating scales and word responses. A subset of Phase 2 participants, 34 healthcare professionals, was compared to the non-professional group (N = 400). Word clouds visualized words significantly associated with each emotional state. Data preprocessing involved correcting spelling errors and removing irrelevant responses. The BERT embeddings (768 dimensions) underwent singular value decomposition (SVD) for dimensionality reduction before classification, with the optimal number of dimensions determined within each fold of the 10-fold cross-validation. Statistical analyses included chi-squared tests, multinomial logistic regression, and Pearson correlation. A semantic t-test generated the word clouds.
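A minimal sketch of this classification pipeline using scikit-learn follows. The random stand-in data, the grid of SVD dimensions, and the hyperparameters are illustrative assumptions, not the study's settings:

```python
# Sketch: reduce 768-dimensional BERT embeddings with SVD, then categorize
# emotional states with multinomial logistic regression, estimating accuracy
# via 10-fold cross-validation. Data and hyperparameters are stand-ins.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(731, 768))   # stand-in for BERT word embeddings
y = rng.integers(0, 4, size=731)  # 4 states: depression, anxiety,
                                  # satisfaction, harmony

pipe = Pipeline([
    ("svd", TruncatedSVD()),
    ("clf", LogisticRegression(max_iter=1000)),  # multinomial with lbfgs
])

# Tune the number of retained SVD dimensions by inner cross-validation,
# then estimate classification accuracy with 10-fold outer cross-validation.
search = GridSearchCV(pipe, {"svd__n_components": [25, 50, 100]}, cv=5)
scores = cross_val_score(search, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.2f}")
```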
Key Findings
The study's primary hypothesis, that word responses would show superior predictive capability compared to rating scales, was supported. In Phase 2 (non-professionals), word responses achieved significantly higher accuracy (64%) in categorizing emotional narratives compared to rating scales (44%). Using individual rating scale items yielded even lower accuracy (30%). The second hypothesis, that combining word and rating scale data would increase accuracy, was not supported; the accuracies were similar (63% vs. 64%). Accuracy differed significantly between word responses and rating scales for satisfaction and anxiety, with rating scales performing particularly poorly for anxiety (15% lower accuracy across all conditions). The third hypothesis, concerning higher inter-rater reliability for word responses, was also supported. Word responses exhibited significantly higher correlations between Phase 1 and Phase 2 compared to rating scales, for all scales examined (HILS, SWLS, PHQ-9, GAD-7). The exploratory hypothesis comparing professionals and controls showed no significant difference in accuracy between word and rating scale methods, but the small professional sample size (N=34) limited the power of this comparison. Accuracy measures (correct categorizations/total categorizations) were significantly higher for word responses (82-86%) compared to rating scales (66-79%), as were precision measures (proportion of correct positive categorizations). Confusion matrices revealed more errors in the rating scale model, particularly misclassifying depression as anxiety. Pearson correlations between multinomial coefficient estimates were higher for the rating scale model, indicating greater confusion among emotional states in the rating scale analysis. Word clouds effectively displayed words strongly associated with each emotional state.
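As a concrete illustration of the reported evaluation metrics, here is a minimal sketch of computing a confusion matrix and per-state precision with scikit-learn; the labels and predictions below are made-up stand-ins, not the study's data:

```python
# Sketch: confusion matrix plus per-state precision (proportion of correct
# positive categorizations). y_true/y_pred are illustrative stand-ins.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

labels = ["depression", "anxiety", "satisfaction", "harmony"]
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 0, 2, 2, 3, 2])  # e.g., depression/anxiety mix-ups

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true state, columns = predicted state

for name, p in zip(labels, precision_score(y_true, y_pred, average=None)):
    print(f"{name}: precision = {p:.2f}")
```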
Discussion
The findings show that NLP analysis of open-ended, five-word responses provides more accurate categorization of emotional states than traditional rating scales. This suggests that language-based measures may have higher validity than rating scales. The superior performance of word responses was consistent across several analyses, with large effect sizes observed in Phase 2. The limitations of rating scales in capturing the nuances of anxiety were highlighted. The better discrimination between closely related emotional states by word responses suggests a more fine-grained assessment of emotional experience. The increased accuracy and precision of the word-based approach offer a valuable alternative for mental health assessments. The study's strength is the use of an independent outcome criterion (narrative content) to compare the validity of word responses and rating scales, unlike previous research that validated language-based measures against rating scales.
Conclusion
This study provides strong evidence for the superior validity of QCLA using five-word responses compared to traditional rating scales in categorizing emotional states. The NLP-based approach offers several advantages: language is a natural mode of expressing mental states, it provides a more person-centered approach, and it is efficient to administer. Future research should explore QCLA's applicability in classifying narratives into specific diagnoses and investigate the use of alternative language models. Ethical considerations regarding large-scale NLP analysis of mental health data are crucial.
Limitations
The study's reliance on participants' recall of past emotional episodes may introduce recall bias. Laypersons' understanding of terms like 'anxiety' and 'depression' may differ from clinical definitions, potentially affecting the comparison between self-reported and clinically assessed data. The small sample of healthcare professionals limits conclusions about their assessment accuracy. Constraining responses to single words might limit the richness of the information captured; future research should investigate the use of short phrases. Finally, the study design prioritized establishing a historical baseline of emotional experiences, potentially sacrificing the capture of subtle, current emotional fluctuations.