Question-based computational language approach outperforms rating scales in quantifying emotional states

Psychology

S. Sikström, I. Valavičiūtė, et al.

Discover how natural language processing (NLP) techniques can outperform traditional rating scales in accurately categorizing emotional states. This research by Sverker Sikström, Ieva Valavičiūtė, Inari Kuusela, and Nicole Evors shows that word-based responses can significantly enhance our understanding of emotional states such as depression, anxiety, satisfaction, and harmony.

Introduction
The study addresses whether open-ended, language-based responses analyzed with NLP can assess emotional states more accurately than traditional closed-ended rating scales. Motivated by limitations of rating scales (one-dimensionality, central-tendency bias, halo effects, and constrained self-observation), the authors propose a person-centered approach that uses descriptive words to capture nuanced emotional experiences. They compare language-derived measures against standard scales for depression (PHQ-9), anxiety (GAD-7), satisfaction (SWLS), and harmony (HILS) using an outcome criterion independent of rating scales: the target emotion underlying participants' self-generated narratives. The pre-registered hypotheses were: (H1) descriptive word responses would outperform rating scales in categorizing emotional states; (H2) combining words and rating scales would improve accuracy over either alone; (H3) words would show higher inter-rater reliability than rating scales when matched on narratives; and (H4, exploratory) healthcare professionals might show stronger associations with rating scales than non-professionals.
Literature Review
The paper situates its contribution within research highlighting the dominance and limitations of rating scales in psychology and growing skepticism about their validity for capturing complex human experiences. Open-ended language responses can flexibly reflect diverse symptom expressions beyond diagnostic checklists (e.g., DSM-related critiques). NLP and ML have advanced predictive capabilities across domains and have been widely applied to mental health using clinical notes, electronic health records, and social media. Transformer-based models, especially BERT, substantially improved contextual semantic representations. Prior work shows that text-based answers can predict well-being and affect scales (e.g., PANAS, SWLS), and that tasks like the Emotional Recall Task correlate with affect measures. DASentimental mapped psychometric measures to user-generated word sequences, and Question-based Computational Language Assessment (QCLA) has provided valid severity measures and nuanced descriptions of mental states. However, prior validations often relied on correlations with rating scales themselves, making it difficult to assess whether language measures surpass scales in validity. This study instead uses a ground-truth criterion (self-specified narrative emotion) to directly compare the two approaches.
Methodology
Ethics and preregistration: Approved by the Regional Ethics Board in Lund, Sweden (Dnr 202104627), and conducted under the Declaration of Helsinki. Written informed consent was obtained. The study design, hypotheses, and analysis plan were preregistered on OSF (https://osf.io/6fx72; 2022-03-15).

Participants: Recruited via Prolific; inclusion criteria: age ≥18, native English speakers. Phase 1: 350 completed; 53 failed attention checks; final N=297. Phase 2: 465 completed; 31 excluded; final N=434, including 34 healthcare professionals (doctor, nurse, paramedic, pharmacist, psychologist, social worker, emergency medical employee). Combined final N=731 (women=428; men=281; gender undisclosed=22), age 18–79 (M=31.97, SD=12.71).

Design: A two-phase paradigm. Phase 1 participants wrote an autobiographical narrative (at least ~5 sentences) about a specific past period in which they experienced one assigned emotion (depression, anxiety, satisfaction, or harmony; assignment randomized and balanced), then provided five single-word descriptors of the emotional content (they were prohibited from using the target emotion word), and completed four scales adapted to the described past period: PHQ-9 (depression), GAD-7 (anxiety), SWLS (life satisfaction), and HILS (harmony). Phase 2 participants read a Phase 1 narrative (each narrative read at least once and no more than twice), then provided five descriptive words and completed the same rating scales, adapted to the author's emotional state (wording adjusted from "you" to "the author/they"). Each Phase 2 participant evaluated one narrative. Four attention-check items (one per scale) were embedded in both phases; any failure led to exclusion.

Measures: PHQ-9 (0–27), GAD-7 (0–21), SWLS (5–35), and HILS (5–35), with standard item formats but instructions tailored to the narrative's period (Phase 1: the respondent's own past period; Phase 2: the author's past period). The five descriptive words were constrained to single tokens (no phrases); spellcheck corrections were applied when the meaning was clear.

Procedure: Data were collected via Qualtrics; reCAPTCHA was used. Participants were informed about the study purpose, anonymity, withdrawal rights, and researcher contacts. Narratives had no word limit (M=81.32 words, SD=40.66). Minimal manual cleaning addressed typos and repetitions; no narratives were removed. Participants were instructed not to use the assigned emotion word in narratives or word lists.

NLP and ML analysis: Descriptive words were quantified using BERT (bert-base-uncased), taking 768-dimensional embeddings from the last layer. Feature sets for classification: (a) words-only embeddings; (b) rating-scale totals (4 dimensions); (c) combined words plus rating totals (772 dimensions). To address high dimensionality and ensure a fair comparison, singular value decomposition (SVD) dimensionality reduction was applied to all feature sets (including rating-only, although not computationally necessary there). Multinomial logistic regression classified each response into one of four emotional states. Ten-fold cross-validation ensured train/test separation, with folds constructed to prevent the same narrative from appearing in both train and test sets. In each training fold, the optimal number of SVD dimensions was selected by minimizing misclassifications over the candidate dimensions 1, 2, 3, 5, 7, 10, 14, 19, 26, 35, 46, 61, 80, 105, 137, 179, 234, 305, 397, 517, and 768. Mean optimized dimensions: words-only M=70 (SD=14); rating-only M=3.64 (SD=0.48). Normality of data distributions was assumed (not formally tested).
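A minimal Python sketch of this pipeline, using Hugging Face transformers and scikit-learn, is given below. The authors ran their analyses via SemanticExcel.com and MATLAB, so everything here is illustrative: the mean pooling of last-layer token vectors, the inner 80/20 validation split used to pick the SVD dimensionality, and all function names are assumptions rather than the paper's exact procedure. Rating-scale totals could be concatenated to the embeddings to form the 772-dimensional combined feature set.

```python
# Illustrative sketch: BERT word embeddings -> SVD -> multinomial logistic
# regression, with narrative-grouped 10-fold cross-validation.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, train_test_split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

# Candidate SVD dimensionalities reported in the paper.
CANDIDATE_DIMS = [1, 2, 3, 5, 7, 10, 14, 19, 26, 35, 46, 61, 80,
                  105, 137, 179, 234, 305, 397, 517, 768]

def embed_word_list(words):
    """Average 768-dim last-layer BERT vectors over the five descriptive
    words (mean pooling over tokens is an assumption, not the paper's spec)."""
    vecs = []
    for w in words:
        inputs = tokenizer(w, return_tensors="pt")
        with torch.no_grad():
            out = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
        vecs.append(out.mean(dim=1).squeeze(0).numpy())
    return np.mean(vecs, axis=0)

def cross_validated_accuracy(X, y, narrative_ids):
    """10-fold CV keeping each narrative in a single fold; the SVD
    dimensionality is tuned inside each training fold."""
    correct = 0.0
    for tr, te in GroupKFold(n_splits=10).split(X, y, narrative_ids):
        X_tr, y_tr = X[tr], y[tr]
        # Inner split to pick the dimensionality minimizing misclassifications.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=0.2, random_state=0, stratify=y_tr)
        best_k, best_err = None, np.inf
        for k in CANDIDATE_DIMS:
            if k >= min(X_fit.shape):  # TruncatedSVD needs k < n_features
                continue
            svd = TruncatedSVD(n_components=k, random_state=0).fit(X_fit)
            # sklearn fits a multinomial model for multiclass targets (lbfgs).
            clf = LogisticRegression(max_iter=1000).fit(
                svd.transform(X_fit), y_fit)
            err = 1.0 - clf.score(svd.transform(X_val), y_val)
            if err < best_err:
                best_k, best_err = k, err
        svd = TruncatedSVD(n_components=best_k, random_state=0).fit(X_tr)
        clf = LogisticRegression(max_iter=1000).fit(svd.transform(X_tr), y_tr)
        correct += clf.score(svd.transform(X[te]), y[te]) * len(te)
    return correct / len(y)
```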
Additional analyses: For H3, multiple linear regression predicted continuous scores on each scale (HILS, SWLS, PHQ-9, GAD-7) from the descriptive words, using the same cross-validated approach. Word clouds were generated using a semantic t-test: normalized difference vectors were constructed between each emotion and the other three, cosine similarities were computed for each unique word under a 10-fold leave-out procedure, and the similarities were tested against zero (Bonferroni-corrected). Confusion matrices and Pearson correlations among the multinomial coefficients examined model confusability. Analyses were conducted via SemanticExcel.com; custom MATLAB code is available on request; data are available at OSF (https://osf.io/gdkcb).
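The word-cloud procedure can likewise be sketched, assuming one embedding per response word; the helper below is hypothetical and only approximates the paper's 10-fold leave-out bookkeeping.

```python
# Sketch of the "semantic t-test" used for the word clouds.
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.model_selection import KFold

def word_cloud_pvalues(X, y, words, target, n_splits=10):
    """X: (n, 768) array of word embeddings; y: (n,) array of emotion labels;
    words: list of n word strings. Returns Bonferroni-corrected p-values
    testing whether each unique word's cosine similarity to the normalized
    target-vs-other-emotions difference vector exceeds zero."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = {}
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in kf.split(X):
        # Difference vector estimated on the training fold only (leave-out).
        diff = (X[tr][y[tr] == target].mean(axis=0)
                - X[tr][y[tr] != target].mean(axis=0))
        diff /= np.linalg.norm(diff)
        for i in te:
            sims.setdefault(words[i], []).append(float(X[i] @ diff))
    n_tests = len(sims)
    pvals = {}
    for w, s in sims.items():
        if len(s) > 1:  # a one-sample t-test needs >= 2 observations
            _, p = ttest_1samp(s, 0.0, alternative="greater")
            pvals[w] = min(p * n_tests, 1.0)  # Bonferroni correction
    return pvals  # words with corrected p < .05 would enter the cloud
```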
Key Findings
Primary classification performance (Phase 2 non-professionals):
- Words vs rating scales: words achieved higher correct categorization (64%) than rating scales (44%); χ²(1, 400)=16.10, p=0.0001, φ=0.20 [0.10, 0.29].
- Individual rating items (26 items in total across PHQ-9, GAD-7, SWLS, and HILS) reached 30% accuracy, below the scale totals (44%); χ²(1, 400)=8.41, p=0.0037, φ=0.14 [0.05, 0.24].
- Combining words and rating scales (64%) did not significantly improve on words alone (63%); χ²(1, 400)=0.04, p=0.8355, φ=0.01 [−0.09, 0.11].

Per-emotion differences (Phase 2 non-professionals):
- Rating scales showed lower accuracy for satisfaction (χ²(1, 100)=5.88, p=0.0154, φ=0.25 [0.06, 0.42]) and especially for anxiety (χ²(1, 100)=45.63, p<0.01, φ=0.68 [0.55, 0.77]).
- Words showed reasonably high accuracy across all emotions (except satisfaction in the smaller professional subgroup).

Healthcare professionals (Phase 2):
- No significant difference between words (56%) and rating scales (50%); χ²(1, 34)=0.12, p=0.7260, φ=0.09 [−0.25, 0.41]. The small sample (N=34) limits conclusions.

Accuracy and precision (Phase 2 non-professionals):
- Overall accuracy was higher for words than for rating scales: χ²(1, 434)=4.04, p=0.0445, φ=0.10 [0.00, 0.19].
- Overall precision was higher for words than for rating scales: χ²(1, 434)=20.88, p<0.001, φ=0.22 [0.13, 0.31].
- Per emotion (%), rating scales: accuracy harmony 76, satisfaction 66, depression 68, anxiety 79; precision harmony 43, satisfaction 35, depression 49, anxiety 29.
- Per emotion (%), words: accuracy harmony 82, satisfaction 79, depression 80, anxiety 86; precision harmony 60, satisfaction 62, depression 64, anxiety 65.

Confusion and correlation analyses (Phase 2):
- Rating scales: the most frequent error was predicting depression when the true state was anxiety (N=64). Words: errors were more evenly distributed; the largest was predicting satisfaction when the true state was harmony (N=24).
- Correlations among multinomial coefficients were higher in the rating-scale model (0.68<r<0.92) than in the words model (0.07<r<0.47), indicating greater confusability for rating-scale-based classification.

Inter-rater agreement:
- No significant advantage for rating scales over words in pairwise agreement among Phase 2 non-professionals (χ²(1, 263)=3.50, p=0.062) or among non-professionals combined across phases (χ²(1, 721)=0.12, p=0.724).

Phase-matched correlations (H3):
- Descriptive-word-based predictions of HILS, SWLS, PHQ-9, and GAD-7 correlated more strongly across phases than the corresponding rating scales (all p<0.001 via Fisher r-to-z): HILS 0.76 vs 0.19; SWLS 0.76 vs 0.47; PHQ-9 0.77 vs 0.48; GAD-7 0.55 vs 0.17.

Overall: the findings support H1 and H3, do not support H2, and do not support the exploratory H4.
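As a consistency check, the headline χ² and φ can be reproduced from the reported percentages, which works out if the 2×2 table sums to the reported N=400 (i.e., 200 observations per method). The sketch below also shows the Fisher r-to-z comparison used for the phase-matched correlations; the sample sizes in that example are placeholders, not the paper's actual pair counts.

```python
# Re-derivation of the headline chi-square test and phi, plus Fisher r-to-z.
import numpy as np
from scipy.stats import chi2_contingency, norm

# 2x2 table (method x correct/incorrect) reconstructed from the reported
# percentages, assuming 200 observations per method so that N = 400.
table = np.array([[0.64 * 200, 0.36 * 200],   # words: correct, incorrect
                  [0.44 * 200, 0.56 * 200]])  # rating scales
chi2, p, dof, _ = chi2_contingency(table, correction=False)
phi = np.sqrt(chi2 / table.sum())  # phi effect size for a 2x2 table
print(f"chi2({dof}, N={int(table.sum())}) = {chi2:.2f}, "
      f"p = {p:.4f}, phi = {phi:.2f}")
# -> chi2(1, N=400) = 16.10, p = 0.0001, phi = 0.20, matching the paper.

def fisher_r_to_z(r1, n1, r2, n2):
    """Two-sided p-value for the difference between two independent
    correlations via the Fisher r-to-z transformation."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return 2 * norm.sf(abs(z1 - z2) / se)

# e.g. HILS cross-phase correlations, words (r=.76) vs scales (r=.19);
# the n's below are placeholders for the matched-pair counts.
print(fisher_r_to_z(0.76, 400, 0.19, 400))  # p << .001
```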
Discussion
Using an outcome criterion independent of rating scales—the target emotion used by Phase 1 authors when generating narratives—the study shows that five descriptive words analyzed with NLP can more accurately and precisely categorize emotional states than traditional rating scales. Word-based models displayed lower confusability among categories, aligning with the notion that language captures nuanced, context-dependent aspects of emotional experience better than one-dimensional numeric scales. Rating scales particularly struggled to identify anxiety in Phase 2, whereas word-based classification was more balanced across emotions. Inter-rater agreement was not superior for rating scales, and word-derived predictions showed stronger cross-phase correspondence with the same narratives’ latent scale constructs than the scales themselves, supporting reliability and validity of the language-based approach. Practically, open-ended responses offer person-centered, natural communication and can be brief (five words), making them suitable complements to rating scales in clinical and research assessment. Ethically, while language-based NLP presents opportunities for scalable assessment, transparent, regulated deployment is required to mitigate privacy risks and potential biases.
Conclusion
This study introduces and validates a two-phase paradigm for assessing emotional states, demonstrating that five descriptive words analyzed with NLP (BERT embeddings with multinomial regression) outperform traditional rating scales in categorizing narratives for depression, anxiety, satisfaction, and harmony. Word responses achieved 64% correct categorization versus 44% for rating scales in a large non-professional sample, with higher accuracy and precision and lower confusability. These results suggest that QCLA can provide a more precise and naturalistic assessment of psychological constructs and should complement traditional scales in mental health assessment. Future work should: evaluate performance on clinically diagnosed populations and real-world clinical texts; extend from emotion categorization to diagnostic classification; explore alternative language models (e.g., domain-adapted transformers) and phrase-level inputs; increase sample size and diversity; and examine demographic and contextual moderators of performance.
Limitations
- Lay versus clinical definitions: participants may interpret "depression" and "anxiety" differently from DSM-based clinical practice, complicating direct comparisons with clinical questionnaires.
- Recall bias and temporal context: narratives describe past periods; recollection and current mood may distort reported experiences, and Phase 2 interpretations depend on understanding of Phase 1 narratives.
- Self-report dependence: both narratives/words and the adapted rating scales rely on self-report or reader inference, potentially limiting the completeness and accuracy of emotional characterization.
- Demographic stratification: no stratified analyses by gender, nationality, or education were performed; potential demographic effects on language use and model performance remain unexamined.
- Constraint to single-word descriptors: restricting responses to single words may miss multi-word symptoms (e.g., "eating less," "trouble concentrating"); future work should compare single words with short phrases.
- Professional subgroup size: the small number of healthcare professionals (N=34) limits conclusions about differences between professionals and non-professionals.
- Potential bias transfer: known biases in rating-scale-based diagnosis (e.g., demographic disparities) must not be replicated in NLP-based assessments; active mitigation and monitoring are required.
- Privacy and ethics: large-scale NLP on personal text raises privacy concerns; ethical deployment requires strict data governance and transparency.