The language of happiness in self-reported descriptions of happy moments: Words, concepts, and entities

Linguistics and Languages

A. Moreno-Ortiz, C. Pérez-Hernández, et al.

Explore the fascinating language of happiness in this insightful study by Antonio Moreno-Ortiz, Chantal Pérez-Hernández, and María García-Gámez. Utilizing text analytics on the HappyDB corpus, the research uncovers how sentiment words and semantic classes reveal our sources of happiness, shining a light on the influence of commercial products and services. Dive in to discover the linguistic expressions that shape our joy!

Introduction
The study investigates how happiness is expressed linguistically in English self-reports, focusing on two issues: (1) the actual role and prevalence of sentiment-laden words in descriptions of happy moments, and (2) the identification of the sources of happiness as reflected in concepts and entities. Traditional self-reports of happiness face recall and cultural/linguistic comparability issues. With social media and large user-generated text, sentiment analysis often relies on sentiment lexicons and classifiers, but these have limitations: coverage gaps, lack of multiword expressions, context valence shifters (negation, irony, sarcasm), and the fact that happy events can be described without overt positive words. The paper positions itself within prior psychological and linguistic work (e.g., Seligman’s tripartite model of happiness; roles of social relationships, work, leisure; distinctions between experiencing vs. remembering selves) and emphasizes that concepts of happiness are language-specific (English focus). The aims are not to measure emotions per se, but to understand the verbal expression of happiness via quantitative evidence, identifying and quantifying linguistic elements such as entities and activity types. The study addresses three research questions: (1) What is the relevance of sentiment-laden words in happy moment descriptions? (2) Do different happiness categories use different types of sentiment words? (3) What sources of happiness are mentioned, and how are they linguistically materialized?
Literature Review
The paper reviews sentiment analysis resources and approaches used to detect polarity and emotions in text, including lexicons (Harvard General Inquirer, MPQA, Bing Liu’s Opinion Lexicon, SentiWordNet, SO-CAL, EmoLex, VADER, SenticNet) and machine learning methods, notably Transformer-based models (e.g., BERT). It details key limitations of lexicon-based methods: coverage gaps (especially for multiword expressions), lemma/form/stem format issues, and context effects via valence shifters and discourse phenomena (irony, sarcasm), which hinder accurate sentiment detection. It further notes that positive events can be described without explicit positive words. Prior work on HappyDB includes LIWC and VAD analyses, the CL-Aff shared task on affective content, and studies linking demographic features to happiness categories. Psychological literature cited includes Seligman’s authentic happiness framework, distinctions between experiencing and remembering selves (Kahneman), and research on social, work, and leisure correlates of happiness, as well as cross-linguistic semantic differences in happiness concepts (Wierzbicka).
Methodology
Corpus and data preparation: The study uses the HappyDB corpus (original n=100,922 happy moments). The authors identified substantial noise (copy-pasted texts, duplicates, long paragraphs/essays, low-literacy entries) and performed semi-automatic cleansing, yielding 91,608 moments. HappyDB’s seven categories (ACHIEVEMENT, AFFECTION, BONDING, ENJOY THE MOMENT, EXERCISE, LEISURE, NATURE) were assigned via a logistic regression classifier trained on 15,000 manually labeled moments (reported F1s range 0.54–0.92). The study also summarizes prior sentiment analyses (LIWC; VAD). Basic text statistics were computed with spaCy (en_core_web_lg) and Pandas.
Sentiment analysis: Two approaches were applied. (1) Machine learning: HuggingFace Transformers pipeline with DistilBERT fine-tuned on SST-2 provided binary polarity (POSITIVE/NEGATIVE) with confidence scores; cases with confidence <0.9 were treated as NONE for comparability. (2) Lexicon-based: Lingmotif 2.0, a wide-coverage, manually curated English sentiment lexicon (28k single-word forms, 38k multiword expressions) with sentiment shifters, accessed via Python/REST. The lexicon-based method outputs polarity and lists of sentiment items. Comparisons assessed overall and per-category positivity rates and examined frequent positive/negative items.
Keyword extraction and clustering: To identify sources of happiness, the authors extracted keywords per category using PyTextRank (TextRank variant), filtering artifacts (e.g., “amazon”, “mechanical”, “turk”, punctuation). Outputs were aggregated over batches due to spaCy’s 100k character Doc limit; scores were summed across batches. Top-n keywords were vectorized using spaCy’s en_core_web_lg (300-d embeddings). Noise words were removed (e.g., collapsing “a new cell phone” to “cell phone”). K-means clustering was applied within each category’s keyword vectors, with the number of clusters guided by the Silhouette Coefficient and adjusted via qualitative inspection.
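The batch aggregation step described above (summing keyword scores across batches) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `aggregate_keyword_scores`, the example phrases and scores, and the rule of dropping any phrase that contains an artifact word are all assumptions; only the artifact words themselves come from the text.

```python
from collections import Counter

# Artifact words named in the text. Dropping every phrase that contains
# one of them is an assumption of this sketch.
ARTIFACTS = {"amazon", "mechanical", "turk"}


def aggregate_keyword_scores(batches):
    """Sum keyword scores across batches, skipping artifact phrases."""
    totals = Counter()
    for batch in batches:
        for phrase, score in batch:
            key = phrase.lower().strip()
            if ARTIFACTS & set(key.split()):
                continue  # phrase contains a crowdsourcing artifact
            totals[key] += score
    return totals


# Hypothetical (phrase, score) pairs, one list per batch.
batch_scores = [
    [("cell phone", 0.12), ("new job", 0.09), ("Mechanical Turk", 0.30)],
    [("cell phone", 0.10), ("birthday", 0.07)],
]

totals = aggregate_keyword_scores(batch_scores)
top_keywords = [phrase for phrase, _ in totals.most_common(2)]
print(top_keywords)
```

Summing rather than averaging means a phrase that recurs across batches rises in the final ranking, which fits the description of scores being summed across batches.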
2D visualizations used t-SNE (and UMAP for global visualization), and clusters were manually labeled for conceptual consistency.
Named Entity Recognition (NER) and entity analysis: A hybrid approach combined neural, transition-based NER (per Lample et al., 2016) with morpho-syntactic pattern matching (e.g., proper noun sequences). Entities were categorized (ORG, UNK, GPE, PERSON, PRODUCT, WORK_OF_ART, LOC, FAC, EVENT). Entity counts, unique entities, and entity-per-moment ratios were computed by category, and top entities were interpreted in relation to keyword clusters.
Statistical analysis and visualization: A chi-square goodness-of-fit test assessed the per-category distribution of sentiment errors; embedding visualizations used the TensorFlow Embedding Projector (UMAP).
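As a rough illustration of the chi-square goodness-of-fit test mentioned above, the statistic can be computed by hand. The counts below are hypothetical and are not the paper's data, and the null hypothesis used here (errors spread in proportion to category size) is an assumption, since the summary does not state how expected counts were derived.

```python
def chi_square_gof(observed, expected):
    """Return the chi-square goodness-of-fit statistic and its degrees of freedom."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical counts for seven categories (not the paper's data).
category_sizes = [100, 100, 100, 100, 100, 100, 100]
observed_errors = [20, 20, 20, 40, 40, 40, 30]

# Assumed null hypothesis: errors are spread in proportion to category size.
total_errors = sum(observed_errors)
total_size = sum(category_sizes)
expected_errors = [total_errors * n / total_size for n in category_sizes]

statistic, dof = chi_square_gof(observed_errors, expected_errors)
print(round(statistic, 2), dof)
```

With seven categories the test has six degrees of freedom. A p-value would come from the chi-square distribution, for example via `scipy.stats.chisquare`, which is left out here to keep the sketch dependency-free.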
Key Findings
- Dataset and basic metrics: After cleansing, n=91,608 happy moments. Length and lexical POS distributions showed high variance in tokens/words and a relatively low average number of adjectives, suggesting limited explicit evaluative language.
- Sentiment classification: When harmonizing outputs (treating low-confidence ML predictions as NONE), both DistilBERT and Lingmotif yielded similar proportions of POSITIVE items overall, but far fewer than expected for a corpus of happy moments. Many moments lacked positive words or contained contextually negative terms used in narratives of positive experiences.
- Per-category sentiment differences: Chi-square goodness-of-fit showed highly significant differences in error (non-positive classification) rates across categories: χ²(6)=423.10, p<0.001. AFFECTION, BONDING, and ACHIEVEMENT exhibited lower error rates (more positive words), whereas LEISURE, ENJOY THE MOMENT, EXERCISE, and NATURE had higher error rates, indicating greater use of non-sentiment or contextually negative language in describing happiness.
- Sentiment lexicon items: Frequent positive words were generic (happy, good, nice, enjoy, love). Category-specific positive items aligned with topical content (e.g., work/school in ACHIEVEMENT; affectionate expressions in AFFECTION; friendship terms in BONDING; indulgent foods in ENJOY THE MOMENT; health/fitness in EXERCISE; entertainment in LEISURE; weather in NATURE). Negative items were more content-specific (e.g., sick, stress, tired), not simply antonyms of positive terms, supporting the notion of negativity bias and richer descriptive detail for negative vocabulary.
- Keyword clusters (sources of happiness): Across categories, 42 semantic classes were identified; 10 were shared by two or more categories (e.g., Communication & media, Entertainment products, Feelings, Food, Games, Goals, School, Work, Time, Misc.). ACHIEVEMENT (12 clusters) emphasized consumer products (mobile/electronics, vehicles, household), entertainment products, work/school, food, money, health, and goals. AFFECTION (9 clusters) focused on close/extended family, celebrations, family activities/trips, leisure, food, time, work/school, and feelings. BONDING highlighted friends, food hangouts, vacations, celebrations, school, entertainment, and communication/media. ENJOY THE MOMENT emphasized treats/comfort food, entertainment, meals/groceries, work, events/activities, with minimal references to people. EXERCISE focused on exercise types/muscles, workouts, yoga/meditation, places/time, goals, weight/diet, distance, feelings. LEISURE centered on entertainment (movies, TV, games) with smaller food/music classes. NATURE emphasized weather, landscapes (flora/fauna), and outdoor activities.
- Named entities: Total entities=17,725; unique=10,716. Entity-per-moment ratios varied strongly: LEISURE 0.42 (highest), EXERCISE 0.04 (lowest); mean ≈0.19; NATURE 0.11. ORG was the most common label, often reflecting consumer brands/services; LEISURE had high ORG/PRODUCT/WORK_OF_ART, suggesting leisure happiness relies heavily on consumer products and media. ACHIEVEMENT entities included popular tech devices/brands (e.g., iPad, PlayStation 4, Kindle, MacBook Pro), e-commerce (eBay, Amazon), social media (Instagram, Facebook, YouTube), games (World of Warcraft, League of Legends), and food brands/restaurants (Starbucks, Taco Bell). AFFECTION/BONDING entities aligned with celebrations (Valentine’s Day, Mother’s Day) and travel destinations (Disneyland, Grand Canyon). ENJOY THE MOMENT/LEISURE referenced sports events (Olympics, World Cup), festivals, fast-food chains, social media, and streaming platforms (Netflix), with titles of shows/films (Star Wars, Walking Dead, Game of Thrones). NATURE entities emphasized places/parks and nature-related facilities.
- Overall: Happiness descriptions often rely on neutral or even negative lexical items within narratives, not merely on positive words. Sources of happiness span social/familial events, personal achievements/activities, entertainment/consumer goods, food, and nature, with notable cross-category prominence of “newness” and food.
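The harmonization mentioned in the sentiment classification finding (treating classifier predictions with confidence below 0.9 as NONE) can be sketched as below. The function name `harmonize` and the example predictions are hypothetical; the 0.9 threshold and the POSITIVE/NEGATIVE/NONE labels come from the text.

```python
def harmonize(label, score, threshold=0.9):
    """Keep a POSITIVE/NEGATIVE label only when the confidence reaches the threshold."""
    return label if score >= threshold else "NONE"


# Hypothetical classifier outputs, one dict per happy moment.
predictions = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.72},
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "NEGATIVE", "score": 0.60},
]

labels = [harmonize(p["label"], p["score"]) for p in predictions]
positive_rate = labels.count("POSITIVE") / len(labels)
print(labels, positive_rate)
```

The input shape mirrors the list of label/score dictionaries that a Hugging Face text-classification pipeline typically returns, so the same function could in principle be applied to real model output. Mapping low-confidence cases to a third label is what makes the binary classifier comparable with a lexicon-based system that can report no polarity.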
Discussion
Findings address the research questions as follows: (1) Sentiment-laden words have limited explanatory power for happiness descriptions; many happy moments lack explicit positive lexemes or include negative terms as narrative context, leading to lower-than-expected positive classifications by both ML and lexicon-based systems. (2) Differences in sentiment word usage across categories are evident: frequently used positive terms are generic, while negative terms are more specific and richly descriptive, consistent with negativity bias. Negative vocabulary often highlights category-specific barriers or threats (e.g., injury, debt, tiredness) that frame or enhance the salience of positive outcomes. (3) Sources of happiness, identified through keyword clusters and NER, reveal distinct conceptual classes per category and shared classes across categories, supporting a dual view of happiness: external (socially/culturally mediated, e.g., family/friends, celebrations, entertainment, places) and internal (individual accomplishments, exercise, nature experiences). Named entity results underscore the prominence of consumer products, services, and media in happiness narratives, suggesting a strong consumerist/material dimension to perceived happiness. These insights emphasize that automated happiness measurement should consider conceptual and entity-level signals beyond sentiment lexemes and leverage clustering and NER to capture sources of happiness.
Conclusion
The study provides methodological foundations for analyzing the language of happiness in English self-reports. It shows that positive lexical items are an inadequate sole proxy for happiness, as happy moments often rely on neutral or even negative language within narrative contexts. By extracting and clustering keywords and analyzing named entities, the study identifies compact, semantically coherent classes that represent sources of happiness across categories, including family/friends, entertainment/consumer products, work/school, exercise, nature, and especially food and “newness” (new products/experiences/places). The strong presence of commercial products and services highlights a consumerist dimension in happiness narratives. Future research could: (a) refine cross-linguistic comparisons to assess cultural semantics of happiness; (b) improve handling of multiword expressions, contextual shifters, and discourse phenomena (irony/sarcasm); (c) integrate richer entity linking and product taxonomy mapping; (d) develop hybrid models that combine sentiment, conceptual classes, and entities for more accurate happiness characterization and prediction; and (e) address demographic biases and validation against external well-being measures.
Limitations
- Sampling and demographic bias: HappyDB was crowdsourced via MTurk, skewed toward US residents and younger age groups (20–40), with unknown native-speaker and socioeconomic status. These factors may influence vocabulary and activity types described.
- Data quality: Presence of copy-pasted/duplicate content and long essays required cleansing; residual noise may remain. Automatic category labels (logistic regression) have variable accuracy (e.g., F1=0.54 for ENJOY THE MOMENT), potentially affecting downstream analyses.
- Language scope: The study focuses on English; findings and conceptual classes are not assumed to generalize cross-linguistically.
- Sentiment analysis constraints: Lexicon coverage gaps, limited multiword expression handling (though Lingmotif includes many MWEs), and context/discourse phenomena (negation, irony, sarcasm) can misclassify polarity. ML classifiers act as black boxes, limiting interpretability.
- Keyword/cluster methodology: Choices of top-n, clustering parameters, and manual labeling introduce subjectivity; some classes (miscellaneous) indicate unclustered or semantically diffuse items. Embedding quality and domain mismatch may affect cluster compactness.
- Entity analysis: NER label granularity (e.g., ORG conflating companies with brands/services) may blur distinctions; entity linking to knowledge bases was not performed, limiting semantic normalization.