Narrative coherence in neural language models

Computer Science

A. Acciai, L. Guerrisi, et al.

Neural language models are probed for a human-like 'personal identity' by analyzing autobiographical narratives: the researchers elicited 2,010 stories from GPT-3.5 and 2,010 from GPT-4, rated narrative coherence with the NaCCS, and fine-tuned an automatic evaluator on psychologists' annotations. Across all 4,020 stories, both models showed human-like narrative coherence, with GPT-4 scoring slightly higher.
Introduction

Neural Language Models (NLMs), enabled by Transformer architectures, are unique non-biological entities exhibiting cognitive-like performance in language tasks. Despite being engineered systems, the internal mechanisms underlying such abilities remain opaque, warranting scientific investigation similar to that applied to complex natural systems. This has motivated the emerging program of machine psychology, which applies psychological methods to assess NLMs. Rather than adapting test-based instruments (e.g., questionnaires, decision-making, or problem-solving tasks), this work leverages NLMs' generative capacity to produce spontaneous autobiographical narratives and assesses them with the NaCCS, a psychological tool that quantifies narrative coherence across Context, Chronology, and Theme. The research question asks whether NLMs' narratives show coherence distributions indistinguishable from humans' and whether coherence components are modulated similarly across age, mood, and gender. Establishing such coherence would imply that NLMs can simulate a robustly integrated personal identity in narrative output, offering insights into their knowledge integration and linguistic-textual production beyond sentence-level correctness.

Literature Review

Narrative coherence is a hallmark of human language and cognition, requiring complex functions such as temporal reasoning, perspective-taking, and thematic elaboration. The NaCCS (Reese et al., 2011) provides a multidimensional framework—Context, Chronology, Theme—with scores from 0–3 per dimension, summing to a global coherence score. Prior work links narrative coherence to psychological health, autobiographical memory, identity formation, communication, and theory of mind capacities (McLean et al., 2010; Waters & Fivush, 2015; Lilgendahl & McAdams, 2011; Lind et al., 2020; McCabe & Peterson, 1991). NaCCS integrates classic accounts of global coherence (Labov, 1972; Baerger & McAdams, 1999). In humans, coherence develops through adolescence and may decline after age 50 (Harter & Leahy, 2001; Friedman, 2004; Reese et al., 2011). Mood influences coherence, with negative mood sometimes eliciting greater thematic processing (Vanderveren et al., 2019, 2020). Machine psychology has begun probing NLMs’ cognitive-like behaviors (Hagendorff, 2023; Binz & Schulz, 2023; Kosinski, 2023), including theory of mind tasks, causal and analogical reasoning, and decision-making. This study situates itself within these literatures by applying NaCCS to NLM-generated autobiographical narratives.
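The NaCCS scoring structure described above (three dimensions, each rated 0–3, summed into a global coherence score ranging 0–9) can be sketched as follows; the class and field names are illustrative and not drawn from the coding manual:

```python
from dataclasses import dataclass

@dataclass
class NaCCSRating:
    """Sketch of a single NaCCS rating: three dimensions, each scored 0-3."""
    context: int
    chronology: int
    theme: int

    def __post_init__(self):
        # Each dimension is rated on a 0-3 scale (Reese et al., 2011).
        for score in (self.context, self.chronology, self.theme):
            if not 0 <= score <= 3:
                raise ValueError("each NaCCS dimension is rated 0-3")

    @property
    def global_coherence(self) -> int:
        """Global coherence: the sum of the three dimensions, range 0-9."""
        return self.context + self.chronology + self.theme
```

For example, a story scoring 2 on Context, 1 on Chronology, and 3 on Theme receives a global coherence of 6.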

Methodology

Design: A multiple-factor experimental design systematically manipulated the independent variables age, gender, and mood, comparing GPT-3.5 and GPT-4. The Narrative Coherence Coding Scheme (NaCCS) scores three dimensions (Context, Chronology, Theme; each 0–3).

Story Generation: Autobiographical narratives were elicited via structured role-play prompts tailored to four age ranges R = {Child (C), Teenage (T), Midlife (M), Adult (A)}. Variables included gender g ∈ {male, female}, name n (from two sets N_Q and N_O, each of size 5, mapped to gender), mood m ∈ {positive (+), neutral (=), negative (−)}, and age y drawn from the sets Y_C = {3, 4, 5, 6, 8, 11}, Y_T = {12, 14}, Y_M = {20, 24, 36}, Y_A = {52}. Messages comprised a multi-turn dialogue culminating in a request for a ~300-word autobiographical episode under the specified mood induction. Sampling was balanced across combinations, with age-specific repetition counts (N_r) to mitigate non-uniformity in the y values; each model produced 2,010 stories, for a total of 4,020 across GPT-3.5 and GPT-4.

Automatic Coherence Evaluation: A pilot set of 96 stories per model (192 in total) was rated by certified psychologists using the NaCCS, forming pairs (c, δ(c)), where δ returns the score vector. From these, a textual evaluation format was constructed and used to fine-tune GPT-3.5 as an automatic NaCCS evaluator. The fine-tuned evaluator produced discursive outputs that were decoded into numerical scores via a designed parsing function, enabling automated assignment of Context, Chronology, and Theme scores to all 4,020 stories.

Statistical Analysis: One-way and two-way ANOVAs tested the effects of age (a), gender (g), and mood (m), and their interactions (e.g., a:m), followed by Bonferroni-corrected post hoc tests. Significance threshold: α = 0.05.
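The parsing step that decodes the evaluator's discursive output into numeric scores is not reproduced in this summary; a minimal sketch, assuming the evaluator mentions each dimension together with its score (e.g. "Context: 2") somewhere in its text, might look like:

```python
import re

# NaCCS dimensions to extract from the evaluator's free-form output.
DIMENSIONS = ("Context", "Chronology", "Theme")

def parse_naccs_scores(evaluation_text: str) -> dict[str, int]:
    """Extract a 0-3 score for each NaCCS dimension from discursive text.

    Assumes each dimension appears as '<Dimension>: <digit>' (or with '='),
    which is an illustrative format, not the paper's actual one.
    """
    scores = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}\s*[:=]\s*([0-3])",
                          evaluation_text, re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for dimension {dim!r}")
        scores[dim] = int(match.group(1))
    return scores
```

Applied to an output such as "Context: 2. Chronology: 1. Theme: 3.", this yields {"Context": 2, "Chronology": 1, "Theme": 3}, from which the global coherence score is obtained by summation.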

Key Findings

Overall Coherence: Across 4,020 narratives, both GPT-3.5 and GPT-4 achieved human-comparable coherence, with GPT-4 consistently higher. Compared to human data, average coherence was 4.7 (SD 1.0) for GPT-3.5, 6.1 (SD 0.3) for GPT-4, and 4.6 (SD 2.3) for humans (Table 5).

Age Effects: Age had significant effects across dimensions for all models: Context F(3,4016) = 346.905, p < 0.001; Chronology F(3,4016) = 146.147, p < 0.001; Theme F(3,4016) = 36.105, p < 0.001 (Table 2). Both models showed a downward trend in coherence from Child to Adult, more pronounced in GPT-3.5 [F(3,2006) = 322.16, p < 0.001] than in GPT-4 [F(3,2006) = 48.18, p < 0.001]. Mean coherence by age (all GPTs, Table 3): Child 5.90; Teenage 5.63; Midlife 4.58; Adult 4.26. GPT-4 maintained higher coherence across ages (e.g., Child 6.34; Adult 5.47) than GPT-3.5 (Child 5.46; Adult 3.05).

Mood Effects: Mood significantly influenced coherence, with negative mood yielding the largest increases. For all models, overall coherence F(2,4017) = 23.63, p < 0.001; Theme F(2,4017) = 147.58, p < 0.001 (Table 2). Mean coherence (all GPTs, Table 3): negative mood 5.51 versus neutral 5.10 and positive 5.12; GPT-4 showed stronger mood effects (negative 6.40; positive 5.89; neutral 5.75).

Gender Effects: No significant differences emerged by gender across dimensions (e.g., Context F = 0.564, p = 0.452; Chronology F = 0.074, p = 0.786; Theme F = 0.476, p = 0.490; Table 2).

Interactions: The age–mood interaction was significant for all GPTs across overall coherence and the individual dimensions: overall coherence F(6,4008) = 5.37, p < 0.001; Context F = 22.04, p < 0.001; Chronology F = 7.26, p < 0.001; Theme F = 23.88, p < 0.001 (Table 4).

Model Comparison: GPT-4 outperformed GPT-3.5 across dimensions and conditions (e.g., mean Context 2.2 vs 1.7; Chronology 1.5 vs 1.0; Theme 2.4 vs 2.0; Table 5).

Dimension Trends: Theme consistently scored highest and Chronology lowest. Context and Chronology declined with simulated age, especially from Midlife to Adult; Theme was more robust and strongly modulated by mood.
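The Bonferroni-corrected post hoc comparisons reported in the methodology adjust for multiple testing by scaling each raw p-value by the number of comparisons (capped at 1). The authors' statistics code is not available; this stdlib sketch only illustrates the adjustment:

```python
def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Bonferroni correction: multiply each raw p-value by the number
    of comparisons, capping the result at 1.0."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Three raw p-values from hypothetical pairwise tests:
adjusted = bonferroni_adjust([0.125, 0.25, 0.5])  # [0.375, 0.75, 1.0]
```

Equivalently, one can keep the raw p-values and lower the threshold: with four age groups there are 4·3/2 = 6 pairwise comparisons, so a raw p-value must fall below 0.05/6 ≈ 0.0083 to remain significant at α = 0.05.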

Discussion

The findings directly address the hypotheses. H1a is supported: GPT-3.5 and GPT-4 produce autobiographical narratives with coherence levels comparable to human samples, indicating that NLMs’ text generation is not merely syntactically impeccable but also globally coherent. H1b is largely supported at the dimensional level: NLMs show meaningful modulation across Context, Chronology, and Theme, though age-related trends diverge from human developmental curves (inverse pattern with higher coherence in younger simulated ages and declines at older simulated ages). H2a shows that simulated age influences coherence components in NLMs, albeit inversely to human developmental data, consistent with persona-based performance downgrading. H2b shows mood—particularly negative mood—enhances overall and thematic coherence, aligning with human findings on mood’s role in autobiographical processing. H2c confirms no gender effect, matching the human literature. The significance lies in demonstrating that NLMs can simulate integration of knowledge and perspective-taking consistent with narrative coherence metrics, suggesting sophisticated internal representations supporting multi-sentence discourse coherence. GPT-4’s stronger results further indicate scaling and model quality improve global coherence and thematic elaboration, reinforcing the view that Transformer-based NLMs exhibit emergent cognitive-like narrative capacities.

Conclusion

This study contributes a large-scale, psychologically grounded evaluation of narrative coherence in NLMs using NaCCS, combining human-rated pilot data with an automated evaluator to score 4,020 autobiographical narratives from GPT-3.5 and GPT-4. Both models, especially GPT-4, achieve coherence comparable to or exceeding human averages, with mood (particularly negative) enhancing coherence and gender showing no effect. Simulated age modulates coherence inversely to human developmental trends, highlighting persona-simulation effects. These results support the view that NLMs can produce narratively coherent text indicative of substantial knowledge integration. Future research should: expand to more models and modalities; refine and validate automated NaCCS evaluators across diverse corpora; probe causal mechanisms of coherence; investigate stability and variability of simulated personas; and explore theoretical implications for machine psychology and potential forms of artificial narrative identity with appropriate caution.

Limitations

Interpretation requires caution due to anthropomorphism: NLMs lack lived time and experience, so narrative coherence reflects simulation rather than personal identity. The automated evaluator was fine-tuned on a relatively small human-rated set (192 stories), potentially introducing bias or overfitting; parsing discursive outputs into numeric scores may add error. The study uses only GPT-3.5 and GPT-4, limiting generalizability. Age effects in models diverge from human developmental trajectories, complicating direct comparisons. Prompting choices, name/gender mappings, and sampling strategies may influence outputs. All narratives are model-generated, and coherence scoring relies on a model-based evaluator, which could systematically differ from human raters.
