Analyzing Memory Effects in Large Language Models through the Lens of Cognitive Psychology
Zhaoyang Cao, Lael Schooler, and Reza Zafarani
Memory is adaptive but fallible, and this study finds that cutting-edge language models echo many human memory quirks. Using classic human-memory paradigms, the authors show that LLMs display list-length and list-strength effects, associative interference, DRM-style false recognition, and cross-domain generalization, while differing in order sensitivity and in resilience to nonsense input.
Introduction
The study investigates whether and how large language models (LLMs) replicate core properties of human memory, including its well-documented adaptive errors. Grounded in Schacter’s taxonomy of memory “sins” and decades of cognitive psychology and neuroscience, the authors ask: Do LLMs exhibit human-like memory effects that arise from capacity limits, interference, order sensitivity, meaning, reconstructive processes (false memories), and generalization? Specifically, the research examines which adaptive features of human memory are mirrored by LLMs and where LLMs diverge, thereby illuminating the mechanisms of associative memory in artificial systems and informing future model design. Recent advances (e.g., GPT-4 and DeepSeek architectures) motivate this inquiry, as benchmarks suggest improved associative reasoning and retention capabilities. Rather than building a cognitive model of humans, the work uses human memory paradigms as an analytical lens to understand LLM behavior.
Literature Review
The paper situates its work within a rich literature on human memory: Schacter’s seven “sins” (transience, absent-mindedness, blocking, misattribution, suggestibility, bias, persistence) frame adaptive errors in memory. Foundational effects include list length/strength (capacity and competitive retrieval), the fan effect (associative interference and slowed retrieval), the nonsense effect (a disadvantage for semantically incoherent stimuli that lack scaffolding), serial position effects (primacy and recency), and DRM false memories (semantic activation of related but unseen lures). Classic and computational models (e.g., ACT-R, SEF; Ebbinghaus; Bartlett; Shiffrin & Ratcliff; Anderson) explain these phenomena via associative networks, interference, and gist vs. verbatim memory. Recent work such as Centaur demonstrates data-driven models that simulate broad human cognition but focuses largely on behavioral alignment rather than on specific memory mechanisms. Reviews on LLMs and cognition highlight similarities and differences in language processing and memory-like behavior, motivating targeted tests of LLM recognition and recall using canonical paradigms.
Methodology
Models and setting:
- Online LLM: GPT-4 (gpt-4-0125-preview), accessed via API; primary results are reported for this model, which performed strongest.
- Offline LLMs: Mistral-7B-Instruct-v0.3 and Llama-3-8B, evaluated for reproducibility and constrained inference.
Evaluation metrics:
- Recall accuracy = (TP + TN) / (TP + TN + FN + FP)
- False-alarm rate = FP / (FP + TN)
- Hit rate = TP / (TP + FN)
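These formulas map one-to-one onto code. Below is a minimal Python sketch (names are illustrative, not from the paper), where TP/FN count studied items answered “yes”/“no” and FP/TN count foils answered “yes”/“no”:

```python
def memory_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Recognition metrics as defined above (illustrative helper,
    not the authors' code)."""
    return {
        # Overall correctness across studied items and foils.
        "recall_accuracy": (tp + tn) / (tp + tn + fn + fp),
        # Proportion of foils incorrectly accepted as studied.
        "false_alarm_rate": fp / (fp + tn),
        # Proportion of studied items correctly recognized.
        "hit_rate": tp / (tp + fn),
    }

# Example: 30 hits, 2 misses, 3 false alarms, 29 correct rejections.
print(memory_metrics(tp=30, tn=29, fp=3, fn=2))
```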
Datasets and tasks:
1) Person–Location Lists (Dataset 1)
- Source: Augmented from Schneider & Anderson’s person-location pairs, using 40 persons and 40 locations (word lengths controlled). Facts follow “The <person> is in the <location>.”
- Lists: Study list (facts to memorize), foil list (non-studied facts created via rearrangements that preserve fan status), and question list (“Is the <person> in the <location>?”) comprising studied and non-studied items (see the construction sketch after this list).
- Sub-experiments:
• Experiment 1: List Length Effect. Varied study-list size from 32 to 40 pairs, with question lists twice the study-list length (64 to 80 items). Examined accuracy vs. memory load within fan-value groups.
• Experiment 2: List Strength Effect. Increased strength via repetition of the first study fact (five repetitions) in groups with fan=1, fan=2, and an integrated group; assessed impact on recognition of stronger vs. weaker items.
• Experiment 3: Fan Effect. Fixed study list length at 32; compared fan=1 (unique pairings) vs. fan=2 (each person and location appears twice). Measured accuracy changes with associative load.
• Experiment 4: Nonsense Effect. Replaced words with nonsensical tokens (letters/numbers) while preserving original length in (a) persons only, (b) locations only, (c) both, across study and question lists. Tested robustness to semantically meaningless input.
- Prompting: Task-specific prompts instructed models to output CSV answers labeled answers_csv, responding “yes” to studied facts and “no” to foils. Offline models received adapted prompts for compatibility.
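To make the list construction concrete, here is a minimal Python sketch of the fan=1 design under stated assumptions: the person/location vocabularies are placeholders, and the rotation trick for foils is one simple way to satisfy the “rearrangements preserving fan status” constraint, not necessarily the authors’ procedure.

```python
import random

def build_fan1_lists(persons, locations, n_facts=32, seed=0):
    """Fan=1 design: unique person-location pairings, rotated foils,
    and a question list of length 2 * n_facts. (Fan=2 would repeat
    each person and location across two facts; a faithful version
    must then also reject rearrangements that duplicate a pair or
    recreate a studied fact.)"""
    rng = random.Random(seed)
    ps = rng.sample(persons, n_facts)
    ls = rng.sample(locations, n_facts)
    study = [f"The {p} is in the {l}." for p, l in zip(ps, ls)]
    # Rotating the locations re-pairs every person with a studied
    # location it never appeared with, preserving fan status.
    foil_pairs = zip(ps, ls[1:] + ls[:1])
    questions = [f"Is the {p} in the {l}?" for p, l in zip(ps, ls)]
    questions += [f"Is the {p} in the {l}?" for p, l in foil_pairs]
    rng.shuffle(questions)
    return study, questions

persons = [f"person{i}" for i in range(40)]    # placeholder vocabulary
locations = [f"place{i}" for i in range(40)]   # placeholder vocabulary
study_list, question_list = build_fan1_lists(persons, locations)
```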
2) Target Words and Associates Lists (Dataset 2)
- Source: 12 target words with 15 ordered associates each (as per Roediger & McDermott). Lists ordered by associative strength.
- Sub-experiments:
• Experiment 5: Position Effect. Provided the first 6 lists (12 words each). Immediate recognition tests used 42-word probe sets (12 studied; 30 non-studied, comprising 6 critical lures, 12 unrelated words, and 12 weakly related words from positions 14–15 of other lists). Measured recall rate across serial positions.
• Experiment 6: DRM False Memories. Study phase: 8 associate lists (8 × 15 = 120 words). Test: 48 words (24 studied, drawn from positions 1, 7, and 15 of each studied list; 24 non-studied: 12 critical lures, one per list across the 8 studied and 4 non-studied lists, plus 12 items from the 4 non-studied lists). Old/New judgments with Remember/Know annotations for “Old” responses. Two conditions: immediate vs. delayed (brief math problems interposed). Hit and false-alarm rates were computed (see the assembly sketch after this list).
• Experiment 7: Generalization Ability. Assessed proportion of lure words from studied lists judged “old” as a measure of schema-consistent generalization.
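The Experiment 6 bookkeeping is easiest to verify in code. A hedged sketch, assuming each list is represented as a (critical_lure, associates) pair and reading “12 critical lures” as one per list across the 8 studied plus 4 non-studied lists:

```python
import random

def build_drm_test(studied, unstudied, seed=0):
    """48-item DRM test list: 24 studied words (positions 1, 7, 15 of
    each of the 8 studied lists) plus 24 non-studied words (one
    critical lure per list across all 12 lists, and 12 fillers drawn
    from the 4 non-studied lists). Illustrative, not the authors' code."""
    rng = random.Random(seed)
    # studied / unstudied: lists of (critical_lure, [15 ordered associates]).
    old = [assoc[i] for _, assoc in studied for i in (0, 6, 14)]   # 8*3 = 24
    lures = [lure for lure, _ in studied + unstudied]              # 8+4 = 12
    fillers = rng.sample([w for _, assoc in unstudied for w in assoc], 12)
    test = old + lures + fillers                                   # 48 items
    rng.shuffle(test)
    return test
```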
Procedure:
- Study phase: models memorize lists.
- Delay manipulation (Dataset 2): simple math tasks before recognition (delayed condition).
- Test phase: recognition decisions per item; outputs compared to ground truth.
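As a sketch of this study-then-test flow against the online model: the OpenAI client call below is the standard openai>=1.0 API, but the prompt wording and the answers_csv contract are paraphrased from the paper’s description, not copied from it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_recognition(study_list, question_list,
                    model="gpt-4-0125-preview"):
    """One study-then-test episode: the model memorizes the study
    list, then answers yes/no for every probe as a single CSV line."""
    prompt = (
        "Memorize the following facts:\n" + "\n".join(study_list)
        + "\n\nFor each question below, answer 'yes' if the fact was "
        "studied and 'no' otherwise. Reply with one CSV line labeled "
        "answers_csv.\n" + "\n".join(question_list)
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # as deterministic as the API allows
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```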
Controls and design details:
- Fan groups: Fan=1 group uses unique person-location pairs; Fan=2 group uses repeated associations; integrated groups mix fan levels.
- Question list size set to twice study list length in Dataset 1 (studied + rearranged foils).
- Nonsense replacements applied consistently to both study and test materials.
- Offline models: parameters frozen, so repeated prompts yield identical outputs; prompts were minimally adjusted for Dataset 2 and more substantially for Dataset 1.
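That determinism follows from frozen weights plus greedy decoding. A minimal Hugging Face transformers sketch (the model ID is real; the prompt and generation settings are illustrative):

```python
from transformers import pipeline

# Frozen weights plus greedy decoding (do_sample=False) make repeated
# runs of the same prompt produce identical outputs.
generate = pipeline("text-generation",
                    model="mistralai/Mistral-7B-Instruct-v0.3")
prompt = "Memorize the following facts: ..."  # task prompt goes here
out = generate(prompt, do_sample=False, max_new_tokens=256,
               return_full_text=False)
print(out[0]["generated_text"])
```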
Key Findings
Dataset 1 (Person–Location Lists):
- Fan effect: Recognition accuracy declines as fan value increases (associative load). At fixed list size (32), accuracy decreased from ~0.991 to ~0.915 when increasing fan, indicating greater interference.
- List length effect: Accuracy generally decreases as the number of studied facts grows from 32 to 40, consistent with capacity limits/interference.
- List strength effect: Repetition strengthens items; stronger (repeated) items show higher accuracy than weaker items. Accuracy for weaker items remains largely unaffected by the presence of stronger ones (negative/missing list-strength effect in recognition), paralleling human recognition findings.
- Nonsense effect: Replacing persons, locations, or both with nonsensical tokens has minimal impact on LLM recognition accuracy, indicating robustness to semantically meaningless inputs.
Dataset 2 (Targets and Associates):
- Hit rate for studied items: 1.000 for both immediate and delayed (post-arithmetic) tests.
- False-alarm rate for non-studied items: 0.053.
- Critical lure (DRM) “old” proportion: 0.114, evidencing false recognition driven by semantic activation.
- Position effect: LLMs recalled/recognized all studied words perfectly across positions (accuracy = 1), showing no primacy/recency pattern, unlike humans’ U-shaped serial position curves.
Online vs. offline models:
- Online GPT-4 outperforms the offline Mistral and Llama models across effects; the offline models show similar qualitative trends but with more fluctuation (e.g., in the list-length and list-strength effects).
- DRM: Offline models achieve a hit rate of 1.0, similar to GPT-4, but exhibit near-zero lure acceptance (poor generalization), classifying critical lures as new.
- Position: Offline models also show position-invariant recognition (perfect across positions).
Overall alignment/divergence (Table 2 summary):
- Alignments: list length, list strength (recognition), fan effect, DRM-style false memories, and (partially) cross-domain generalization.
- Divergences: superior robustness to nonsense inputs; minimal sensitivity to serial position (lack of primacy/recency).
Discussion
Findings indicate that LLMs, despite architectural differences from biological memory, exhibit human-like patterns under several canonical recognition paradigms. Decreases in accuracy with increased list length and associative fan suggest interference-like mechanisms. Strength manipulations reveal recognition dynamics akin to human data where stronger items benefit without harming weaker items, aligning with the “missing” list-strength effect in recognition. The DRM results show reconstructive memory behavior: semantic activation leads to false recognition of critical lures, mirroring human false memory formation.
Divergences provide insight into LLM mechanisms. Position insensitivity implies that LLM recognition within the tested prompt and context does not rely on serial order in the way human free recall does, likely reflecting attention and context representations that are less constrained by temporal encoding. Robustness to nonsense tokens suggests heavy reliance on structural task cues and prompt-driven matching rather than semantic scaffolding; models can perform recognition by surface-form matching or learned mapping even when semantic content is degraded. Differences between online and offline models—especially offline models’ poor lure generalization—highlight the role of model scale, training data, and inference-time behaviors in producing human-like reconstructive errors.
Collectively, results support using human memory paradigms to probe LLMs, clarifying where associative reconstruction in LLMs aligns with cognition and where architectural biases lead to distinct error profiles. These insights can guide cognitive-modeling applications and inform the design of LLMs that better capture adaptive, reconstructive properties of memory.
Conclusion
This work systematically applies seven canonical human memory paradigms to evaluate LLMs’ memory-like behaviors. It demonstrates substantial alignment with human patterns (list length/strength, fan effect, DRM false memories) and key divergences (position insensitivity, robustness to nonsense). GPT-4 shows the strongest performance, while offline models replicate trends but struggle with generalization to critical lures. The study underscores the utility of cognitive psychology experiments as diagnostics for LLM memory mechanisms and suggests avenues for improvement: incorporating positional encoding or training regimes that better capture primacy/recency, exploring architectures that balance gist and verbatim-like traces, and extending tests beyond recognition to more naturalistic recall tasks. Future work should (a) quantify reaction-time analogs or computational cost as a proxy for retrieval dynamics, (b) broaden stimuli (texts, images, multimodal) and tasks (free recall, cued recall), and (c) investigate training or inference augmentations that yield more human-like generalization and error tradeoffs.
Limitations
- Model scope: Primary analyses center on GPT-4 with two offline models; results may not generalize to all LLMs or future variants.
- Task modality: Experiments emphasize recognition over free recall; human primacy/recency effects are prominent in free recall paradigms.
- Prompt sensitivity: Performance depends on task-specific prompt engineering; slight changes could alter outcomes.
- Constructed datasets: Person–location and associate lists are constrained laboratory stimuli; ecological validity for real-world memory tasks may be limited.
- Metrics and outputs: Offline models with frozen parameters reduce variability across trials; lack of reaction-time measures limits comparison with human speed–accuracy tradeoffs.
- Generalization assessment: Lure acceptance rates reflect a narrow slice of “schema” generalization; broader cross-domain tests could yield different patterns.