Automating psychological hypothesis generation with AI: when large language models meet causal graph

Psychology

S. Tong, K. Mao, et al.

This study by Song Tong, Kai Mao, Zhen Huang, Yukun Zhao, and Kaiping Peng presents a method for generating psychological hypotheses by merging causal knowledge graphs with large language models. The researchers produced novel hypotheses on well-being whose rated novelty rivals that of junior experts, pointing the way toward automated discovery in psychology.

Introduction
The study addresses how to automate psychological hypothesis generation by integrating large language models (LLMs) with causal graphs. Psychology’s traditional reliance on theory-driven, expert-intensive methods limits the speed and breadth of discovery. LLMs can extract causal information from large text corpora, while causal graphs provide a systematic, interpretable structure to organize and reason over this knowledge. The research proposes building a comprehensive causal network of psychological concepts from tens of thousands of papers and using link prediction to generate novel, testable hypotheses, with a focus on well-being. The purpose is to evaluate whether combining LLMs with causal graphs can produce hypotheses matching human junior experts in novelty and surpass LLM-only outputs, thereby offering a scalable, data-driven complement to conventional hypothesis generation.
Literature Review
Background work highlights psychology’s contributions to and inspiration from AI (e.g., brain-like hierarchies and attention). Traditional psychological research, especially in social psychology, advances more slowly due to theory-driven methodologies and expert bottlenecks. Recent advances show LLMs can infer causality from text and that integrating causal graphs with LLMs improves causal reasoning and interpretability. Prior interdisciplinary efforts (e.g., physics) used large-scale text mining and semantic networks to forecast research directions. Causal graphs in psychology offer unified frameworks for multicausal modeling but are labor-intensive and require multidisciplinary expertise. This study builds on these insights to unify LLMs’ semantic extraction with causal graphs’ systemic reasoning to support automated, novel hypothesis generation in psychology.
Methodology
Framework: The LLM-based causal graph (LLMCG) approach comprises three stages: (1) literature retrieval, (2) causal pair extraction, and (3) hypothesis generation via link prediction. Illustrative code sketches for the key steps appear at the end of this section.

1) Literature retrieval: Using the PMC Open Access Subset (more than 2 million XML articles), the authors filtered psychology/neuroscience articles by title/abstract keywords (e.g., "psychol") and metadata, curating roughly 140k candidates. After a cost analysis, 43,312 psychology articles whose journal titles included "Psychol" (journals such as Frontiers in Psychology and BMC Psychology; years 1975 to 2023) were selected. Text was extracted from PDFs using PyPDF2, concatenating titles, abstracts, and main text; reference sections were removed via regex heuristics to exclude non-relevant content.

2) Causal knowledge extraction: Articles were segmented into chunks of at most 4,000 tokens to fit GPT-4's context limit. Prompt engineering guided GPT-4 to extract concept pairs with relationship type (causality vs. correlation) and polarity (positive vs. negative) in JSON format, excluding relationships merely cited from other studies. Outputs were filtered and standardized, and GPT-4 also assisted in verifying relevance, causality, and directionality. The pipeline respected API rate limits (about 60 requests and 150k tokens per minute). An exploratory validation with four graduate students assessed GPT-4's discrimination of causality versus correlation: of 289 identified relationships, 87.54% were validated; when GPT-4 labeled a relationship as causal, evaluators agreed it was causal in 65.55% of cases (156/238) and judged 13.02% (31/238) to be non-relationships. Extracted knowledge was stored in a Neo4j graph database for scalable querying and graph analytics.

Graph database: The resulting knowledge graph contains roughly 197k concepts and 235k connections. Nodes represent standardized concepts with attributes; edges capture directed relationships. Core high-degree concepts include Depression, Anxiety, Life satisfaction, Well-being, and Performance, indicating a balanced focus on mental illness and positive psychology.

3) Hypothesis generation via link prediction: Node embeddings (node2vec) mapped concepts into a vector space. Unconnected pairs received similarity scores, and low-scoring pairs were filtered out. Jaccard similarity over neighbor sets then assessed linkage likelihood, and candidate pairs were ranked by probability. High-probability pairs were used to prompt GPT-4 to formulate novel causal hypotheses bridging the two concepts. The study generated 130 well-being-centered hypotheses.

Evaluation: Four groups were compared: (a) Random-selected LLMCG (30 of the 130, randomly sampled); (b) Expert-selected LLMCG (30 handpicked by two senior positive-psychology researchers); (c) Control-Human (30 hypotheses from 13 psychology PhD students, randomly sampled from 41 collected); and (d) Control-Claude (30 selected by GPT-4 from 50 hypotheses generated by Claude-2 using standardized prompts). Three senior psychology professors, blind to condition and presented with hypotheses in randomized order, rated novelty and usefulness (z-scored per rater). A deep semantic analysis encoded hypotheses with BERT and visualized them with t-SNE; semantic distances were statistically compared (ANOVA with Bonferroni post hoc tests). An ablation compared hypotheses generated by GPT-4 alone (n=60) against LLMCG hypotheses (n=60, random) under the same human rating protocol.
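As a rough illustration of the retrieval step, the sketch below extracts text from one article PDF with PyPDF2 and strips the reference section with a single regex heuristic. The filename and the heading pattern are assumptions for illustration; the paper reports only that regex heuristics were used.

```python
import re
from PyPDF2 import PdfReader  # the paper names PyPDF2 for text extraction

def extract_body_text(pdf_path: str) -> str:
    """Concatenate page text, then drop everything after a references heading."""
    reader = PdfReader(pdf_path)
    full_text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Heuristic: cut at the first references-style heading; the exact
    # pattern here is illustrative, not the authors' actual rule set.
    match = re.search(r"\n\s*(References|REFERENCES|Bibliography)\s*\n", full_text)
    return full_text[: match.start()] if match else full_text

body = extract_body_text("example_article.pdf")  # hypothetical input file
```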
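The extraction stage could look roughly like the following, using the OpenAI chat completions API. The prompt wording is a compressed assumption; the paper describes more elaborate prompt engineering, output filtering, and a GPT-4-assisted verification pass that are omitted here.

```python
import json
from openai import OpenAI  # assumes the v1+ OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Compressed, illustrative version of the paper's extraction prompt.
SYSTEM_PROMPT = (
    "From the text, extract concept pairs with a stated relationship. Return a "
    "JSON list of objects with keys: cause, effect, type ('causality' or "
    "'correlation'), polarity ('positive' or 'negative'). Exclude relationships "
    "the authors merely cite from other studies."
)

def extract_pairs(chunk: str) -> list[dict]:
    """Send one <=4000-token article chunk to GPT-4 and parse the JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0,  # favor deterministic, parseable output
    )
    return json.loads(response.choices[0].message.content)
```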
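For the link-prediction stage, a minimal sketch on a toy graph is shown below, using networkx and the node2vec package. Embedding dimensions, walk parameters, and the example concepts are placeholder values, not the study's settings.

```python
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec; one common implementation

# Toy stand-in for the study's ~197k-concept causal graph.
G = nx.Graph()
G.add_edges_from([
    ("well-being", "life satisfaction"),
    ("well-being", "social support"),
    ("gratitude", "life satisfaction"),
    ("gratitude", "social support"),
])

# 1) Embed concepts with node2vec (parameters are illustrative).
model = Node2Vec(G, dimensions=16, walk_length=10, num_walks=50, quiet=True)
wv = model.fit(window=5, min_count=1).wv

# 2) Similarity score for an unconnected pair; low scorers would be filtered out.
cosine = wv.similarity("well-being", "gratitude")

# 3) Jaccard similarity over neighbor sets, used to rank candidate links.
(_, _, jaccard), = nx.jaccard_coefficient(G, [("well-being", "gratitude")])

# High-probability pairs like this one would then be handed to GPT-4 to phrase
# a testable causal hypothesis bridging the two concepts.
print(f"well-being <-> gratitude: cosine={cosine:.2f}, jaccard={jaccard:.2f}")
```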
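Finally, the rating analysis can be mimicked end to end with scipy: z-score each rater's scores, average per hypothesis, then run a one-way ANOVA and Bonferroni-corrected pairwise tests. The data below are simulated placeholders, so the printed statistics will not match the paper's.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated novelty ratings: 30 hypotheses x 3 raters per group (placeholder data).
group_names = ["llmcg_random", "llmcg_expert", "human", "claude"]
raw = np.vstack([rng.normal(loc, 1.0, size=(30, 3))
                 for loc in (0.3, 0.4, 0.3, -0.4)])

# z-score within each rater (column) across all 120 hypotheses, as in the paper.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scores = z.mean(axis=1).reshape(4, 30)  # one mean novelty score per hypothesis

# One-way ANOVA over the four groups, analogous to the paper's F(3,116) test.
f_stat, p_val = stats.f_oneway(*scores)
print(f"F(3,116) = {f_stat:.2f}, p = {p_val:.4f}")

# One Bonferroni-corrected pairwise comparison (6 pairs in total).
t, p = stats.ttest_ind(scores[0], scores[3])
print(f"{group_names[0]} vs {group_names[3]}: t = {t:.2f}, "
      f"p_bonferroni = {min(p * 6, 1.0):.4f}")
```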
Key Findings
• Corpus and graph: From 43,312 psychology articles, GPT-4 extraction and curation yielded a Neo4j knowledge graph with ~197k concepts and ~235k directed relationships. Validation of GPT-4 extraction showed that 87.54% of 289 relationships were confirmed by human evaluators; for GPT-4-labeled causal pairs, 65.55% were agreed causal and 13.02% were non-relationships.
• Novelty superiority over LLM-only generation: LLMCG hypotheses achieved higher novelty than LLM-only controls. Bonferroni post hoc comparisons for novelty showed significant differences: Random-selected LLMCG vs. Control-Claude, t(59)=3.34, p=0.007, Cohen's d≈0.88; Control-Human vs. Control-Claude, t(59)=4.32, p<0.001, d≈1.12; a trend also favored Expert-selected LLMCG vs. Control-Claude (t(59)=2.49, p=0.085, d≈0.62). ANOVA for mean novelty: F(3,116)=6.92, p=0.0002 (R²≈15%). Median and maximum-score analyses corroborated these effects; notably, maximum novelty showed Expert-selected LLMCG > Control-Claude (t(59)=3.12, p=0.014, d≈0.70).
• Usefulness: Group differences in mean usefulness were not significant (F(3,116)=5.25, p=0.553), with distributions indicating similar practical value across groups; Expert-selected LLMCG tended toward higher usefulness concentrations, but without robust between-group significance.
• Ablation (LLMCG vs. GPT-4 alone): LLMCG markedly improved novelty over GPT-4-only generation across mean, median, and maximum scores (mean: t(119)=6.60, p<0.0001; all Cohen's d>1.1), with no significant difference in usefulness (mean: t(119)=1.31, p=0.1937).
• Deep semantic analysis: BERT embeddings visualized with t-SNE revealed tight clustering for the PhD students (reflecting shared training) and broader dispersion for the LLMCG groups, indicating wider semantic coverage (a code sketch of this analysis follows this list). The semantic-distance ANOVA was significant (F(3,1652)=84.1611, p<0.00001; R²≈86.96%). Post hoc tests showed large differences: Control-Claude vs. Control-Human (t=16.41, p<0.0001, d=1.163), Control-Claude vs. Random-selected LLMCG (t=13.45, p<0.0001, d=1.070), and Control-Claude vs. Expert-selected LLMCG (t=12.72, p<0.0001, d=0.915). Random-selected LLMCG vs. Control-Human also differed (t=2.91, p=0.0036); the two LLMCG subgroups did not differ significantly (p=0.4362).
• Overall: Integrating causal graphs with LLMs yields hypotheses with human-comparable novelty and outperforms state-of-the-art LLM-only generation in novelty while maintaining comparable usefulness.
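As a sketch of the deep semantic analysis, the snippet below encodes hypothesis sentences with a BERT-family encoder, computes pairwise semantic distances, and projects the embeddings to 2D with t-SNE. The specific sentence-transformers model is an assumption (the paper says only "BERT"), and the hypothesis texts are invented examples.

```python
import numpy as np
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer  # BERT-family encoder stand-in

# Invented example hypotheses (the study analyzed 120 across four groups).
hypotheses = [
    "Gratitude journaling increases well-being via strengthened social support.",
    "Perceived autonomy at work causally improves life satisfaction.",
    "Nature exposure raises well-being by reducing attentional fatigue.",
    "Sleep regularity enhances emotional stability and daily mood.",
    "Acts of kindness boost well-being through increased relatedness.",
    "Mindfulness training lowers rumination and raises life satisfaction.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
embeddings = encoder.encode(hypotheses, normalize_embeddings=True)

# Pairwise semantic distances (1 - cosine similarity on unit vectors), the
# quantity the paper compares across groups with ANOVA.
distances = 1.0 - embeddings @ embeddings.T

# 2D projection for visualizing group dispersion, as in the paper's t-SNE plots.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (6, 2)
```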
Discussion
The findings show that coupling LLM semantic extraction with a causal knowledge graph systematically structures the literature’s causal claims and effectively guides exploration of latent relationships via link prediction. This directly addresses the research question: can AI-driven, graph-informed methods generate psychological hypotheses with quality rivaling human experts and exceeding LLM-only outputs? Results indicate parity with doctoral-level novelty and significant gains over LLM-only generation, while usefulness remains similar across groups. The causal graph mitigates LLM interpretability issues by constraining and clarifying inferred relations and focusing hypothesis generation on coherent, high-probability conceptual links. Deep semantic analyses further validate that LLMCG broadens conceptual incorporation and topic dispersion, aligning with human judgments of novelty. These outcomes suggest a scalable, interpretable pathway for automating early-stage psychological discovery and for channeling vast literature into concrete, testable hypotheses.
Conclusion
This work pioneers an integrated framework that extracts causal knowledge from 43k+ psychology papers using GPT-4, constructs a large-scale causal graph, and generates hypotheses via link prediction, with a focused evaluation on well-being. The LLMCG approach achieves human-comparable novelty and significantly surpasses LLM-only methods in novelty without sacrificing usefulness. Contributions include: (1) a specialized causal knowledge base for psychology enabling interpretable, systemic reasoning; (2) an automated hypothesis pipeline combining node embeddings and probabilistic link prediction with LLM hypothesis articulation; and (3) a multifaceted evaluation protocol blending expert ratings and deep semantic analysis. Future research should improve causal extraction precision, enrich concept standardization to reduce semantic redundancy, scale validation beyond 130 hypotheses across more domains, and develop semi-automated evaluation tools that align with expert judgments, thereby further reducing expert bottlenecks and enhancing reproducibility.
Limitations
Key limitations include: (1) extraction noise and inaccuracies: approximately 13% of GPT-4-extracted relationships did not align with expert assessments; (2) sparse and semantically overlapping concepts in the graph can blur distinctions and degrade link prediction; (3) limited empirical validation: only 130 well-being hypotheses were evaluated despite a large conceptual space; (4) variability among expert raters reflects subjective judgment and introduces evaluation inconsistency; and (5) the opacity of GPT-4's internal mechanisms limits the transparency and interpretability of the extracted causal pairs. Improved concept normalization, relation verification, and scalable, standardized evaluation protocols are important next steps.