Computer Science

ChatGPT as Research Scientist: Probing GPT’s capabilities as a Research Librarian, Research Ethicist, Data Generator, and Data Predictor

S. A. Lehr, A. Caliskan, et al.

How good is ChatGPT as a research scientist? This paper probes GPT-3.5 and GPT-4 in four roles within psychological science: research librarian, research ethicist, data generator, and predictor of novel data. It finds frequent hallucinations (especially in GPT-3.5), improved self-awareness in GPT-4, strong detection of ethical and methodological flaws, reliable reproduction of known language biases, and poor prediction of novel results. Research conducted by Steven A. Lehr, Aylin Caliskan, Suneragiri Liyanage, and Mahzarin R. Banaji.

Introduction
The paper investigates whether general-purpose large language models (LLMs), specifically GPT-3.5 and GPT-4 (as deployed in ChatGPT), can meaningfully assist core stages of scientific discovery. While acknowledging growing enthusiasm for LLMs as scientific generalists and the demonstrated successes of domain-specific AI applications, the authors also note known issues with LLMs, including bias amplification, hallucinations (fabricated but plausible outputs), and limits in out-of-distribution reasoning. The central research questions are: (1) How well can ChatGPT curate accurate, relevant scientific literature? (2) Can it identify and advise on ethical and methodological flaws in research protocols? (3) Can it simulate known data patterns (e.g., biases in language) as a potential tool for data generation and hypothesis building? (4) Can it predict novel empirical results beyond its training data? Using psychological science as a testbed, the study evaluates GPT’s capabilities across these four roles to assess both present utility and the trajectory of improvement from GPT-3.5 to GPT-4.
Literature Review
The authors situate their work amid advances in AI for scientific discovery (e.g., protein structure prediction, quantum mechanics, fluid dynamics, retrosynthesis) and emerging evidence that LLMs can be fine-tuned for scientific synthesis and operate with scientific tools to demonstrate advanced reasoning. They review concerns: LLMs may amplify biases from training corpora, generate convincing falsehoods (hallucinations), struggle with certain kinds of mathematical reasoning, and often fail to generalize beyond training data. In psychology, commentary predicts LLMs could enhance the field, but systematic evaluations of general-purpose models on core scientific tasks remain scarce. The paper also references the replication crisis and the prevalence of questionable research practices (e.g., p-hacking), underscoring the potential value of tools that promote methodological rigor. Prior work on word embeddings shows that language reflects human-like biases and stereotypes, offering a domain where LLMs might simulate known effects. Finally, theoretical arguments suggest LLMs may be limited in predicting genuinely novel empirical patterns absent from their training data, raising questions about their role in generating empirical novelty.
Methodology
The study comprises four components:
- Study 1: Research Librarian
• Task: GPT-3.5 and GPT-4 were each asked to find and summarize 20 influential articles for 25 psychology topics ranging from broad to narrow, producing 1,000 references (500 per model). GPT sometimes acknowledged that its output might be fictional; these acknowledgments were excluded from the main accuracy analyses but retained to assess overall rates of fictional output and of acknowledgment.
• Coding: Two blinded coders labeled each citation for Correctness (distinguishing hallucinations from smaller errors), Completeness (complete vs. incomplete citations), Topic Relevance, and Citation Count; agreement was substantial (details in SI S1).
• Analyses: Logistic regressions tested differences in hallucination and error rates, the moderating role of completeness, effects of topic breadth, relevance rates, and citation-count comparisons.
- Study 2: Research Ethicist
• Materials: Six fictional vignettes describing flawed research protocols (three blatant, three subtle). The subtle versions required inferring potential problems (e.g., suspicious sample-size decisions). Eighteen initial prompts varied instruction framing (encouraging better or worse responses; status cues; openness to criticism; pro- or anti-open-science stances; jailbreak attempts vs. ethics-aligned prompts).
• Procedure: For each model, in separate chats, each of the 18 prompts was followed by each of the six vignettes, yielding 216 responses in total. Two blinded coders rated responses using 10-point rubrics, resolving disagreements with partial credit; interrater reliability was Cronbach’s α = 0.9827. Analyses included parametric tests and Wilcoxon rank-sum tests, with Bonferroni correction for multiple comparisons in the exploratory prompt analyses (significance threshold P < 0.004167).
• Corollary: Two “good research practice” vignettes assessed recognition of rigorous practices (120 responses across models).
- Study 3: Data Generator
• Task: Evaluate whether GPT simulates known word-embedding-based gender-bias patterns across four constructs: Gender Attitudes (Female-Good vs. Male-Bad) and Art/Science, Home/Work, and Math/Reading stereotypes. Thousands of randomly ordered word dyads adapted from Charlesworth et al. (2021) were presented to GPT-3.5 and GPT-4, with requests to estimate cultural associations based on training data.
• Measures: GPT outputs were treated as proxies for cosine similarities to compute WEAT D-scores and Single-Category WEAT D-scores, benchmarked against meta-analytic WEAT D-scores from adult corpora; inter-item correlations between models and reliability were also assessed (a computational sketch of the D-score follows this list).
- Study 4: Novel Data Predictor
• Task: Assess GPT’s ability to predict cross-country patterns (34 countries) of explicit and implicit attitudes/stereotypes from the Project Implicit International Dataset (sexuality attitudes, age attitudes, and gender science/liberal arts stereotypes). These country-level data were posted after GPT’s training cutoff.
• Procedure: GPT-3.5 and GPT-4 made 60 predictions across tasks. Analyses examined (a) reliability (intercorrelations among repeated predictions for the same construct), (b) correlations between implicit and explicit predictions, to test whether different information was leveraged, and (c) correlations with actual dataset results. Additional analyses assessed topic representation (e.g., web-frequency proxies) as a possible influence on prediction difficulty.
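To make the Study 3 measures concrete, below is a minimal Python sketch, not the authors' code, of the standard WEAT effect-size (D-score) computation, treating model-generated association ratings as stand-ins for cosine similarities. The function name, array shapes, and toy values are illustrative assumptions.

import numpy as np

def weat_d(ratings_X_A, ratings_X_B, ratings_Y_A, ratings_Y_B):
    """WEAT-style effect size (D-score) from association ratings.

    Each argument has shape (n_targets, n_attributes): for example,
    ratings_X_A[i, j] is the model's rated association between target
    word i (e.g., a female term) and attribute word j (e.g., a "good"
    term), used here as a proxy for a cosine similarity.
    """
    # Differential association s(w, A, B) for each target word
    s_X = ratings_X_A.mean(axis=1) - ratings_X_B.mean(axis=1)
    s_Y = ratings_Y_A.mean(axis=1) - ratings_Y_B.mean(axis=1)
    # Effect size: difference of means over the pooled standard deviation
    pooled = np.concatenate([s_X, s_Y])
    return float((s_X.mean() - s_Y.mean()) / pooled.std(ddof=1))

# Toy example with made-up ratings (3 female targets, 3 male targets,
# 2 "good" attributes, 2 "bad" attributes); a positive D indicates a
# stereotype-congruent Female-Good/Male-Bad association.
rng = np.random.default_rng(0)
female_good = rng.uniform(0.4, 0.9, (3, 2))
female_bad = rng.uniform(0.1, 0.5, (3, 2))
male_good = rng.uniform(0.2, 0.6, (3, 2))
male_bad = rng.uniform(0.3, 0.8, (3, 2))
print(round(weat_d(female_good, female_bad, male_good, male_bad), 2))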
Key Findings
- Study 1: Research Librarian
• Hallucinations: Among references claimed as real, GPT-3.5 hallucinated 36.0% of citations vs. 5.40% for GPT-4 (OR = 9.854). Rates of smaller errors were similar: 4.20% (GPT-3.5) vs. 4.60% (GPT-4), OR = 0.909, P = 0.758.
• Completeness moderation: Incomplete citations were far more likely to be hallucinations (62.41%) than complete citations (14.30%); OR = 9.947. The effect held for GPT-3.5 (OR = 7.856) and GPT-4 (OR = 36.362). No similar significant pattern emerged for smaller errors.
• Acknowledgment of fiction: Overall rates of fictional references were higher for GPT-3.5 (39.05%) than GPT-4 (23.12%); OR = 2.130. When producing fiction, GPT-4 acknowledged it 84.30% of the time vs. 12.20% for GPT-3.5; OR = 38.667.
• Topic broadness: Narrower topics increased hallucinations until the narrowest level, where models often admitted lack of knowledge (counted as correct), yielding a curvilinear effect; the linear effect was stronger when excluding admissions (β = −0.555, P < 0.001; pseudo-R² = 0.075) than when including them (β = −0.191, P = 0.001; pseudo-R² = 0.0115). The pattern was robust for GPT-3.5 but not GPT-4.
• Relevance: Across claimed-real articles, GPT-4 returned more relevant articles (50.80%) than GPT-3.5 (30.80%); OR = 2.320. Among real references only, the difference was nonsignificant (53.70% vs. 48.13%; OR = 1.250, P = 0.124). GPT-4’s advantage was clearest at moderate topic breadth (75.00% vs. 32.14%; OR = 6.333).
• Citation counts of relevant references: No significant differences; means of ~2936.7 (GPT-4) vs. 3105.2 (GPT-3.5), P = 0.791, d = 0.027.
- Study 2: Research Ethicist
• Overall performance: GPT-4 outperformed GPT-3.5. Blatant vignettes: 8.86/10 vs. 5.39 (d = 1.992). Subtle vignettes: 7.26 vs. 4.05 (d = 1.571). GPT-4’s performance on subtle vignettes exceeded GPT-3.5’s on blatant vignettes (d = 0.897). Both models scored higher on blatant than on subtle vignettes (GPT-4 d = 0.987; GPT-3.5 P = 0.0015, d = 0.627).
• Prompt effects: Matched “good vs. bad” prompt contrasts were not significant after Bonferroni correction, and jailbreak attempts did not significantly degrade performance. Collapsing across prompts, any prompt that evoked data ethics produced stronger responses (M = 7.35) than prompts that did not (M = 5.78); P < 0.0001, d = 0.625. This held even when restricting to ethics primes framed to elicit unethical responses (P = 0.0099; attenuated under Bonferroni correction).
• Good-practice vignettes: Both models identified most good practices (GPT-4 = 92.67%, GPT-3.5 = 90.42%; difference nonsignificant, P = 0.071).
- Study 3: Data Generator
• WEAT D-scores: GPT replicated stereotype-congruent effects relative to prior meta-analytic benchmarks: Female-Good/Male-Bad (prior 0.49; GPT-3.5 = 1.00; GPT-4 = 0.57), Female-Art/Male-Science (prior 0.54; GPT-3.5 = 1.16; GPT-4 = 1.46), Female-Home/Male-Work (prior 0.94; GPT-3.5 = 0.40; GPT-4 = 0.45), and Female-Reading/Male-Math (prior 0.67; GPT-3.5 = 0.73; GPT-4 = 0.96).
• Reliability across models: Inter-item correlations between GPT-3.5 and GPT-4 responses to the same dyads were modest: r = 0.382 (Math/Reading), 0.568 (Preference), 0.666 (Work/Home), and 0.554 (Art/Science), all Ps < 0.0001. Single-Category WEAT effects suggested GPT’s results were driven more by stronger Female associations with stereotypically female categories than by Male associations with stereotypically male categories, diverging from prior research.
- Study 4: Novel Data Predictor
• Sexuality attitudes: GPT-3.5 reliability was high across explicit (mean r = 0.875) and implicit (0.879) predictions, and implicit–explicit correlations were also high (0.778). GPT-4 reliability was even higher for explicit (0.957) and implicit (0.946) predictions, with an essentially identical implicit–explicit correlation (0.952), indicating little differential information for implicit vs. explicit tasks.
• Correlation with real data: GPT-3.5 predicted explicit sexuality attitudes reasonably well (mean r = 0.602) but not implicit attitudes (−0.014). GPT-4 predicted explicit sexuality attitudes well (0.714) but implicit attitudes only weakly (0.152).
• Age attitudes: GPT-3.5 predictions were unreliable and uncorrelated with real explicit (−0.010) and implicit (−0.175) results. GPT-4 predictions were more internally reliable (explicit mean r = 0.645; implicit 0.726; implicit–explicit 0.664) but negatively or not at all correlated with real data (explicit −0.395; implicit −0.120).
• Gender science/liberal arts stereotypes: GPT-3.5 was unreliable. GPT-4 showed weak explicit reliability (0.363), strong implicit reliability (0.868), and a mid-range implicit–explicit correlation (0.499), yet its predictions failed to correlate with real explicit results (explicit predictions r = −0.192; implicit predictions r = −0.417) or real implicit results (explicit predictions r = 0.054; implicit predictions r = −0.067).
• Topic representation: Analyses suggested sexuality attitudes are better represented in GPT’s knowledge base than age attitudes or gender-science stereotypes, possibly explaining why explicit prediction succeeded only for sexuality. (A sketch of how the reliability and validity correlations above can be computed follows this list.)
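As a rough illustration of how the Study 4 reliability and validity figures are typically constructed, here is a minimal Python sketch, assuming reliability is the mean pairwise Pearson correlation among repeated prediction runs and validity is the correlation of averaged predictions with observed country-level scores. All variable names and numbers are hypothetical, not the authors' data.

import numpy as np
from itertools import combinations

def mean_pairwise_r(predictions):
    """Mean Pearson correlation across repeated prediction runs.

    predictions has shape (n_runs, n_countries): each row holds one
    run's predicted country-level scores for the same construct.
    """
    pairs = combinations(range(len(predictions)), 2)
    rs = [np.corrcoef(predictions[i], predictions[j])[0, 1] for i, j in pairs]
    return float(np.mean(rs))

# Toy example: three repeated runs over five hypothetical countries
runs = np.array([
    [0.2, 0.5, 0.1, 0.9, 0.4],
    [0.3, 0.6, 0.2, 0.8, 0.5],
    [0.1, 0.4, 0.2, 0.9, 0.3],
])
print(round(mean_pairwise_r(runs), 3))  # within-construct reliability

# Validity check: correlate averaged predictions with (hypothetical)
# observed country-level scores from the real dataset
observed = np.array([0.25, 0.55, 0.15, 0.85, 0.45])
print(round(float(np.corrcoef(runs.mean(axis=0), observed)[0, 1]), 3))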
Discussion
The findings reveal differentiated capabilities of GPT across core scientific tasks. As a Research Librarian, GPT-4 substantially reduced unacknowledged hallucinations relative to GPT-3.5 and frequently acknowledged fiction when it occurred, though smaller citation errors were just as common and topic relevance remained modest, apart from an advantage at moderate topic breadth. The completeness of output strongly moderated hallucinations, suggesting that the formality of a response influences its factuality. As a Research Ethicist, GPT-4 provided high-quality, modern, ethics-aligned feedback and detected both blatant and subtle methodological flaws, with performance robust to varied prompts and even to jailbreak attempts; prompts evoking data ethics primed higher-quality responses. As a Data Generator, both models replicated known word-embedding stereotypes, indicating potential to simulate established patterns and to assist with piloting and hypothesis generation, despite divergences in Single-Category WEAT patterns and only moderate inter-model reliability. As a Novel Data Predictor, both models generally failed to predict country-level implicit attitudes and other less-publicized outcomes, and GPT-4 leveraged little to no distinct information when predicting implicit versus explicit attitudes, highlighting its reliance on familiar knowledge. Collectively, these results suggest LLMs can aid research-ethics evaluation and generate data reflecting known patterns, but they remain limited in predicting genuinely novel empirical results, reinforcing that scientific progress continues to require new data collection.
Conclusion
The paper demonstrates that GPT-4 is a rapidly improving but still flawed research librarian, a strong research ethicist capable of identifying and advising on methodological issues, and a competent simulator of known data patterns in simple, well-characterized domains. However, both GPT-3.5 and GPT-4 are poor predictors of novel, out-of-training data patterns, especially for implicit attitudes and less-represented topics. These findings caution against overestimating LLMs’ ability to generate empirical novelty and emphasize the continued necessity of real-world data. Future work should: (1) map domains and task characteristics where LLMs can reliably simulate data; (2) explore mechanisms to improve factual curation, including leveraging completeness/formality to reduce hallucinations and errors; (3) investigate prompt designs that ethically prime higher-quality methodological critiques; (4) test generalization across disciplines beyond psychology; and (5) evaluate whether LLMs can synthesize existing knowledge to produce plausible, testable hypotheses that extend but do not replace empirical discovery.
Limitations
- Domain and scope: The studies focus on psychological science and a specific set of tasks; generalizability to other disciplines remains to be validated.
- Novel prediction domain: Study 4 evaluates a single major dataset (Project Implicit International) and a limited set of constructs; broader assessments are needed to generalize conclusions about predicting novel data.
- Training opacity: GPT-4’s training procedures are not fully disclosed; improvements (e.g., acknowledgment of fiction) may stem from targeted training or emergent properties, constraining causal interpretation.
- Citation accuracy: Despite reduced hallucinations, smaller bibliographic errors remained frequent and can propagate misinformation in scholarly ecosystems.
- Reliability concerns: Moderate inter-model correlations in Study 3 and GPT-3.5’s incoherence in Study 4 suggest reliability constraints on some tasks.
- Potential self-censorship: Divergences in Single-Category WEAT patterns may reflect model reluctance to express socially undesirable stereotypes, potentially limiting use in research on sensitive topics.
- Prompt-context effects: The observed formality/completeness and ethics-priming effects suggest outputs are sensitive to prompt framing; causal mechanisms warrant experimental validation.