Computer Science
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
C. Si, D. Yang, and T. Hashimoto
The study investigates whether current large language models can generate research ideas comparable to those produced by expert NLP researchers, judged on novelty, feasibility, excitement, expected effectiveness, and overall quality. Motivated by rapid LLM progress on scientific tasks (math problem solving, code generation, literature retrieval), the authors identify a gap: prior work has not demonstrated expert-level ideation by LLMs under rigorous, large-scale human evaluation. The paper frames ideation as the crucial first step toward autonomous research agents and emphasizes the need for carefully controlled, expert-driven evaluation, given the subjectivity and difficulty of judging research ideas. The authors therefore design an evaluation-centric, controlled comparison between human experts and an LLM ideation agent, standardizing topics and writeup formats and conducting blind reviews by qualified researchers, to provide statistically rigorous evidence about LLM ideation capabilities.
The paper situates its contribution within three lines of work: (1) research agents and ideation systems, which emphasize methods (e.g., iterative novelty boosting, multi-agent collaboration, retrieval and revision) but typically rely on small expert review pools, constrained idea formats, or LLM-as-a-judge rather than large-scale expert evaluation; (2) LLM support for other research tasks, including code generation for experiments, literature curation, review generation, and outcome prediction; and (3) computational creativity, where prior findings often show reduced creativity or diversity in LLM outputs and homogenization effects. The authors argue that their evaluation-first, large-scale expert comparison fills a gap by providing standardized protocols, expert baselines, and robust statistical tests for ideation quality, specifically novelty, compared head-to-head against human experts.
Study design and controls: The authors define ideation evaluation as comprising (1) the idea itself, (2) its writeup, and (3) expert evaluation. To control confounders, they constrain the topic space and standardize writeup style.

Topics: Seven prompting-related NLP topics extracted from conference calls for papers: Bias, Coding, Safety, Multilinguality, Factuality, Math, and Uncertainty. Human participants pick topics from this list; for each chosen topic, the LLM generates a matching-topic idea, equalizing topic distributions across conditions.

Idea writeup standardization: Both humans and the LLM follow a grant-style template (title, problem statement, motivation, method, step-by-step experiment plan, test cases, fallback plan). An LLM-based style-normalization module homogenizes writing style across all ideas; the first author manually verified that content was preserved.

Review protocol: Blind reviews by expert researchers using a form adapted from major AI conferences, scoring novelty, excitement, feasibility, expected effectiveness, and overall quality on 1–10 scales with textual rationales. Reviewers selected topic preferences and review loads; each idea received 2–4 reviews; assignments were balanced across conditions and avoided same-institution conflicts.

Conditions compared:
- Human Ideas: proposals written by recruited experts (N=49).
- AI Ideas: top-ranked ideas from an LLM agent.
- AI Ideas + Human Rerank: the same LLM generations, but with top ideas manually reranked by a human expert (the first author) to estimate an upper bound on AI idea quality.

LLM ideation agent: A minimalist agent with three components: paper retrieval, idea generation (including deduplication), and idea ranking.
- Retrieval (RAG): Using Claude-3.5-Sonnet, the agent issues function calls to the Semantic Scholar API (KeywordQuery, PaperQuery, GetReferences), retrieves up to 120 papers (top 20 per call), and uses the LLM to score papers for relevance, empirical nature, and inspirational value; the top-ranked papers are fed into the ideation prompts (a retrieval sketch appears below).
- Idea generation: The agent generates 4000 seed ideas per topic, prompting with 6 demonstration exemplars and 10 randomly selected retrieved paper titles/abstracts for retrieval augmentation; titles of previously generated ideas are appended to the prompt to reduce repetition.
- Deduplication: Seed ideas are encoded with all-MiniLM-L6-v2 and near-duplicates removed at a cosine-similarity threshold of 0.8; only about 5% survive as unique (≈200 of 4000), exposing diversity limits (a deduplication sketch appears below).
- Ranking: The authors scrape ~1200 ICLR 2024 LLM-related submissions with review scores and decisions to guide ranker design. LLMs perform poorly at predicting absolute scores but better at pairwise judgments, so a zero-shot Claude-3.5-Sonnet pairwise ranker is run in a Swiss-style tournament (N=5 rounds) to produce ranked lists (a tournament sketch appears below). As a sanity check, the top-10 and bottom-10 ranked papers differ markedly in average ICLR scores. The top-ranked agent proposals constitute the AI Ideas condition; the AI Ideas + Human Rerank condition replaces the LLM ranker with a human reranker, and only 17 of 49 ideas overlap with the AI Ideas set, showing ranking discrepancies.

Human participants and quality controls: Participants were recruited via community channels with IRB approval; US-based participants were screened to have at least one major-AI-venue publication. Final pool: 49 idea writers (26 institutions) and 79 reviewers (32 institutions); 24 served in both roles, with self-review avoided. Compensation: $300 per writer plus $1000 bonuses for the top 5 ideas, and $25 per review; 298 reviews in total. Writers reported mean topic familiarity of 3.7/5 and difficulty of 3.0/5, spent 5.5 hours on average, and produced proposals averaging ~902 words, versus ~1174–1186 words for AI proposals. Reviewers reported familiarity of 3.7/5 and confidence of 3.7/5, spent ~32 minutes per review, wrote ~232 words per review, and 80 of 298 reviews cited related papers.

Statistical analysis: Three tests assess robustness (sketched in code below). Test 1 treats each review as an independent datapoint and uses Welch's t-tests with Bonferroni correction. Test 2 averages reviews per idea (N=49 per condition). Test 3 averages scores per reviewer and tests mean differences against the human baseline with one-sample t-tests. A linear mixed-effects model accounting for topic, idea, and reviewer variance further corroborates the findings.

Additional analyses: Correlations among metrics; inter-reviewer agreement on a balanced top/bottom-25% split; accuracy of LLM-as-a-judge evaluators; qualitative coding of reviewers' free-text rationales; and an examination of how LLM generation diversity scales.
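To make the pipeline concrete, the following sketches illustrate the agent components under stated assumptions. First, retrieval: a minimal sketch assuming plain keyword search against the public Semantic Scholar Graph API; the paper's Claude-3.5-Sonnet function-calling orchestration and the exact LLM scoring prompt are stubbed out, and `keyword_query`, `score_paper`, and `retrieve` are illustrative names, not the authors' code.

```python
# Minimal sketch of the retrieval step: keyword search against the
# Semantic Scholar Graph API, capped at 20 results per call and 120
# papers total, with the LLM relevance scorer stubbed out.
import requests

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def keyword_query(query: str, k: int = 20) -> list[dict]:
    """One KeywordQuery-style call: top-k papers with title and abstract."""
    resp = requests.get(
        S2_SEARCH,
        params={"query": query, "limit": k, "fields": "title,abstract"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def score_paper(paper: dict) -> float:
    """Stub for the LLM scoring pass (relevance, empirical nature,
    inspirational value); swap in a real model call here."""
    return float(len(paper.get("abstract") or ""))  # trivial stand-in

def retrieve(queries: list[str], max_papers: int = 120) -> list[dict]:
    """Issue keyword queries until the paper budget is reached,
    then rank the pool by the (stubbed) LLM score."""
    pool: list[dict] = []
    for q in queries:
        pool.extend(keyword_query(q))
        if len(pool) >= max_papers:
            break
    return sorted(pool[:max_papers], key=score_paper, reverse=True)
```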
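Next, deduplication: a minimal sketch using sentence-transformers with the paper's encoder and 0.8 cosine threshold; the greedy keep-first pass is an assumption, since the paper specifies the encoder and threshold but not the exact pass order.

```python
# Minimal sketch of seed-idea deduplication: embed ideas with
# all-MiniLM-L6-v2 and greedily drop any idea whose cosine similarity
# to an already-kept idea reaches the 0.8 threshold.
from sentence_transformers import SentenceTransformer, util

def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(ideas, convert_to_tensor=True,
                       normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(ideas)):
        # Keep idea i only if it is not too close to any kept idea.
        if not kept or util.cos_sim(emb[i], emb[kept]).max().item() < threshold:
            kept.append(i)
    return [ideas[i] for i in kept]
```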
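Third, ranking: a minimal sketch of the Swiss-style pairwise tournament. Here `judge` stands in for the zero-shot Claude-3.5-Sonnet pairwise comparison, and pairing adjacent proposals by running score is an assumption about how similar-score pairing is implemented.

```python
# Minimal sketch of the Swiss-style ranking tournament: in each of
# five rounds, proposals with similar running scores are paired, an
# LLM judge picks the stronger one, and wins accumulate into scores.
import random

def judge(idea_a: str, idea_b: str) -> str:
    """Stub: replace with an LLM call that returns the stronger idea."""
    return random.choice([idea_a, idea_b])

def swiss_rank(ideas: list[str], rounds: int = 5) -> list[str]:
    scores = {idea: 0 for idea in ideas}
    for _ in range(rounds):
        # Pair adjacent ideas after sorting by current score.
        order = sorted(ideas, key=lambda i: scores[i], reverse=True)
        for a, b in zip(order[::2], order[1::2]):
            scores[judge(a, b)] += 1
    return sorted(ideas, key=lambda i: scores[i], reverse=True)
```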
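Finally, the three statistical tests: a minimal sketch using SciPy, assuming a long-format table of reviews with columns `condition`, `idea_id`, `reviewer_id`, and one column per metric (novelty shown); the Bonferroni factor of 2 reflects the two AI conditions compared against the human baseline.

```python
# Minimal sketch of the paper's three significance tests on novelty.
import pandas as pd
from scipy import stats

K = 2  # Bonferroni factor: two AI conditions vs. the human baseline

def test1(df: pd.DataFrame, cond: str) -> float:
    """Test 1: every review is a datapoint; Welch's t-test + Bonferroni."""
    human = df.loc[df["condition"] == "Human", "novelty"]
    ai = df.loc[df["condition"] == cond, "novelty"]
    p = stats.ttest_ind(ai, human, equal_var=False).pvalue
    return min(p * K, 1.0)

def test2(df: pd.DataFrame, cond: str) -> float:
    """Test 2: average the 2-4 reviews of each idea first."""
    per_idea = df.groupby(["condition", "idea_id"],
                          as_index=False)["novelty"].mean()
    return test1(per_idea, cond)

def test3(df: pd.DataFrame, cond: str) -> float:
    """Test 3: per-reviewer mean difference vs. Human, one-sample t-test."""
    wide = df.pivot_table(index="reviewer_id", columns="condition",
                          values="novelty", aggfunc="mean")
    diffs = (wide[cond] - wide["Human"]).dropna()
    p = stats.ttest_1samp(diffs, 0.0).pvalue
    return min(p * K, 1.0)
```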
- Main result: Across all three statistical tests, AI-generated ideas receive significantly higher novelty scores than human expert ideas; AI feasibility is comparable or slightly lower, while excitement and overall scores show some gains, particularly when AI ideas are human-reranked.
- Test 1 (per-review analysis): Novelty means (1–10 scale): Human 4.84 (SD 1.79, n=119), AI 5.64 (SD 1.76, n=109, p<0.01), AI+Human Rerank 5.81 (SD 1.66, n=109, p<0.001). Excitement: AI (5.19, p<0.05), AI+Human (5.46, p<0.01) > Human (4.55). Feasibility: no significant differences (Human 6.61; AI 6.34; AI+Human 6.44). Overall: AI+Human (5.34, p<0.05) > Human (4.68); AI not significantly different (4.85).
- Test 2 (per-idea analysis, N=49 per condition): Novelty means: Human 4.86 (SD 1.26), AI 5.62 (SD 1.39, p<0.05), AI+Human 5.78 (SD 1.07, p<0.01). Other metrics not significantly different, though AI+Human trends higher in excitement and overall.
- Test 3 (per-reviewer differences): Mean novelty differences: AI−Human = +0.94 (N=70, p<0.01); AI+Human−Human = +0.86 (N=65, p<0.01). Excitement differences significant; feasibility differences not significant.
- Mixed-effects models: Positive, significant coefficients for the AI conditions on novelty (AI: +0.756, p<0.05; AI+Human: +0.902, p<0.01); feasibility coefficients are negative but non-significant (a model sketch follows this list).
- Reviewer focus: The overall score correlates strongly with excitement (r=0.854) and novelty (r=0.725) but only weakly with feasibility (r=0.097), suggesting reviewers prioritize novelty and excitement over feasibility in overall judgments (a correlation sketch follows this list).
- Inter-reviewer agreement: 56.1% on balanced top/bottom-25% ranking, higher than chance (50%) but lower than NeurIPS’21 (66.0%) and ICLR’24 (71.9%), reflecting subjectivity in idea evaluation.
- LLM-as-a-judge performance: Lower than human agreement—Claude-3.5 pairwise 53.3%, Claude-3.5 direct 51.7%, GPT-4o direct 50.0%, GPT-4o pairwise 45.0%, AI Scientist reviewer 43.3%—underscoring unreliability of LLM evaluators for nuanced ideation.
- Diversity limit under inference-time scaling: Of 4000 seed ideas per topic, only about 200 are non-duplicates after deduplication (≈5%), and the non-duplicate fraction declines as generation proceeds, indicating saturation and limited idea diversity with naive over-generation.
- Qualitative themes: Reviewers often praise AI ideas for novelty but critique them for vagueness in implementation, misuse of datasets, missing baselines, unrealistic assumptions, resource demands, and incomplete motivation. Human ideas are typically more grounded/feasible but often more incremental.
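To ground the mixed-effects result above, here is a minimal sketch with statsmodels, reusing the assumed long-format review table; statsmodels expresses idea and topic as variance components inside the reviewer grouping, which approximates the paper's topic/idea/reviewer random-effects structure rather than reproducing it exactly.

```python
# Minimal sketch of the mixed-effects analysis: a fixed effect for
# condition (human baseline) plus reviewer, idea, and topic variance.
# Column names (novelty, condition, reviewer_id, idea_id, topic) are
# assumptions about the data layout, not the released schema.
import pandas as pd
import statsmodels.formula.api as smf

def fit_novelty_model(df: pd.DataFrame):
    model = smf.mixedlm(
        "novelty ~ C(condition, Treatment(reference='Human'))",
        data=df,
        groups=df["reviewer_id"],              # reviewer random effect
        vc_formula={"idea": "0 + C(idea_id)",  # idea variance component
                    "topic": "0 + C(topic)"},  # topic variance component
    )
    return model.fit()

# result = fit_novelty_model(df); print(result.summary())
```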
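And the reviewer-focus correlations: a minimal sketch of the Pearson correlations between the overall score and each sub-metric, again over the assumed review table; computing them per review rather than per idea is an assumption about granularity.

```python
# Minimal sketch of the metric-correlation analysis: Pearson r between
# the overall score and each sub-metric across reviews.
import pandas as pd

def overall_correlations(df: pd.DataFrame) -> pd.Series:
    metrics = ["overall", "excitement", "novelty", "feasibility"]
    # corr() returns the full correlation matrix; keep the overall column.
    return df[metrics].corr(method="pearson")["overall"].drop("overall")
```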
Findings indicate that current LLMs, when embedded in a straightforward ideation agent with retrieval and large-scale sampling plus ranking, can produce ideas expert reviewers judge as more novel than typical expert-generated ideas written under time constraints. Gains in novelty correlate with increased excitement and, in some analyses, overall ratings, though feasibility tends to be slightly lower or comparable. The results suggest that LLM ideation could complement human expertise, particularly when combined with simple human-in-the-loop reranking, which boosts excitement and overall quality further. However, the study also surfaces important caveats: reviewer subjectivity is substantial; overall judgments prioritize novelty/excitement over feasibility; and LLM self-evaluation is currently unreliable, limiting purely automated over-generate-and-rank pipelines. Moreover, inference-time scaling faces diversity ceilings, as idea generation rapidly saturates with duplicates. These insights guide future research on building more diverse generation processes, robust evaluation pipelines (reducing reliance on LLM-as-judge), and hybrid human-AI workflows. The authors also discuss external validity and next steps: executing ideas to evaluate real project outcomes; extending beyond prompting-based NLP to other domains; and exploring end-to-end automation while ensuring faithful implementations. Ethical and sociotechnical considerations include risks of low-quality submission floods, ambiguity of intellectual credit, potential misuse, idea homogenization, and impacts on human collaboration norms.
The paper provides the first large-scale, controlled, expert-blind evaluation comparing LLM-generated research ideas to those written by expert NLP researchers. Under matched topics and standardized writeups, AI ideas are consistently judged significantly more novel than human ideas, with comparable feasibility and increased excitement in some settings—especially when a human expert reranks LLM outputs. The work contributes a rigorous evaluation protocol, human baselines, released agent implementation and review data, and an analysis of limitations in LLM generation diversity and self-evaluation. Future directions include: conducting an end-to-end execution study to tie novelty/feasibility judgments to actual research outcomes; improving LLM ideation diversity via better sampling/conditioning; developing trustworthy evaluation frameworks that reduce reliance on LLM judges; and extending the study design to other research domains to assess generality.
- Subjectivity and reviewer variance: Inter-reviewer agreement is modest (56.1%), lower than typical conference settings, reflecting the inherent subjectivity of idea evaluation without experimental results.
- Power for secondary effects: While novelty gains are robust, the study is underpowered to draw strong conclusions about feasibility, effectiveness, or overall differences beyond suggestive trends.
- Scope restriction: The study focuses on prompting-based NLP topics chosen to enable subsequent execution studies; conclusions may not generalize to other domains or resource-intensive research areas.
- Human idea quality baseline: Many human participants generated ideas within a 10-day window and self-reported that their submissions were around the top 43% of their ideas, which may underrepresent their best work.
- Style normalization: Although content preservation was manually checked, LLM-based style standardization could still introduce subtle changes.
- Ranking reliability: The LLM ranker is imperfect (discrepancies with human reranking), and LLM-as-a-judge accuracy is low, limiting fully automated selection pipelines.
- Diversity ceiling in LLM generation: Over-generation yields many duplicates (≈95% of 4000 seeds), constraining inference-time scaling benefits.
- Participant pool constraints: Recruitment was limited to US-based participants for logistical reasons, which may affect representativeness.