The paradox of creativity in generative AI: high performance, human-like bias, and limited differential evaluation

Computer Science


J. Desdevises

This 2025 experiment evaluated ChatGPT-4o on the classic "Egg Task," revealing high idea productivity but a fixation bias comparable to that of humans and a limited ability to judge originality, pointing to the continued need for human evaluation. Research conducted by Joy Desdevises.

Introduction
The study examines whether a widely used generative AI (ChatGPT-4o) genuinely enhances creativity or replicates human cognitive biases, particularly the fixation effect in idea generation. Drawing on cognitive models of creativity (e.g., dual-process and triadic models emphasizing inhibitory control and metacognitive monitoring), the research uses the Egg Task to quantify fluency, fixation, expansion, category diversity, and simulated conflict detection (self-evaluation of creativity). The central questions are: Does ChatGPT produce truly original ideas? Does it show a fixation bias similar to humans? Can it differentially evaluate originality versus conventionality? The hypotheses predicted that ChatGPT would: generate more ideas (higher fluency) than humans; predominantly produce fixation-path ideas, as humans do; generate more expansion ideas than humans owing to its broad semantic associations; explore more categories (greater category diversity), though potentially constrained by common associations; and show limited differential evaluation, potentially failing to rate expansion ideas higher than fixation ideas.
Literature Review
Creativity is defined as producing ideas that are both novel and appropriate and involves processes such as divergent thinking and cognitive flexibility. Classic work (Finke, Ward, Smith) highlights obstacles like the path of least resistance and fixation. The Egg Task, grounded in C-K theory, operationalizes fixation (three dominant categories) versus expansion (seven less typical categories). In AI creativity research, early comparisons (e.g., GPT-3 on the Alternative Uses Task, AUT) showed human advantages, while later work indicates that with strategies like generate-and-select, models can match or exceed humans on certain dimensions (Haase and Hanel; Summers-Stay et al.). GPT-4-level models have shown high fluency and, in some contexts, originality and flexibility comparable to or exceeding human performance (Guzik et al.). However, concerns persist about AI-induced homogenization, reduced diversity, and amplified fixation effects (Doshi and Hauser; Wadinambiarachchi et al.). These findings suggest potential AI creativity but raise questions about fixation bias and the nature of originality in AI-generated outputs.
Methodology
Design: Experimental comparison of ChatGPT-4o against human performance on the Egg Task.
Task: Generate as many solutions as possible to "Ensure that a hen's egg dropped from a height of 10 m does not break," followed by assigning a creativity score (1–7) to each idea. Ideas were classified using the established 10-category mapping (three dominant fixation categories; seven expansion categories).
Measures: Fluency (count of ideas), category diversity (number of categories used out of 10), fixation score (count of ideas in the three dominant categories), expansion score (count in the seven expansion categories), proportion of fixation ideas, and subjective creativity ratings (for fixation vs. expansion ideas).
Sample: Humans: secondary data from 47 control participants (age M = 19.04, SD = 1.10; 89% women) from Desdevises and Cassotti (2024), plus weighted averages from eight prior studies (total n varies by measure). AI: 12 distinct ChatGPT-4o runs conducted in private browser windows with different accounts/IPs to capture stochastic variability.
Bootstrapping: Two procedures each generated 50 observations per analysis. For the individual-level comparison, variable-wise resampling (with replacement) from the 12 AI scores created synthetic individual distributions per variable. For the aggregate comparisons, resampling full sets of 12 runs and averaging produced 50 bootstrapped means for each of seven variables.
Analysis: Conducted in Jamovi (v2.3.21), α = 0.05. Given non-normality and unequal variances, Mann–Whitney U tests were used for the individual-level comparisons; for the approximately normal bootstrapped distributions in the aggregate comparisons, one-sample t-tests compared the 50 bootstrapped AI means against the human weighted means. Descriptive statistics for the 12 AI runs used medians and IQRs.
Prompts: Prompt 1 (generation) positioned ChatGPT as a designer asked to propose original solutions; Prompt 2 asked the model to rate the creativity of each idea on a 1–7 scale.
Code/Data: Analysis notebooks are available on OSF (https://osf.io/cqpmr/?view_only=ef11630f521a4cdd95dd9200e1c7ea40).
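To make the scoring and bootstrapping pipeline above concrete, here is a minimal Python sketch. It is not the authors' OSF code (the analysis was run in Jamovi), and all names and data in it are hypothetical placeholders (score_run, the category probabilities, the toy human scores); it only mirrors the described steps: score each run, resample 50 synthetic individuals for the Mann–Whitney comparison, and resample 50 bootstrapped means for the one-sample t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each run is a list of ideas already coded into one of the 10 Egg Task
# categories; indices 0-2 stand in for the three dominant (fixation)
# categories and 3-9 for the seven expansion categories. All data below
# are toy placeholders, not the study's coded outputs.
FIXATION = {0, 1, 2}

def score_run(codes):
    """Compute the Egg Task measures for one generation run."""
    codes = np.asarray(codes)
    fluency = codes.size                               # number of ideas
    diversity = np.unique(codes).size                  # categories used (out of 10)
    fixation = int(np.isin(codes, list(FIXATION)).sum())
    return {
        "fluency": fluency,
        "diversity": diversity,
        "fixation": fixation,
        "expansion": fluency - fixation,
        "prop_fixation": fixation / fluency,
    }

# Hypothetical coded output for the 12 ChatGPT-4o runs.
p_cat = [0.30, 0.30, 0.20, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.02]
ai_runs = [rng.choice(10, size=rng.integers(24, 31), p=p_cat) for _ in range(12)]
ai_fluency = np.array([score_run(r)["fluency"] for r in ai_runs])

# (1) Individual-level comparison: resample the 12 AI scores with replacement
#     into 50 synthetic "individuals", then Mann-Whitney U vs. 47 humans.
human_fluency = rng.poisson(8, size=47)                # placeholder human scores
ai_boot = rng.choice(ai_fluency, size=50, replace=True)
u, p_u = stats.mannwhitneyu(ai_boot, human_fluency, alternative="two-sided")

# (2) Aggregate comparison: resample whole sets of 12 runs, average each set,
#     and test the 50 bootstrapped means against the human weighted mean.
boot_means = np.array([rng.choice(ai_fluency, size=12, replace=True).mean()
                       for _ in range(50)])
t, p_t = stats.ttest_1samp(boot_means, popmean=7.91)   # 7.91 = human weighted mean
print(f"U = {u:.0f} (p = {p_u:.3g}); t(49) = {t:.1f} (p = {p_t:.3g})")
```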
Key Findings
Descriptive AI-only results: ChatGPT-4o fluency median = 28.5 ideas (IQR ≈ 24.3–30); category diversity median = 6 categories (IQR ≈ 5.75–7). Fixation count was significantly higher than expansion count (Wilcoxon W = 78, p = 0.002), with a median of 80.2% of ideas on the fixation path. AI subjective creativity ratings: fixation ideas (Mdn = 5.23, IQR ≈ 5.18–5.28) vs. expansion ideas (Mdn = 5.34, IQR ≈ 5.17–5.49); no significant difference (W = 26, p > 0.05).
Comparative results, AI vs. individual humans (n = 47):
Fluency: AI Mdn = 30 vs. humans Mdn = 7; U = 0, p < 0.001, r = 1.
Category diversity: AI Mdn = 6 vs. humans Mdn = 4; U = 361, p < 0.001, r = 0.69.
Fixation count: AI Mdn = 22.5 vs. humans Mdn = 5; U = 0, p < 0.001, r = 1.
Expansion count: AI Mdn = 6 vs. humans Mdn = 2; U = 216, p < 0.001, r = 0.80.
Proportion of fixation ideas: AI Mdn = 78.6% vs. humans Mdn = 71.4%; U = 991, p = 0.18 (ns).
Subjective ratings: AI rated both fixation (Mdn = 5.22) and expansion (Mdn = 5.27) ideas significantly higher than humans did (fixation Mdn = 3.92, expansion Mdn = 4.21); U = 195, p < 0.001, r = 0.84 and U = 520, p < 0.001, r = 0.56, respectively.
Comparative results, AI vs. weighted human averages across prior studies:
Fluency: AI M = 27.50 (SD = 1.27) vs. human M = 7.91 (SD = 4.49); t(49) = 109.00, p < 0.001, d = 15.50.
Category diversity: AI M = 6.09 (SD = 0.25) vs. human M = 5.52 (SD = 1.83); t(49) = 16.30, p < 0.001, d = 2.30.
Fixation count: AI M = 21.10 (SD = 1.55) vs. human M = 5.87 (SD = 3.85); t(49) = 69.40, p < 0.001, d = 9.82.
Expansion count: AI M = 6.44 (SD = 0.71) vs. human M = 2.33 (SD = 1.94); t(49) = 40.90, p < 0.001, d = 5.78.
Proportion of fixation ideas: AI M = 76.20% (SD = 3.04) vs. human M = 72.06%; t(49) = 9.64, p < 0.001, d = 1.36.
Subjective ratings (fixation): AI M = 5.23 (SD = 0.04) vs. human M = 3.65 (SD = 1.21); t(49) = 302.00, p < 0.001, d = 42.80.
Subjective ratings (expansion): AI M = 5.35 (SD = 0.07) vs. human M = 4.11 (SD = 1.33); t(49) = 182.00, p < 0.001, d = 25.70.
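For readers cross-checking the statistics, the reported effect sizes are consistent with the standard conversions of the rank-biserial correlation from U (with n1 = 50 bootstrapped AI observations and n2 = 47 humans) and of Cohen's d from a one-sample t with n = 50 bootstrapped means. The summary above does not state the formulas, so treating these as the conversions used is an assumption; the short sketch below shows the arithmetic.

```python
import math

def rank_biserial(u, n1, n2):
    """Rank-biserial correlation implied by a Mann-Whitney U statistic."""
    return 1 - 2 * u / (n1 * n2)

def cohens_d_one_sample(t, n):
    """Cohen's d implied by a one-sample t statistic with n observations."""
    return t / math.sqrt(n)

# Category diversity, individual-level comparison: U = 361, 50 AI obs. vs. 47 humans.
print(round(rank_biserial(361, 50, 47), 2))    # 0.69, matching the reported r
# Aggregate fluency comparison: t(49) = 109 with n = 50 bootstrapped means.
print(round(cohens_d_one_sample(109, 50), 1))  # 15.4, close to the reported 15.50
```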
Discussion
The study shows that ChatGPT-4o far exceeds humans in fluency and explores more categories, generating more ideas in both fixation and expansion paths. Despite this productivity, the model replicates a human-like fixation bias: most ideas fall within conventional categories. Critically, ChatGPT-4o does not differentially evaluate originality—its creativity ratings for fixation and expansion ideas are statistically equivalent—contrary to human evaluators who typically rate original ideas higher. This suggests that while the model can traverse broad semantic spaces and produce many ideas, it lacks an effective mechanism to distinguish and prioritize originality. Potential sources of the observed fixation include training data dominated by frequent associations and the model’s probabilistic generation favoring statistically likely continuations. Prompt engineering may not fully resolve this if the model cannot internally discriminate between fixation and expansion ideas. Practically, generative AI should be positioned as a collaborative ideation tool: humans can leverage the model’s fluency and category breadth while applying human judgment to evaluate originality, feasibility, and context. Future work should assess human-led filtering of AI outputs, include feasibility ratings, explore prompt strategies informed by creativity theories, and optimize human–AI interaction for responsible, genuinely creative outcomes.
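As a deliberately simplified illustration of the "statistically likely continuations" point (a toy model, not a description of ChatGPT's actual decoding or training data), sampling ideas from a distribution skewed toward frequent associations reproduces a fixation-dominant profile even though every category remains reachable:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy association strengths for the 10 Egg Task categories: the three
# conventional (fixation) categories are assumed to be far more frequent
# than the seven expansion categories in training-like data.
weights = np.array([9.0, 8.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
probs = weights / weights.sum()

ideas = rng.choice(10, size=30, p=probs)            # one "run" of 30 sampled ideas
share_fixation = np.isin(ideas, [0, 1, 2]).mean()
print(f"Share of ideas in the three fixation categories: {share_fixation:.0%}")
# The expected share here is 23/30 ≈ 77%: most ideas land on conventional
# paths even though none of the expansion categories is ever excluded.
```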
Conclusion
Generative AI (ChatGPT-4o) can accelerate ideation by producing many ideas across diverse categories, including more expansion ideas than humans typically produce. However, it reproduces a fixation bias and fails to evaluate originality differentially, indicating the necessity of human oversight. The model should be viewed as a powerful cognitive assistant rather than an autonomous creative agent. Future research should: evaluate human effectiveness in selecting the most creative and feasible AI-generated ideas; investigate prompts and evaluation strategies grounded in creativity theory and model mechanics; broaden tasks beyond the Egg Task to other creativity domains; and enhance human–AI collaboration frameworks to mitigate risks of overreliance and ensure ethical, context-sensitive creativity.
Limitations
All outputs stem from a single model instance (ChatGPT-4o) with 12 runs; although bootstrapping captured stochastic variability, repeated generations may be correlated, limiting generalizability and inflating precision. Aggregate human comparisons relied on weighted means from prior studies without access to individual-level data, restricting meta-analytic rigor and variance modeling. The study used a single task (Egg Task), limiting the generalizability of findings across creativity domains. Statistical choices were constrained by non-normality and unequal variances, and future work should consider models accounting for dependency in repeated AI outputs (e.g., mixed-effects or dependency-adjusted resampling).