The paradox of creativity in generative AI: high performance, human-like bias, and limited differential evaluation

J. Desdevises

Generative AI like ChatGPT-4o can produce a high volume of ideas but still mirrors human fixation and struggles to distinguish original from conventional responses. This 2025 experiment, conducted by Joy Desdevises (OCTO Technology, Accenture, Paris, France), compared the model to 47 human participants and aggregated data from eight prior egg-task studies, revealing strong fluency yet limited originality evaluation and underscoring the need for human oversight.

Introduction
The study investigates whether a widely used generative AI (ChatGPT-4o) can truly enhance creativity or whether it replicates human-like creative biases, particularly fixation, which constrains innovation. Creativity is essential for adapting to complex challenges and is increasingly supported by generative AI in professional practices. Yet it is unclear if such tools foster originality or reinforce predictable, conventional ideas. The work draws on cognitive models of creativity: classic accounts emphasize divergent thinking, cognitive flexibility, and conceptual expansion to escape the 'path of least resistance.' The fixation effect captures the tendency to produce ideas anchored in accessible, conventional knowledge. A triadic model posits that creativity arises from the interaction of intuitive (System 1), deliberative (System 2), and inhibitory control processes, with inhibition helping to suppress stereotyped ideas and enable originality. To objectively assess fixation versus originality, the egg task (from C-K theory) categorizes solutions into 10 conceptual categories, with three dominant categories reflecting fixation and seven categories reflecting conceptual expansion. The study additionally probes 'conflict detection' through self-assigned creativity ratings to assess whether the model differentially evaluates fixation versus expansion ideas, functionally analogous to metacognitive monitoring in humans. Central questions: Can ChatGPT generate genuinely creative ideas? Does it exhibit fixation bias similar to humans? Can it detect conflicts between instructions (expect originality) and its outputs? 
The study also states five hypotheses: (1) ChatGPT will show higher fluency than humans; (2) ChatGPT will predominantly generate fixation ideas; (3) ChatGPT may generate more expansion ideas than humans; (4) ChatGPT may explore more categories (greater category diversity) but could also favor common associations; (5) ChatGPT may fail to differentially evaluate fixation vs. expansion ideas, assigning similar creativity to both.
Literature Review
The paper reviews cognitive psychology and creativity research, noting definitions of creativity (novelty and appropriateness) and core processes such as divergent thinking and cognitive flexibility. It highlights fixation bias, where ideas follow the path of least resistance, and discusses conceptual expansion as a means to generate genuinely novel ideas (Finke, Ward, Smith; Ward; Smith et al.). The triadic model (Cassotti et al., 2016) emphasizes intuitive and deliberative systems moderated by inhibitory control, with inhibition pivotal for overcoming fixation. The egg task, grounded in C-K theory, offers an objective mapping of ideas into 10 categories—three dominant fixation categories (cushioning, protecting, slowing the fall) vs. seven expansion categories—enabling quantification of fluency, flexibility (category diversity), fixation, and expansivity. Regarding generative AI, the review describes transformer-based LLMs (e.g., GPT-4o), their ability to traverse semantic spaces, and mixed evidence on AI creativity: earlier models underperform humans on AUT (Stevenson et al., 2022), while strategies like generate-and-select and newer models can match or surpass humans on some dimensions (Haase & Hanel, 2023; Guzik et al., 2023; Summers-Stay et al., 2023). However, concerns persist that AI outputs may display fixation-like patterns due to training data and probabilistic generation, leading to predictable, homogenized outputs that reduce diversity (Doshi & Hauser, 2024). Exposure to AI examples can amplify human fixation (Wadinambiarachchi et al., 2024). These gaps motivate the present study to test, with the egg task, whether AI outputs intrinsically reflect fixation or can bypass human-like constraints.
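The four egg-task measures named above can be sketched as a small scoring function. This is a minimal illustration, assuming each idea has already been hand-coded into one category; the function name `score_egg_task` and the placeholder labels `expansion_a` and `expansion_b` are hypothetical, since the summary names only the three fixation categories.

```python
from collections import Counter

# The three dominant (fixation) categories named in the review.
FIXATION_CATEGORIES = {"cushioning", "protecting", "slowing the fall"}


def score_egg_task(idea_categories):
    """Score one participant (or one model run) from per-idea category labels."""
    counts = Counter(idea_categories)
    fluency = len(idea_categories)
    # Fixation score: ideas falling into one of the three dominant categories.
    fixation = sum(n for cat, n in counts.items() if cat in FIXATION_CATEGORIES)
    expansion = fluency - fixation
    return {
        "fluency": fluency,
        "category_diversity": len(counts),
        "fixation": fixation,
        "expansion": expansion,
        "proportion_fixation": fixation / fluency if fluency else 0.0,
    }


# Hypothetical run: "expansion_a" and "expansion_b" are placeholder labels,
# not category names from the study.
run = ["cushioning", "cushioning", "protecting", "slowing the fall",
       "slowing the fall", "expansion_a", "expansion_b"]
scores = score_egg_task(run)
print(scores)
```

Counting categories rather than judging ideas one by one is what makes the measure objective: two raters who agree on the category coding will always agree on the scores.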
Methodology
Design: Experimental comparison of ChatGPT-4o and humans on the egg task, with additional comparison to aggregated human reference data from prior literature. Measures include fluency (number of ideas), category diversity (number of categories out of 10), fixation score (ideas in the three dominant categories), expansion score (ideas in the seven non-dominant categories), and subjective creativity ratings (1–7) for each idea supplied by ChatGPT to assess differential evaluation (conflict detection).
Sample: Human data comprised 47 control participants from a prior study (Age: M = 19.04, SD = 1.10; 89% women), completing the egg task individually with paper-and-pencil and a strict 10-minute time frame. Additionally, weighted average scores were computed from eight prior studies using identical instructions and timing, with participants aged 19–29 (M = 21.04, SD = 3.96), serving as a normative reference. ChatGPT data comprised 12 distinct runs (private browser windows, different accounts/IPs) using ChatGPT-4o. These runs capture stochastic variability and were treated as distinct plausible prompt-response interactions. Two bootstrapping procedures addressed small sample sizes and enabled comparisons.
Procedure and materials: Egg task prompt: "Ensure that a hen's egg dropped from a height of 10 m does not break." ChatGPT prompts: (1) "You are a designer and are asked to propose as many original solutions as possible to the following problem: 'Make sure that a hen's egg, dropped from a height of 10 m, does not break.'" (2) "You may now assign a creativity score to each of your ideas, from 1 to 7 points (1 = not at all creative, 7 = highly creative)." Responses were categorized into 10 classes defined by C-K theory with three dominant categories (fixation) and seven representing expansion.
Data analysis: Descriptive analyses for ChatGPT (n = 12) used non-parametric statistics due to non-normality and the nature of LLM outputs. Comparative analyses used two approaches: (1) Bootstrap ChatGPT observations per variable (n = 50) compared to individual human data (n = 47) via Mann–Whitney U tests; (2) Mean-based bootstrap resampling of ChatGPT runs (n = 50) compared against weighted human reference values via one-sample t-tests when normality held. Significance threshold α = 0.05. Analyses were conducted in Jamovi (v2.3.21). Full code and data handling details are available on OSF (https://osf.io/cqpmr/?view_only=ef11630f521a4cdd95dd9200e1c7ea40).
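The two bootstrap approaches can be sketched as follows. This is an illustration under stated assumptions, not the authors' Jamovi workflow: the function names are hypothetical, the resample size for approach 2 (same size as the original 12 runs) is an assumption, and the fluency values are made up rather than taken from the study.

```python
import random
import statistics


def bootstrap_observations(values, n_draws, rng):
    """Approach 1: draw individual observations with replacement."""
    return [rng.choice(values) for _ in range(n_draws)]


def bootstrap_means(values, n_draws, rng):
    """Approach 2: each draw is the mean of a same-size resample."""
    k = len(values)
    return [statistics.mean(rng.choices(values, k=k)) for _ in range(n_draws)]


# Made-up fluency scores for 12 runs (not the study's data).
fluency_runs = [24, 25, 26, 27, 28, 28, 29, 30, 30, 30, 30, 31]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
obs = bootstrap_observations(fluency_runs, 50, rng)
means = bootstrap_means(fluency_runs, 50, rng)

print(statistics.median(obs))
print(round(statistics.mean(means), 2), round(statistics.stdev(means), 2))
```

Approach 1 yields values on the original scale, which suits a rank-based comparison with individual human scores. Approach 2 yields a distribution of means whose spread is much narrower than that of the raw runs, which helps explain why a comparison against a fixed reference value can produce very large t and d values.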
Key Findings
Descriptive (ChatGPT-4o):
• Fluency: median = 28.5 ideas (IQR ≈ 24.3–30).
• Category diversity: median = 6/10 categories (IQR ≈ 5.75–7).
• Fixation vs. expansion: Significantly more fixation than expansion ideas; fixation proportion about 76–80%.
• Self-assigned ratings: ChatGPT rated fixation and expansion ideas similarly (Mdn fixation ≈ 5.22; Mdn expansion ≈ 5.27; Wilcoxon p > 0.05), indicating limited differential evaluation of originality.
Individual-level comparison (ChatGPT bootstrap vs. 47 humans):
• Fluency: ChatGPT Mdn = 30 (25–30) vs. humans Mdn = 7 (6–9), U = 0, p < 0.001, r = 1.
• Category diversity: ChatGPT Mdn = 6 (6–7) vs. humans Mdn = 4 (3–5), U = 361, p < 0.001, r = 0.69.
• Number of fixation ideas: ChatGPT Mdn = 22.5 (18–27) vs. humans Mdn = 5 (4–6), U = 0, p < 0.001, r = 1.
• Number of expansion ideas: ChatGPT Mdn = 6 (4–10) vs. humans Mdn = 2 (1–3), U = 216, p < 0.001, r = 0.80.
• Subjective creativity ratings, fixation: ChatGPT Mdn = 5.22 (5.19–5.26) vs. humans Mdn = 3.92 (3.23–4.73), U = 195, p < 0.001, r = 0.84.
• Subjective creativity ratings, expansion: ChatGPT Mdn = 5.27 (5.07–5.48) vs. humans Mdn = 4.21 (3.33–5.00), U = 520, p < 0.001, r = 0.56.
• Proportion fixation: No significant difference; ChatGPT Mdn ≈ 78.6% (66.7–81.8%) vs. humans Mdn = 71.4% (57.7–84.5%), U = 991, p = 0.18.
Aggregate-level comparison (ChatGPT bootstrap means vs. weighted literature means):
• Fluency: ChatGPT M = 27.5 (SD = 1.27) vs. humans M = 7.91 (SD = 4.49), t(49) = 109.0, p < 0.001, d = 15.5.
• Category diversity: ChatGPT M = 6.09 (0.25) vs. humans M = 5.52 (1.83), t(49) = 16.3, p < 0.001, d = 2.30.
• Fixation ideas: ChatGPT M = 21.10 (1.55) vs. humans M = 5.87 (3.85), t(49) = 69.4, p < 0.001, d = 9.82.
• Expansion ideas: ChatGPT M = 6.44 (0.71) vs. humans M = 2.32 (1.94), t(49) = 40.9, p < 0.001, d = 5.78.
• Subjective ratings, fixation: ChatGPT M = 5.23 (0.04) vs. humans M = 3.65 (1.21), t(49) = 302.0, p < 0.001, d = 42.8.
• Subjective ratings, expansion: ChatGPT M = 5.35 (0.07) vs. humans M = 4.11 (1.33), t(49) = 182.0, p < 0.001, d = 25.7.
• Proportion fixation: ChatGPT M = 76.20% (SD = 3.04) vs. humans M = 72.06%, t(49) = 9.64, p < 0.001, d = 1.36.
Overall: ChatGPT greatly outperformed humans in fluency and explored more categories, generated more fixation and expansion ideas in absolute terms, exhibited a fixation bias comparable to humans in proportion (individual-level), and showed limited capability to differentially evaluate originality (similar creativity scores for fixation and expansion ideas).
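The reported effect sizes can be roughly cross-checked from the other statistics. The sketch below assumes that r is the rank-biserial correlation computed as 1 − 2U/(n1·n2), and that d is a one-sample Cohen's d using the SD of the bootstrap means; both are assumptions about how the values were obtained, and the function names are hypothetical. Small gaps from the published figures are expected because the inputs are rounded.

```python
import math


def rank_biserial(u, n1, n2):
    """Rank-biserial correlation from a Mann-Whitney U statistic.

    The sign depends on which group's U is reported; here U is taken
    to be the smaller of the two U values.
    """
    return 1 - 2 * u / (n1 * n2)


def cohens_d_one_sample(sample_mean, sample_sd, reference):
    """Cohen's d for a one-sample comparison against a fixed reference value."""
    return (sample_mean - reference) / sample_sd


def t_from_d(d, n):
    """One-sample t statistic implied by Cohen's d and sample size n."""
    return d * math.sqrt(n)


# Category diversity, individual level: U = 361 with n1 = 50 and n2 = 47.
print(round(rank_biserial(361, 50, 47), 2))  # 0.69

# Category diversity, aggregate level: M = 6.09, SD = 0.25, reference 5.52.
d = cohens_d_one_sample(6.09, 0.25, 5.52)
print(round(d, 2), round(t_from_d(d, 50), 1))  # 2.28 16.1
```

The recomputed values (0.69, 2.28, 16.1) sit close to the reported r = 0.69, d = 2.30, and t(49) = 16.3. The very large d values in the aggregate comparison follow from dividing by the small SD of bootstrap means rather than by the SD of individual scores.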
Discussion
Findings address the central question by showing that ChatGPT-4o, while highly fluent and more diverse in category exploration, reproduces fixation patterns similar to humans: most ideas fall into dominant, conventional categories. It also generates more expansion ideas than humans in absolute terms, indicating a capacity to propose novel associations; however, these are embedded in a larger set of conventional outputs. Critically, the model does not reliably distinguish between fixation-based and expansion ideas in its self-assigned creativity ratings, unlike humans, who tend to rate genuinely original ideas higher. This implies a limitation in simulated conflict detection or differential evaluation. The results suggest that fixation may arise from training data distributions and/or from probabilistic next-token prediction favoring frequent continuations. Prompt engineering aimed at suppressing conventional ideas might be insufficient if the model cannot internally discriminate originality from conventionality. Consequently, human oversight remains essential for evaluating and filtering AI-generated ideas. The study highlights opportunities for human–AI collaboration: leveraging the model’s fluency and broad semantic reach while relying on human judgment to select, refine, and contextualize truly creative, appropriate solutions. Future work should test whether explicit creativity-oriented prompts or theory-informed strategies (e.g., C-K theory, distant semantic association cues) can reduce fixation and improve differential evaluation, and examine human evaluators’ effectiveness in selecting the most creative and feasible AI-generated ideas.
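Differential evaluation can be made concrete as a per-run rating gap. The sketch below is one simple way to express it, not the study's analysis (which used a Wilcoxon test); the function name `rating_gap` and the ratings are hypothetical.

```python
import statistics


def rating_gap(ideas):
    """Median expansion rating minus median fixation rating for one run.

    `ideas` is a list of (is_fixation, rating) pairs, ratings on the 1-7 scale.
    """
    fixation = [r for is_fix, r in ideas if is_fix]
    expansion = [r for is_fix, r in ideas if not is_fix]
    return statistics.median(expansion) - statistics.median(fixation)


# Hypothetical run: ratings are made up for illustration.
run = [(True, 5), (True, 6), (True, 5), (False, 5), (False, 6), (False, 5)]
print(rating_gap(run))  # 0: both kinds of idea get the same median rating
```

A gap near zero across runs corresponds to the pattern reported for ChatGPT-4o, whereas a rater who detects the conflict between the instruction and a conventional idea would show a clearly positive gap.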
Conclusion
Generative models like ChatGPT-4o can markedly accelerate ideation: they produce many ideas across diverse categories and, in absolute terms, more expansion ideas than humans. Yet they display a fixation bias comparable to humans and show limited ability to differentially assess originality, treating conventional and novel ideas as similarly creative. Therefore, such systems should be viewed as powerful cognitive assistants rather than autonomous creative agents. Human involvement remains crucial to evaluate, filter, and refine outputs to achieve genuinely creative and context-appropriate solutions. Future research should expand to additional creativity tasks and domains, incorporate independent human ratings (including feasibility), explore the effects of transparency about the source (AI vs. human), and test theory-informed prompting to improve originality and evaluative discrimination.
Limitations
• Model sampling: All outputs originated from a single model instance (ChatGPT-4o); repeated generations may be correlated, limiting generalizability. Including diverse architectures, versions, or independently trained instances would broaden representation. Larger numbers of human participants and LLM generations would increase power. Methods accounting for dependency (e.g., mixed-effects models, dependency-adjusted resampling) are recommended.
• Aggregated human benchmarks: Comparisons to literature relied on weighted means from prior studies without individual-level data, limiting meta-analytic rigor and masking within-study variability. A full meta-analysis with original datasets would strengthen normative comparisons.
• Task scope: Only the egg task was used, which may constrain generalizability across creative domains. Replication with varied tasks (e.g., analogical reasoning, storytelling, design) is needed.
• Evaluation scope: Creativity ratings focused on originality and did not separately quantify feasibility. Future work should include multidimensional assessments (originality, feasibility, usefulness).