Experimental narratives: A comparison of human crowdsourced storytelling and AI storytelling

Interdisciplinary Studies

N. Beguš

This research by Nina Beguš introduces a framework that merges behavioral and computational experiments, using fictional prompts to surface cultural templates and social biases in storytelling. Analyzing 250 human-written stories and 80 AI-generated narratives, it finds that AI, particularly GPT-4, portrays more progressive themes in gender roles and sexuality, while human storytelling remains more imaginative. The study shows how fiction can serve as a lens on both human and AI social imaginaries.
Introduction

The study examines how fictional prompts can be used to investigate the collective cultural imagination and social biases in storytelling by humans and large language models (LLMs). Motivated by the growing role of LLMs in creative domains and their training on large textual corpora that include fiction, the paper seeks to compare human- and AI-generated narratives under controlled conditions. Using the Pygmalion myth (a human creates and falls in love with a humanoid) as a culturally salient trope, the research asks: (1) How do humans and LLMs render themes, settings, and narrative structures when given identical prompts? (2) What gender and racial/ethnic representations and biases emerge? (3) Do state-of-the-art LLMs exhibit innovation or creativity comparable to non-professional human writers? The work situates these questions within the broader context of artificial humanities, arguing that cultural imagination shapes technology design and use, and that fiction offers a lens on human–AI imaginaries.

Literature Review

The paper reviews scholarship on the Pygmalion myth and its evolution from artistic creation to scientific/technological creation, noting its dominance in Western literature and film and its frequent framing as a male fantasy (e.g., Ovid, Shaw, Shelley; analyses by Gross, Marshall, Joshua, Hersey, Stoichita). It traces the trope’s modern resonance in technology culture (e.g., inspirations cited by robotics/AI creators; media like Her, Ex Machina, Westworld, Black Mirror). Related work in computational creativity and narrative understanding is discussed, including assessments of creativity in humans and machines (Baer; Beaty & Johnson; Koivisto & Grassini), narrative theory for computational narratives (Piper et al.), and prior findings on LLM bias and representation (Bender et al.; Li & Bamman; Huang et al.). The review highlights that Pygmalionesque fiction historically centers on gendered dynamics (male creator, female creation), with limited attention to race/ethnicity, and situates the current study as a comparative, prompt-controlled investigation bridging narratology and data-driven analysis.

Methodology

Design: A two-part experiment used identical fictional prompts to elicit short stories from (a) human crowdworkers and (b) LLMs. Prompts: (1) Prompt 1: A human created an artificial human. Then this human (the creator/lover) fell in love with the artificial human. (2) Prompt 2: A human (the creator) created an artificial human. Then another human (the lover) fell in love with the artificial human. Prompts were neutral in role labels to minimize bias.

Human behavioral experiment: Conducted June 11, 2019 on Amazon Mechanical Turk (before public LLM availability). N=250 US-based, native English-speaking participants; random assignment to Prompt 1 or Prompt 2 (125 each). Instruction: write a 150–500-word story (minimum 1000 characters; no character counter shown). After the story, ~30 brief follow-up questions (fewer than half mandatory) captured story details for quantitative analysis, plus demographics. Eight submissions were non-fictional reflections; eleven contained material copied from online sources. Demographics: gender 56.4% male (141), 42.8% female (107), 0.8% non-binary (2); race 72.8% White (182), 10% Black/African American (25), 8% Asian (20), 4% Hispanic/Latino (10), 5.2% multiracial (13); education 26.8% high school, 60% college, 13.2% MA/PhD/JD/MD.

Computational experiment (GPT): Conducted March 17, 2023 using OpenAI GPT-3.5 and GPT-4 with default settings (no temperature or token adjustments). Reported training cutoff: September 2021. Each model generated 40 stories (20 per prompt), for 80 in total. For each prompt–model pair, 10 responses were generated in the same window across separate chats and 10 in new windows, to gauge potential context effects.

Open-source comparison (Llama 3): Conducted May 22, 2024 using Meta Llama 3 70B, which generated 50 stories under the same protocol (20 per prompt, plus 5 additional “playground” generations per prompt). Default parameters: temperature 0.5, top-p 0.9, max length ~2048 tokens; playground runs used temperature 1.0 and top-p 0.6. Outputs were qualitatively compared with GPT-4.

Additional probes: (a) Narrative/narratology evaluation using elements such as plot, discourse, setting, time/space, and narration/filter, informed by a grading rubric (Chakrabarty et al.). (b) Quantitative analysis of social aspects via inferential statistics: logistic regressions tested effects of prompt and participant factors on character gender distributions, with model selection via AIC. (c) “Data archaeology” of LLM familiarity with Pygmalionesque works via a cloze-style membership-inference query (Chang et al.), probing memorization of literary/film sources. (d) Limited exploratory prompting of GPT-4 in playground mode (e.g., “You are a fiction writer,” temperature ~1.1) to observe changes in narrative features.
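The logistic-regression-with-AIC step in (b) can be sketched as follows. This is a minimal illustration on simulated data, not the paper's analysis: the predictor `writer_male`, the simulated coefficients, and the sample size are all hypothetical, chosen only to mirror the reported setup (binary outcome = whether the artificial human is female; candidate predictors compared by AIC).

```python
# Sketch: fit logistic regressions predicting whether the artificial human is
# female, then compare candidate models by AIC (lower is better).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 300
prompt = rng.integers(0, 2, n).astype(float)       # 0 = Prompt 1, 1 = Prompt 2
writer_male = rng.integers(0, 2, n).astype(float)  # hypothetical participant factor
# Simulated outcome: female creation likely overall, less likely under Prompt 2
p = 1.0 / (1.0 + np.exp(-(1.67 - 0.74 * prompt)))
female = rng.binomial(1, p).astype(float)

def fit_logit(X, y):
    """Maximize the Bernoulli log-likelihood; return (coefficients, AIC)."""
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    def nll(beta):
        z = X @ beta
        # log(1 + e^z) computed stably via logaddexp(0, z)
        return float(np.sum(np.logaddexp(0.0, z) - y * z))
    res = minimize(nll, np.zeros(X.shape[1]), method="BFGS")
    aic = 2 * X.shape[1] + 2 * res.fun  # AIC = 2k - 2 log L
    return res.x, aic

coef_p, aic_prompt = fit_logit(prompt[:, None], female)
coef_b, aic_both = fit_logit(np.column_stack([prompt, writer_male]), female)
print(round(aic_prompt, 1), round(aic_both, 1))
```

With a real prompt effect and an uninformative extra predictor, the prompt-only model will usually achieve the lower AIC, which is the kind of comparison the paper's model selection performs.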

Ethics and data: IRB-approved (University of Washington, STUDY00007637). Informed consent obtained electronically. Data available at OSF: https://doi.org/10.17605/OSF.IO/K6FH7.

Key Findings

Themes and cultural imaginary: All 330 stories (human and LLM) reproduced a Pygmalionesque plot, confirming the trope’s pervasiveness. Every story framed the scenario via science/technology (AI, robots/androids, cyborgs, chatbots, sexbots, bio-humanoids). Human stories exhibited broader thematic variety (loneliness, grief, obsession, serendipity, violence, social disapproval, emancipation) and occasional original twists (e.g., creator replaced by creation; two artificial humans in love), whereas GPT outputs—especially GPT-3.5—were thematically homogeneous, generic, and moralizing (e.g., “love knows no boundaries”).

Gender and sexuality (human): Character counts skewed traditional but diversified relative to the canon. Overall character genders across human stories: 56% male (350), 41% female (256), 2% non-binary (12), 1% no gender (6). Artificial humans: female 68.8%, male 26.8%, no gender 2.4%, non-binary 2.0%. Same-sex relationships appeared in 7.3% of stories; 6.8% of respondents labeled relationships as pan/omni/digisexual/other. Logistic regression showed prompt as the only significant predictor of artificial human gender: female more frequent overall (β=1.67, z=2.8, p=0.005) and significantly less frequent in Prompt 2 vs Prompt 1 (β=−0.74, z=−2.4, p=0.02). In Prompt 2, creators were more often male than lovers (OR≈0.18; Fisher’s exact p<1e−5, excluding non-binary).
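The Fisher's exact test reported above (creators more often male than lovers in Prompt 2, OR≈0.18) can be reproduced in form with `scipy.stats.fisher_exact`. The 2×2 counts below are illustrative placeholders, not the study's data; the true table is in the paper's OSF repository.

```python
# Sketch: Fisher's exact test on a role-by-gender contingency table.
from scipy.stats import fisher_exact

# Rows: creators, lovers; columns: female, male (hypothetical counts)
table = [[20, 100],
         [70, 50]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(odds_ratio, 3), p_value)
```

An odds ratio well below 1 here means creators are far less often female (i.e., more often male) than lovers, matching the direction of the reported effect.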

Gender and sexuality (LLMs): Among 80 GPT stories, the modal pattern (35%) was male creator–female creation, but GPT-4 frequently cast female creators and increased gender/sexual diversity. Excluding 67 non-gendered character instances (largely GPT-3.5), gendered characters numbered 57 male vs 76 female. Combined regression (human+GPT) confirmed fewer female artificial humans in Prompt 2 vs Prompt 1 (β=−0.91, z=−2.77, p=0.006), with no significant interaction (a similar prompt effect in both experiments). GPT-4: female creators 21/40; female artificial humans 24/40; same-sex relationships in 12.5% of all GPT stories (all in GPT-4; 25% of GPT-4 outputs), including polyamory and all-female triads in Prompt 2. Overall, GPT models cast female creators in 25% of stories vs 10% in human writing.

Race and ethnicity: Human stories displayed more racial labeling primarily due to the questionnaire (artificial human: no race 22%, White 62.4%, other categories 16.8%; creators/lovers largely White 75.2%). Narrative descriptions rarely engaged with race/culture. GPT stories did not mention race/culture by default; when probed post hoc, GPT-4 assigned races with justifications that risked stereotypes or moralized takeaways.

Narrative skill: GPT-4 improved over GPT-3.5 in length, coherence, and richer language, aligning with prior evaluations (e.g., GPT-3.5 ~8.7/100 vs GPT-4 ~27.9/100 on creativity scales; human professional writing ~84.7/100). However, GPT-4 remained formulaic: standardized paragraphing (setup, relationship, ethics/society, external conflict, moralizing resolution), generic settings, descriptive “telling not showing,” limited dialogue/perspective, and platitudes. Human stories, though variable and sometimes unpolished, showed more originality, specificity, and narratological play. Llama 3 70B outputs were remarkably similar to GPT-4 in structure, tone, and techno-positive conclusions.

LLM familiarity/memorization: Cloze-style probes suggested memorization/familiarity with a range of Pygmalionesque texts and film scripts (e.g., Frankenstein, Pygmalion/My Fair Lady, Ex Machina, Her, Blade Runner, Metropolis), consistent with prior findings of LLMs’ preference for older classics and SF/fantasy.

Overall: GPT-3.5/4—especially GPT-4—were more progressive than human crowdworkers regarding gender roles/sexuality, but still retained biases in descriptive attributes (e.g., female artificial humans more often described by beauty/grace/kindness). Without tailored prompting/parameter tuning, LLM narratives were less imaginative in scenario and rhetoric than human-authored texts.

Discussion

The findings show that prompt-controlled experimental narratives can reveal shared cultural templates and social biases in both human and LLM storytelling. The universal recovery of Pygmalionesque plots underscores the myth’s deep embedding in collective imagination. LLMs, likely influenced by alignment training and broad web corpora including modern discourse, display more egalitarian casting (e.g., female creators, same-sex relationships) than non-professional human writers, yet they reproduce gendered descriptive biases and avoid racial/ethnic specificity unless prompted. The narrower thematic and narratological range of default LLM outputs suggests that, absent expert prompting and parameter adjustments, LLM creativity remains constrained and formulaic compared to human originality. These results connect back to the research aims by quantifying differences/similarities in representation, demonstrating prompt effects on gender distributions across human and AI, and highlighting the value of fiction as a diagnostic lens on human–AI imaginaries. The convergence across GPT-4 and Llama 3 indicates broader model-level tendencies rather than vendor-specific behavior, while cloze probes situate model knowledge within known cultural sources.

Conclusion

The paper contributes a mixed behavioral–computational framework for studying cultural imaginaries and social bias through fictional prompts, enabling direct comparisons between human and LLM storytelling. It shows: (1) strong, cross-source adherence to the Pygmalion paradigm; (2) LLMs (especially GPT-4) more frequently adopt progressive gender/sexuality roles than human crowdwriters, though descriptive biases persist; and (3) default LLM narratives are less imaginative in scenario and rhetoric and more formulaic/moralizing than human-authored stories, with improvements possible via targeted prompting and parameter tuning. The work proposes using such experimental narratives as baselines for future analyses (e.g., repeated MTurk cohorts to assess LLM use on platforms, expert creative-writing evaluations, richer prompt designs). It also outlines implications for technology design, HCI, and LLM training by providing comparative ground truths for narrative quality and representation. Future research should expand corpora, diversify prompts beyond well-known tropes, explore interactive co-creation workflows, and deepen quantitative modeling of narrative elements and bias.

Limitations
  • Sample and timing: Human data (2019) predates widespread LLM access; computational runs (2023–2024) used newer models, introducing temporal confounds.
  • Prompts: Highly generic Pygmalion prompts may channel formulaic responses in LLMs and cliché-driven human writing; trope familiarity limits novelty. Prompt effects on gender are strong and may overshadow other factors.
  • Corpus scope: Story counts are modest for some subgroup analyses; character-level statistics are smaller and relatively uniform vs large-scale literary corpora.
  • Data quality: A small portion of human submissions were non-fictional or copied. Crowdworkers are all US-based native English speakers, limiting cultural generalizability.
  • Model opacity: GPT training data are undisclosed; memorization assessments via cloze are indicative but not exhaustive. Default guardrails and settings constrain creativity; limited exploration of hyperparameters and interactive prompting.
  • Race/ethnicity measurement: Human racial diversity partly arose from post-story questionnaires; narratives seldom engaged race/culture. GPT avoided race unless asked, complicating direct comparative analysis of representation.