Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology

Computer Science

B. A. Halperin and S. M. Lukin

Discover an anthology of 100 visual stories that delve into narrative intelligence and the art of storytelling, created by Brett A. Halperin and Stephanie M. Lukin. The research showcases themes ranging from sensory experience to mood modulation, informing automated story generation through rich data collection techniques.
Introduction

The paper investigates how humans create stories from sequences of images in order to inform the development of computational narrative intelligence. The central question is: what are the varied ways in which human authors approach the same systematic creative process of improvised story-building from image sequences? The authors argue that understanding human creative processes, beyond literal image descriptions, is crucial before modeling automated visual storytelling systems, especially given the risk of reproducing human biases at scale. They situate the work within the longstanding interplay between imagery and language, emphasizing that creative visual storytelling goes beyond describing scenes to constructing narrative arcs that may extend beyond what is depicted. The study aims to establish an anthology of human-authored visual stories and to analyze them to derive themes and criteria that can guide data collection and model design for automatic, creative, and responsible visual storytelling. Its importance lies in capturing creativity, grounding, and bias awareness to inform future AI systems that generate visual stories.

Literature Review

The related work surveys subjectivities in storytelling and in systems, highlighting that image-based storytelling involves both literal (denotation) and symbolic (connotation) messages (Barthes), and that multiple interpretations of the same images are common. Prior datasets like VIST separate image descriptions from narratives but often constrain creativity through image-to-sentence alignment and an emphasis on objective description, leaving little room for backstories or offstage conflicts. HCI scholarship emphasizes the plurality of stories and cautions against a false objectivity that can mask biases; work on critical fabulation and bias in computer vision underscores the harms of algorithmically scaled bias. Existing visual storytelling resources over-rely on high-quality, canonical images (e.g., from Flickr), neglecting low-resource, low-quality, and atypical perspectives that may elicit different narratives. On the modeling side, many image-to-text systems prioritize objective summarization or event description and optimize for a single best continuation, limiting creative expression. Some systems use captions as seeds for stylistic storytelling (e.g., Pix2Story), but evaluations indicate a need for creativity that remains grounded in the visuals. Top-down models trained on crowdsourced text often lack controllability and explainability, constraining creativity. The authors identify a gap: approaches that elicit and analyze human creative storytelling that is grounded in images while still allowing imagination, with attention to bias and to diverse image sources.

Methodology

The study designed and deployed a crowdsourced writing protocol on Amazon Mechanical Turk (AMT) to collect an anthology of human-authored visual stories from sequences of three images, modeling a beginning–middle–end progression. Each anthology item was written by a single author responding to one image sequence through four structured facets:

  • Entity Facet (What is here?): For each image, authors listed the visible entities/objects that resonated with them, with brief characteristics and confidence ratings; the list did not need to be exhaustive.
  • Scene Facet (What happens here?): For each image, authors provided factual, spatiotemporal descriptions (location, time, activities), building on entities.
  • Narrative Facet (Tell a story): Multi-image storytelling across images as they were revealed sequentially (image 1, then image 2, then image 3). Authors adapted and extended their story with each new image; minimum of four complete English sentences per narrative entry.
  • Title Facet (Story title): An open-ended title synthesizing the story across the three images.

Task flow and interface: The HIT progressed page by page. Page 1 showed image 1 with its Entity and Scene facets. Page 2 introduced image 2 with its Entity and Scene facets and the multi-image Narrative Facet covering images 1–2. Page 3 introduced image 3 with its Entity and Scene facets, the Narrative Facet over images 1–3 (with the prior narrative shown for reference), and the Title Facet. Authors could not edit prior facet entries. The Scene facet required at least two sentences; the Narrative facet required at least four complete, fluent English sentences. Example responses were provided, and authors were encouraged to be creative. Compensation was $2.50 per HIT (estimated 10–15 minutes).

Participants: AMT workers with at least 100 prior approved HITs, located in Australia, Canada, the UK, or the US. Responses were manually reviewed; the only grounds for rejection were failing to follow the English completeness and fluency instructions, and rejected workers were disqualified from future batches. A demographic survey was requested but not enforced, resulting in missing metadata.

Image sources: The 20 image sequences (three images each) came from two sources: (1) Flickr images (Creative Commons), largely high-quality everyday scenes adapted from VIST, downselected to three images, with people and children in clear focus removed for privacy; and (2) Search-and-Rescue (SAR) scenarios from prior human-robot studies (low-resolution, dimly lit, robot-mounted camera views of unfinished indoor spaces), selected to scaffold quest-like plots, to diversify environment and presentation as a low-resource domain, and for practical relevance.

Dataset statistics: Five different authors wrote for each of the 20 image sequences, producing 100 anthology items (stories) from 73 unique authors. Totals: 300 Entity entries and 300 Scene entries (one per image per item), 200 Narrative entries (two per item: after image 2 and after image 3), and 100 Title entries, yielding 100 complete, titled stories.

Analysis approach: Two annotators (the co-authors) conducted applied qualitative analysis for narrative knowledge engineering using thematic analysis and close reading. They iteratively coded all 100 stories across four rounds, writing memos and meeting to consolidate codes, then reviewed all stories one by one to finalize themes. Close reading attended to diction, figurative language, symbols, voice, and style to surface ambiguities, tensions, and patterns. The objective was to derive themes intrinsic to the data and to inform criteria for computational visual storytelling.
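
For readers who want to work with the anthology programmatically, the structure described above maps naturally onto a small record type. The sketch below is illustrative only, written in Python under stated assumptions: the field names (sequence_id, author_id, entities, scene, narratives, title) are invented for exposition rather than taken from the released data, and the assertions simply re-derive the reported totals.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ImageFacets:
    """Per-image responses: the Entity and Scene facets for one of the three images."""
    entities: List[str]   # resonant entities/objects with brief characteristics (not exhaustive)
    scene: str            # factual, spatiotemporal description (at least two sentences)

@dataclass
class AnthologyItem:
    """One author's complete response to one three-image sequence."""
    sequence_id: str
    author_id: str
    images: List[ImageFacets]   # exactly three, one per image
    narratives: List[str]       # two entries: after image 2 and after image 3
    title: str

# Re-deriving the reported totals: 20 sequences x 5 authors = 100 items.
n_items = 20 * 5
assert n_items == 100        # anthology items (complete, titled stories)
assert n_items * 3 == 300    # Entity entries, and likewise Scene entries
assert n_items * 2 == 200    # Narrative entries (after image 2 and after image 3)
assert n_items * 1 == 100    # Title entries
```
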
Key Findings

From thematic analysis of 100 human-authored visual stories over 20 image sequences (73 unique authors), five cross-cutting themes emerged:

  1. Narrating What is in Vision vs. Envisioning: Authors varied along a spectrum from (a) captioning literal descriptions tied directly to visible content, to (b) commenting plausible deductions that extend beyond literal observation, to (c) contriving deviations that introduce off-screen elements or events not depicted (e.g., imagined characters like a troll), illustrating tension between grounding and imagination.
  2. Dynamically Characterizing Entities/Objects: Authors treated depicted items along a continuum: overlooked, static (inanimate), interactive (engaged by characters), active (imbued with agency affecting characters), or personified (ascribing human traits/names and roles), demonstrating flexible use of objects as plot devices (e.g., a plant named Diane influencing team morale).
  3. Sensing Experiential Information About the Scenery: Narratives ranged from unisensory (primarily visual) to multisensory (adding sound, smell, touch, and even implied taste), with multisensory storytelling often heightening aesthetics and plot tension (e.g., library story using smell of books, silence, and tactile comfort to build suspense).
  4. Modulating the Mood: Authors perceived initial moods (often gloomy in SAR scenes) and either intensified them (“gloom to doom” trajectories to ominous outcomes) or reversed them (“gloom to bloom” toward hopeful resolutions), using mood as a plot driver.
  5. Encoding Narrative Biases: Stories exhibited linguistic/cultural biases (e.g., recognizing or misrecognizing Arabic text and the cultural inferences attached to it), perspective biases (e.g., Indigenous versus colonial viewpoints on the same landscape), and casting biases (e.g., stereotypes about construction workers, gender roles, and people experiencing housing insecurity), demonstrating the risk of reproducing harmful tropes in story data.

Together, these findings motivate five narrative intelligence criteria for computational visual storytelling: creative, expressive, responsible, reliable, and grounded. The theme labels themselves can also be read as an annotation vocabulary, as sketched below.
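
If the themes were operationalized as labels for future data collection or annotation, they could seed a small coding schema. The Python sketch below is a hypothetical illustration: the enum names and value strings paraphrase the themes summarized above and are not the authors' released codebook.

```python
from enum import Enum

class VisionEnvisioning(Enum):
    """Theme 1: spectrum from literal grounding to imaginative expansion."""
    CAPTIONING = "literal description tied to visible content"
    COMMENTING = "plausible deduction beyond literal observation"
    CONTRIVING = "deviation introducing off-screen elements or events"

class EntityCharacterization(Enum):
    """Theme 2: continuum of how a depicted entity/object is used."""
    OVERLOOKED = "not mentioned in the narrative"
    STATIC = "mentioned as an inanimate detail"
    INTERACTIVE = "engaged by characters"
    ACTIVE = "imbued with agency affecting characters"
    PERSONIFIED = "ascribed human traits, a name, or a role"

class SensoryMode(Enum):
    """Theme 3: experiential information about the scenery."""
    UNISENSORY = "primarily visual description"
    MULTISENSORY = "adds sound, smell, touch, or implied taste"

class MoodTrajectory(Enum):
    """Theme 4: how a perceived initial mood is modulated."""
    GLOOM_TO_DOOM = "initial mood intensified toward an ominous outcome"
    GLOOM_TO_BLOOM = "initial mood reversed toward a hopeful resolution"

class NarrativeBias(Enum):
    """Theme 5: kinds of bias observed in the stories."""
    LINGUISTIC_CULTURAL = "language recognition and cultural inference"
    PERSPECTIVE = "whose viewpoint frames the same scene"
    CASTING = "stereotyped roles assigned to depicted people"
```
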
Discussion

The analysis shows that human visual storytelling involves tensions between literal grounding and imaginative expansion, dynamic use of entities/objects as plot devices, experiential rendering of the scene through multisensory cues, mood-driven plot trajectories, and pervasive narrative biases. The authors propose narrative intelligence criteria (creative, expressive, responsible, reliable, grounded) to guide dataset curation, labeling, and model design. Foregrounding creative and expressive facets in data collection (e.g., separating Scene from Narrative and allowing multi-image storytelling without strict image-to-sentence alignment) can elicit richer, multisensory narratives that go beyond captions. Responsible practices require recognizing and addressing biases in inputs (who authors the stories, which images are used, what cultural and linguistic elements are present) and outputs (a plurality of perspectives rather than a single objective storyline), including annotating or filtering problematic content and pursuing representative data sourcing. Reliable and grounded generation entails aligning plot trajectories and entity/object characterization with the boundaries of the depicted storyworld, using guardrails to ensure that critical visual information is incorporated and that stories remain coherent with the images, while still permitting creative variation (e.g., mood-based plot modulation). These considerations challenge single-best-story optimization and encourage plural, feasible continuations, as illustrated in the sketch below.
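
To make that last point concrete, one way to move away from single-best-story optimization is to return several grounded candidates instead of the argmax. The helper below is a hypothetical Python sketch, not the authors' method: candidates, scores, and the is_grounded guardrail are assumed stand-ins for whatever generator and grounding check a particular system provides.

```python
from typing import Callable, List, Tuple

def plural_continuations(
    candidates: List[str],                # hypothetical candidate story continuations
    scores: List[float],                  # hypothetical model scores, one per candidate
    is_grounded: Callable[[str], bool],   # hypothetical guardrail: stays within the depicted storyworld
    k: int = 3,
) -> List[str]:
    """Return up to k feasible continuations instead of a single 'best' story.

    Candidates that fail the grounding guardrail are filtered out; the remainder
    are ranked by score, preserving a plurality of perspectives in the output.
    """
    grounded: List[Tuple[float, str]] = [
        (score, text) for score, text in zip(scores, candidates) if is_grounded(text)
    ]
    grounded.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in grounded[:k]]
```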

Conclusion

The paper presents an anthology of 100 crowdsourced visual stories elicited via a four-facet paradigm (Entity, Scene, Narrative, Title) and analyzed through thematic analysis and close reading. Five themes emerged: balancing vision and envisioning; dynamic characterization of entities/objects; unisensory versus multisensory experiential narration; mood modulation (gloom to doom/bloom); and encoded narrative biases across language, culture, perspective, and casting. From these, the authors propose narrative intelligence criteria—creative, reliable, expressive, grounded, responsible—to guide collection, labeling, and generation for computational visual storytelling. Future work includes releasing the anthology for community analysis and theory-building, and developing theoretical frameworks predicting that storytellers following the improvised process will confront the vision–envisioning tension. The dataset is intended to inform more equitable, expressive, and coherent visual story generation systems.

Limitations

  • Demographics: The demographic survey on AMT was optional and not enforced; lack of participant metadata limits analysis of how author backgrounds influence narratives and biases.
  • Sampling and representativeness: Authors were limited to certain English-speaking regions (AU, CA, UK, US), potentially limiting cultural diversity; five authors per image sequence may limit coverage of narrative variability.
  • Image domains: Flickr images emphasized everyday scenes with privacy-filtered people; SAR images lacked people and were low-resolution. Limited inclusion of people may under-represent certain bias dynamics (e.g., racial bias).
  • Scale: 100 stories provide qualitative depth but may be limited for training large-scale models or generalizing all observed patterns.
  • Task constraints: Authors could not edit earlier responses; sequential reveal supports improvisation but may bias narrative development compared to holistic planning.
  • No automated model evaluation: The study focuses on human-authored data and analysis; it does not implement or quantitatively evaluate automated storytelling models against the proposed criteria.