YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

Computer Science

A. Nandy, Y. Agarwal, et al.

Can AI spot a joke in a picture? This paper introduces three tasks—satirical image detection, satirical image understanding, and satirical image completion—and releases YesBut, a 2,547-image dataset (plus 119 real satirical photographs) showing that current vision-language models struggle with all three in zero-shot settings. This research was conducted by Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, and Niloy Ganguly.
Introduction

Satire uses irony or exaggeration to criticize or mock aspects of society; comprehending it demands recognizing conflicting scenarios and entity interactions, reading any embedded text, and applying commonsense reasoning. Satirical images on social media often juxtapose a normal scenario with a conflicting one to create irony. The paper asks whether existing vision-language models can decipher satire in images and proposes three evaluation tasks—detection, understanding, and completion—each requiring reasoning about punchlines and twists. The YesBut dataset is built to enable this evaluation, with detailed annotations and multiple artistic styles.

Literature Review

Prior work on satire and humor has focused on text-only satire detection, multimodal satire detection, multimodal humor detection, meme/joke captioning, and related benchmarks (e.g., MemeCap, MET-Meme). WHOOPS is a benchmark of unconventional images for commonsense-challenging tasks (captioning, matching, VQA, explanation). However, no prior work comprehensively and simultaneously evaluates VL models' abilities in satire detection, understanding, and completion. The rise of multimodal models (e.g., LLaVA, MiniGPT4, Kosmos-2, Gemini, GPT-4) has yielded state-of-the-art performance on many tasks, often relying on shared image-text embedding spaces (e.g., CLIP). YesBut is designed to probe satire comprehension beyond recognition and captioning, contrasting with earlier datasets in its presence of embedded text, its two-sub-image composition, and its mix of artistic styles.

Methodology

Data creation and annotation occurred in four stages.

Stage 1: 283 satirical images were collected (with consent) from the @yesbut account on X (Twitter). Each image consists of two colorized sketch sub-images: the left shows a normal scenario; the right contradicts or pokes fun at it, creating the satire.

Stage 2: Five qualified annotators wrote textual descriptions for the left and right sub-images and an overall description capturing the punchline. They also annotated categorical/binary features: presence of text in the left/right sub-images, whether the sub-images are connected (i.e., could be parts of a single larger image), and annotation difficulty (EASY, MEDIUM, or HARD, depending on whether internet help was needed).

Stage 3: To expand stylistic diversity, DALL-E 3 generated 2D stick-figure sub-images from the Stage 2 descriptions using a standardized prompt. New combinations were formed by mixing original and generated sub-images; each combined image was manually labeled as satirical or non-satirical, adding 302 satirical and 547 non-satirical images.

Stage 4: Similarly, 3D stick-figure sub-images were generated with a 3D prompt and combined in five ways; manual labeling added 499 satirical and 916 non-satirical images.

Quality validation: A separate annotator evaluated 25 satirical and 25 non-satirical images randomly sampled from Stages 3 and 4, agreeing with the assigned labels 94% of the time, indicating high dataset quality.

Dataset analysis: Topic modeling on the sub-image descriptions via BERTopic yielded seven topics, which were further elaborated with ChatGPT. For diversity visualization, CLIP image embeddings projected via UMAP showed that the 2D/3D stick-figure images are semantically diverse and distant from the originals despite sharing descriptions.

Tasks and evaluation: Three tasks were defined—Satirical Image Detection (binary classification over all 2,547 images); Satirical Image Understanding (describing each sub-image and answering "Why is this image funny/satirical?" for 1,084 satirical images); and Satirical Image Completion (choosing the correct sub-image option to complete the satire; 150 curated samples). Detection and completion were evaluated in zero-shot and zero-shot chain-of-thought setups; understanding was evaluated zero-shot. Metrics: accuracy and F1 for detection; BLEU, ROUGE-L, METEOR, BERTScore, and the image-based Polos metric for understanding; accuracy for completion.

Human evaluation: 30 images were sampled across stages, and model-generated overall descriptions were rated by three annotators (majority vote) on correctness, appropriate length, visual completeness, and faithfulness.
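The detection metrics above—accuracy and F1 over binary satirical/non-satirical labels—are standard and can be sketched in a few lines. This is a minimal illustration, not the paper's released evaluation code; the function and label names are assumptions for the example.

```python
# Minimal sketch of the detection metrics: accuracy and F1, treating
# "satirical" as the positive class. Names here are illustrative.

def detection_metrics(gold, pred, positive="satirical"):
    """Return (accuracy, f1) for binary satire detection."""
    assert len(gold) == len(pred)
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    correct = sum(g == p for g, p in zip(gold, pred))
    accuracy = correct / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Toy example: four model predictions against gold labels.
gold = ["satirical", "satirical", "non-satirical", "non-satirical"]
pred = ["satirical", "non-satirical", "satirical", "non-satirical"]
acc, f1 = detection_metrics(gold, pred)  # acc = 0.5, f1 = 0.5
```

In the paper's setting, `gold` would hold the manual satirical/non-satirical labels for all 2,547 images and `pred` the model's zero-shot (or zero-shot CoT) answers.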

Key Findings

Satirical Image Detection (Table 3): No model exceeded 60% accuracy or F1. Best accuracy: Kosmos-2 (zero-shot CoT) at 56.97%. Best F1: Kosmos-2 (zero-shot) at 59.71%. Chain-of-thought improved accuracy in only 2/5 models and F1 in 1/5, suggesting limited reasoning gains for detection.

Satirical Image Understanding (Fig. 5; Table 6): Average automated metric values (normalized 0–1) were below 0.4 across models. Understanding of the overall punchline (WHYFUNNY) decreased in Stages 3/4 (mixed artistic styles) compared to Stage 2. Kosmos-2 outperformed LLaVA and MiniGPT4 among open-source models; MiniGPT4 consistently underperformed. In most cases, models understood sub-images better than entire images. Presence of text in images improved performance in 15/20 metric cases across models (Table 8). Difficult images correlated with lower semantic performance (Table 7). The Polos metric (Table 9) showed all models performing poorly, with Gemini and GPT-4 slightly higher than the others across stages.

Satirical Image Completion (Table 4): Chain-of-thought helped 3/5 models; MiniGPT4 showed the largest CoT improvement among open-source models (from 40.00% to 60.67%). Best overall performance was Gemini (zero-shot: 61.11%; zero-shot CoT: 61.81%).

Human evaluation (Figure 14): Even the best model lagged human performance by large margins on correctness (−40 points), appropriate length (−43.33), visual completeness (−33.33), and faithfulness (−36.66).

Real photographs (Table 5): On a set of 119 real satirical images, detection and understanding remained challenging—three of five models performed poorly on detection, and all models achieved less than 50% accuracy on understanding. GPT-4 had the highest detection accuracy (93.27%) and 46.22% on understanding; Gemini had 80.67% detection and 19.33% understanding; Kosmos-2 had 66.39% detection and 10.92% understanding.

Discussion

Results show that state-of-the-art VL models struggle to detect, understand, and complete satirical images, particularly when text is absent and multiple artistic styles coexist within a single image. Limited benefit from zero-shot chain-of-thought for detection indicates insufficient reasoning about satire without explicit training or context. Understanding the overall satirical punchline is notably harder than describing sub-images, consistent with the need for higher-level commonsense and social reasoning. Performance is sensitive to the presence of textual cues and image difficulty, aligning with human judgments. The Polos metric and human evaluation corroborate that models’ generated explanations are often incomplete, unfaithful, or incorrect. Collectively, these findings suggest large room for improvement in multimodal satire comprehension and highlight YesBut as a challenging benchmark for advancing cross-modal reasoning, grounding, and societal context understanding.

Conclusion

YesBut is introduced as a high-quality, annotated, multimodal dataset to evaluate vision-language models on satirical image detection, understanding, and completion. Systematic benchmarks in zero-shot and zero-shot chain-of-thought settings reveal that models underperform across tasks, especially in understanding overall punchlines and detecting satire without textual cues. The dataset’s design—sub-images with different artistic styles and frequent absence of text—creates a challenging testbed that exposes current limitations. Future work should explore improved multimodal grounding, explicit commonsense and social reasoning, better handling of stylistic variations, and training/evaluation regimes beyond zero-shot. The release of an additional set of real satirical photographs further supports research on practical satire comprehension.

Limitations

Annotations involve subjective judgments and background knowledge, which may vary among annotators; manual reviews were conducted but some subjectivity remains. The work is currently limited to English, with planned extensions to other languages.
