Target-Augmented Shared Fusion-based Multimodal Sarcasm Explanation Generation

P. Goel, D. S. Chauhan, et al.

Sarcasm often hides its target, and TURBO uncovers it. This study introduces TURBO, a Target-aUgmented shaRed fusion-Based sarcasm explanation model that fuses image and caption through a novel shared-fusion mechanism and explicitly models the target of sarcasm to generate clearer explanations. Evaluated on MORE+, TURBO outperforms baselines by an average of +3.3%, and human assessments find its explanations superior; the authors also examine multimodal LLMs in zero- and one-shot settings and note their limitations. This research was conducted by Palaash Goel, Dushyant Singh Chauhan, and Md Shad Akhtar.
Introduction

Sarcasm involves statements that, taken literally, suggest one meaning, but given context, imply the opposite, creating an incongruity between explicit and implicit meanings. Resolving this incongruity is crucial for interpreting sarcastic messages, especially in multimodal settings where visual and textual cues both contribute. Prior work on multimodal sarcasm explanation (MuSE) established datasets and models but underutilized visual cues, treated all knowledge relations equally, and neglected the explicit role of the sarcasm target in guiding explanation generation. This paper hypothesizes that (i) both text and image carry complementary, non-redundant signals essential for understanding sarcasm; (ii) external knowledge should be weighted by relevance; and (iii) explicitly providing the target of sarcasm will guide models toward the intended irony and improve explanation quality. To test these hypotheses, the authors extend the MORE dataset with target annotations (MORE+) and propose TURBO, a target-augmented shared fusion framework that integrates multimodal features with a relevance-aware knowledge graph to generate focused explanations.

Literature Review

Early sarcasm detection focused on text-only features and neural architectures (e.g., pattern-based, emoji/punctuation cues, and deep learning). Recognizing multimodal content, subsequent work fused textual and visual features (CNNs; later GCNs) and extended detection to dialogues (e.g., the MUStARD dataset) and code-mixed languages. Generative approaches transformed sarcastic text into non-sarcastic interpretations and explored sarcasm generation. For MuSE, ExMORE (BART-based) established a baseline on the MORE dataset, later improved by TEAM via a multi-source semantic graph and external knowledge. Despite TEAM's strong performance, its limitations include (a) conversion of visuals to textual metadata that may omit salient visual features; (b) equal weighting of extracted relations regardless of their varying importance; and (c) lack of explicit target guidance. TURBO addresses these with shared fusion of visual and textual representations, relevance-weighted knowledge, and explicit targets.

Methodology

Dataset: The study uses MORE+, an extension of MORE with manually annotated 'target of sarcasm' labels for 3,510 sarcastic multimodal posts (Train: 2,983; Val: 175; Test: 352). Average caption length ~19.7 tokens; average explanation length ~15.4 tokens; targets average ~4.2 tokens. Images and captions come from platforms such as Twitter, Instagram, and Tumblr.

Model overview (TURBO): TURBO integrates three components: (a) knowledge infusion using external concepts with relevance; (b) a novel shared fusion mechanism for multimodal representation learning; and (c) explicit target incorporation to guide explanation generation.

Visual semantics extraction (three granularities; a combined sketch follows this list):

  • Low-level: Generate a single natural language image description with BLIP (large) for each image.
  • Medium-level: Extract object entities via YOLOv9, retaining top-K objects by confidence (K=36), represented as text labels.
  • High-level: Obtain semantic-rich visual embeddings via a pre-trained ViT; features of size 50×768 projected to 256×768 with a learnable linear layer.
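
A minimal sketch of this three-step pipeline, assuming Hugging Face checkpoints for BLIP and a 32-pixel-patch ViT (CLIP's vision tower yields exactly the 50×768 grid the paper reports) plus the ultralytics package for YOLOv9; the specific checkpoints and projection wiring are illustrative assumptions, not the authors' reported configuration.

```python
# Illustrative sketch; checkpoint names and wiring are assumptions.
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPImageProcessor, CLIPVisionModel)
from ultralytics import YOLO

image = Image.open("post.jpg").convert("RGB")

# Low-level: a single natural-language description per image via BLIP.
blip_name = "Salesforce/blip-image-captioning-large"
blip_proc = BlipProcessor.from_pretrained(blip_name)
blip = BlipForConditionalGeneration.from_pretrained(blip_name)
out = blip.generate(**blip_proc(image, return_tensors="pt"), max_length=40)
description = blip_proc.decode(out[0], skip_special_tokens=True)

# Medium-level: top-K object labels ranked by detection confidence (K = 36).
detector = YOLO("yolov9c.pt")
boxes = detector(image)[0].boxes
top_k = boxes.conf.argsort(descending=True)[:36]
objects = [detector.names[int(boxes.cls[i])] for i in top_k]

# High-level: ViT patch embeddings; a 32-pixel-patch ViT on a 224x224 input
# gives 49 patches + [CLS] = 50 tokens of width 768, matching the paper.
vit_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
feats = vit(**vit_proc(images=image, return_tensors="pt")).last_hidden_state

# Learnable linear projection from 50 to 256 visual tokens (E_v: 1x256x768).
project = torch.nn.Linear(50, 256)
E_v = project(feats.transpose(1, 2)).transpose(1, 2)
```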

External knowledge retrieval: Query ConceptNet for one-hop neighboring concepts and relevance scores for tokens from the caption (C), BLIP description (D), and object labels (O), excluding stopwords. Extract (concept, relevance) pairs per token.
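
A sketch of this retrieval step against ConceptNet 5's public REST API, which exposes one-hop edges with weights usable as relevance scores; the stopword subset, query limit, and example caption are illustrative, not the paper's settings.

```python
# One-hop ConceptNet lookup returning (concept, relevance) pairs per token.
import requests

STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "and", "that"}  # toy subset

def concepts_for(token: str, limit: int = 5):
    """Fetch one-hop neighbours and edge weights (used as relevance scores)."""
    url = f"http://api.conceptnet.io/c/en/{token.lower()}"
    edges = requests.get(url, params={"limit": limit}).json().get("edges", [])
    pairs = []
    for e in edges:
        # Keep whichever endpoint is not the query concept itself.
        other = e["end"] if token.lower() in e["start"]["label"].lower() else e["start"]
        pairs.append((other["label"], e["weight"]))
    return pairs

caption = "what a thrilling meeting that was"           # hypothetical caption
tokens = [t for t in caption.split() if t not in STOPWORDS]
knowledge = {t: concepts_for(t) for t in tokens}        # token -> (concept, score)
```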

Knowledge enrichment: Concatenate the original sequences with their corresponding external knowledge concepts while preserving token order: T_knowledge = C + CC + D + DC + O + OC, where CC, DC, and OC denote the ConceptNet concepts retrieved for the caption, description, and object tokens, respectively. The ordering constraints maintain alignment between tokens and their concepts.

Knowledge graph construction: Build an undirected, weighted graph G over enriched tokens: unit-weight edges between consecutive caption and description tokens; edges between tokens and their concepts weighted by relevance scores; object tokens linked only to their concepts (no object-token adjacency due to lack of syntax). This graph encodes non-sequential inter-token relations and knowledge relevance.
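
A sketch of this adjacency construction under an assumed node indexing; the self-loops follow the usual GCN convention and are not stated in the summary.

```python
import numpy as np

def build_graph(n_nodes, sequential_spans, concept_edges):
    """Weighted, undirected adjacency over the enriched token sequence.

    sequential_spans: (start, end) index ranges of caption and description
        tokens, whose consecutive tokens get unit-weight edges.
    concept_edges: (token_idx, concept_idx, relevance) triples linking any
        token (caption, description, or object) to its retrieved concepts.
    Object tokens appear only in concept_edges, so they share no
    token-to-token edges, mirroring their lack of syntax.
    """
    A = np.zeros((n_nodes, n_nodes))
    for start, end in sequential_spans:
        for i in range(start, end - 1):
            A[i, i + 1] = A[i + 1, i] = 1.0        # unit sequential edges
    for t, c, w in concept_edges:
        A[t, c] = A[c, t] = w                      # relevance-weighted edges
    np.fill_diagonal(A, 1.0)                       # self-loops (GCN convention)
    return A
```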

Target incorporation: Concatenate the target of sarcasm (TS) to the enriched sequence using BART's separator token: T_concat = T_knowledge + </s> + TS. BART encodes this into contextual text embeddings (E_t) of size N×768 (N=256 via padding/truncation).
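
A minimal sketch of this step with Hugging Face BART; the sample strings are hypothetical, and </s> as the separator follows BART's standard special tokens.

```python
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

t_knowledge = "thrilling meeting exciting event gathering"   # hypothetical enriched text
target = "the meeting"                                       # hypothetical TS

t_concat = t_knowledge + " </s> " + target                   # T_knowledge + </s> + TS
enc = tokenizer(t_concat, max_length=256, padding="max_length",
                truncation=True, return_tensors="pt")
with torch.no_grad():
    E_t = model.encoder(**enc).last_hidden_state             # (1, 256, 768)
```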

Sarcasm reasoning via GCN: Apply L GCN layers over the knowledge graph to derive salient semantic node features. Each layer uses normalized adjacency and a learnable weight matrix with non-linear activation. H_0 = E_t, H_L is the final GCN output capturing reasoning-informed textual features.
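
The recipe described here (normalized adjacency, learnable weight matrix, non-linear activation) matches the standard Kipf-Welling GCN layer; the sketch below uses stand-in inputs and an assumed depth of L = 2, since the summary does not state L.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # learnable weight matrix

    def forward(self, H, A):
        # Symmetric normalization D^{-1/2} A D^{-1/2} of the adjacency.
        d_inv_sqrt = A.sum(-1).clamp(min=1e-6).pow(-0.5)
        A_norm = d_inv_sqrt.unsqueeze(-1) * A * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(A_norm @ self.W(H))      # non-linear propagation

E_t = torch.randn(256, 768)                        # stand-in for BART output
A = torch.eye(256)                                 # stand-in adjacency
layers = nn.ModuleList([GCNLayer() for _ in range(2)])   # L = 2 is assumed
H = E_t                                            # H_0 = E_t
for layer in layers:
    H = layer(H, A)                                # H_L after the final layer
```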

Shared fusion mechanism: Perform self-attention separately on the visual (E_v) and textual (E_t) embeddings to capture intra-modal relationships, producing A_v and A_t. Cross-modal feature exchange is achieved via element-wise modulation: F_vt = A_v ⊙ E_t and F_tv = A_t ⊙ E_v. A gated fusion balances contributions from multimodal and unimodal sources, computing gating weights G_v and G_t via sigmoid functions over linear projections of E_v and E_t. Four fused matrices are constructed: F_1 and F_2 combine F_tv and F_vt through the gates, F_v merges E_v with F_tv, and F_t merges E_t with F_vt. The shared fusion output F_SF is a weighted sum with learnable coefficients (α_1, α_2, β_1, β_2), enabling dynamic control over modality contributions per sample.
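
The summary names the ingredients of shared fusion (self-attention, element-wise modulation, sigmoid gates, four fused matrices, learnable coefficients) but not the exact merge operators, so the gated convex combinations in this sketch are an assumption; it fixes only shapes and data flow.

```python
import torch
import torch.nn as nn

class SharedFusion(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_v = nn.Linear(dim, dim)
        self.gate_t = nn.Linear(dim, dim)
        self.coef = nn.Parameter(torch.ones(4))   # alpha_1, alpha_2, beta_1, beta_2

    def forward(self, E_v, E_t):
        A_v, _ = self.attn_v(E_v, E_v, E_v)       # intra-modal self-attention
        A_t, _ = self.attn_t(E_t, E_t, E_t)
        F_vt = A_v * E_t                          # cross-modal modulation
        F_tv = A_t * E_v
        G_v = torch.sigmoid(self.gate_v(E_v))     # sigmoid gates
        G_t = torch.sigmoid(self.gate_t(E_t))
        F_1 = G_v * F_tv + (1 - G_v) * F_vt       # gated cross-modal mixes (assumed form)
        F_2 = G_t * F_vt + (1 - G_t) * F_tv
        F_v = G_v * E_v + (1 - G_v) * F_tv        # unimodal-plus-cross merges
        F_t = G_t * E_t + (1 - G_t) * F_vt
        a1, a2, b1, b2 = self.coef
        return a1 * F_1 + a2 * F_2 + b1 * F_v + b2 * F_t   # F_SF

F_SF = SharedFusion()(torch.randn(1, 256, 768), torch.randn(1, 256, 768))
```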

Explanation generation: Sum the GCN output and shared fusion features Z = H_L + F_SF, then fine-tune BART to generate explanations autoregressively.
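
A sketch of the decoding step, assuming the fused features Z can be handed to BART as precomputed encoder states; H_L and F_SF are random stand-ins for the outputs of the previous sketches, and the beam settings are illustrative.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers.modeling_outputs import BaseModelOutput

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

H_L = torch.randn(1, 256, 768)    # stand-in: GCN reasoning features
F_SF = torch.randn(1, 256, 768)   # stand-in: shared fusion features
Z = H_L + F_SF                    # element-wise sum, as described above

# Feed Z as precomputed encoder states and decode autoregressively.
ids = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=Z),
                     max_length=64, num_beams=4)
explanation = tokenizer.decode(ids[0], skip_special_tokens=True)
```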

Training details: TURBO is built on BART-base (feature dim 768). Visual embeddings 50×768 projected to 256×768; text sequence length N=256; K=36 objects. Optimizer: AdamW; learning rates 1e-3 (GCN) and 1e-4 (BART); epochs=20; batch size=16; hardware: Ubuntu, Tesla V100-PCIE-32GB GPU, ~9GB RAM.
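
A minimal sketch of the reported two-rate optimization, with placeholder modules standing in for the GCN and BART parameter groups.

```python
import torch
import torch.nn as nn

gcn = nn.Linear(768, 768)    # placeholder for the GCN stack's parameters
bart = nn.Linear(768, 768)   # placeholder for BART's parameters

optimizer = torch.optim.AdamW([
    {"params": gcn.parameters(), "lr": 1e-3},    # GCN learning rate
    {"params": bart.parameters(), "lr": 1e-4},   # BART learning rate
])
# Training runs for 20 epochs with batch size 16, as reported.
```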

MLLM baselines: Evaluate GPT-4o Mini, LLaVA-Mistral, and LLaVA-Llama-3 in zero-shot and one-shot settings for comparison.

Key Findings

Comparative performance on MORE+: TURBO surpasses baselines and the prior SOTA TEAM across all metrics.

  • BLEU: B1 57.09 vs TEAM 55.32 (+1.77%), B2 46.93 vs 45.12 (+1.81%), B3 40.28 vs 38.27 (+2.01%), B4 35.26 vs 33.16 (+2.10%); average BLEU margin +1.92%.
  • ROUGE: RL 53.12 vs 50.58 (+2.54%), R1 55.06 vs 51.72 (+3.34%), R2 38.16 vs 34.96 (+3.20%); average margin +3.33%.
  • METEOR: 55.17 vs 50.95 (+4.22%).
  • BERTScore: Precision 92.00 vs 91.80 (+0.20%), Recall 91.77 vs 91.60 (+0.17%), F1 91.86 vs 91.70 (+0.16%); average +0.18%.
  • SentBERT: 75.75 vs 72.92 (+2.83%).

MLLM comparison: Against GPT-4o Mini, LLaVA-Mistral, and LLaVA-Llama-3 in zero- and one-shot settings, TURBO outperforms all of them on automatic metrics, though standard metrics can under-represent explanation quality.

Ablation study: Removing components reduces performance, and variants without the target of sarcasm consistently underperform their counterparts, confirming the utility of TS. TURBO + TS Concepts (adding external concepts for TS) performs slightly worse than TURBO but still surpasses TEAM on most metrics. TURBO-KG, TURBO-SF, and their TS-removed versions show graded degradations, evidencing each module's contribution.

Human evaluation (20 samples × 20 evaluators; 5-point Likert): TURBO improves over TEAM by +4.40% in Fluency, +8.40% in Semantic Accuracy, +6.80% in Negative Connotation, and +10.60% in Target Presence. TURBO is comparable to LLaVA-Llama-3 (lower in Fluency by 1.40%) and trails LLaVA-Mistral and GPT-4o Mini by ~4.00% and ~7.75% on average, respectively, despite TURBO's ~234M parameters versus 7-8B for the MLLMs.

Discussion

The findings validate the central hypothesis that explicitly modeling the target of sarcasm and jointly learning cross- and intra-modal relationships improves explanation quality for multimodal sarcasm. TURBO’s shared fusion enables dynamic weighting of unimodal and cross-modal features to handle varied sample dependencies on image or text, while relevance-aware knowledge infusion via a graph and GCN enhances reasoning about incongruity. TURBO’s consistent gains over TEAM across BLEU, ROUGE, METEOR, BERTScore, and SentBERT, combined with human evaluation improvements in semantic accuracy and target presence, demonstrate more faithful capture of ironic intent. Comparisons to large multimodal LLMs show TURBO achieves competitive quality with far fewer parameters, suggesting that task-specific architectures can rival general-purpose models on nuanced multimodal reasoning. Error analyses highlight avenues to further strengthen TURBO—particularly context-sensitive knowledge retrieval and improved handling of OCR-rich images—reinforcing the importance of precise visual-text alignment and world knowledge selection in sarcasm explanation.

Conclusion

The paper introduces TURBO, a target-augmented shared fusion framework that addresses key limitations of prior work in multimodal sarcasm explanation by (i) integrating relevance-weighted external knowledge, (ii) enabling dynamic cross/intra-modal fusion, and (iii) guiding generation via explicit sarcasm targets. The authors extend the MORE dataset to MORE+ with target annotations and demonstrate state-of-the-art performance over TEAM and strong competitiveness against multimodal LLMs in both automatic metrics and human evaluation. Future directions include designing context-aware knowledge retrieval mechanisms, more specialized integration of target annotations (e.g., target-aware fusion of unimodal/multimodal features), improved extraction of OCR-derived text cues, and building a pipeline to automatically generate targets of sarcasm to reduce annotation burdens while ensuring target quality.

Limitations

External knowledge concepts are extracted deterministically, ignoring sample context, which can yield irrelevant or missing concepts that misguide reasoning. The target of sarcasm is currently incorporated by simple concatenation with text tokens; more specialized, target-aware fusion could further amplify salient features. The reliance on manually annotated targets adds overhead for dataset extensions; an auxiliary model could generate targets automatically, though this introduces risk if target quality is poor. Error analysis also reveals challenges with insufficient OCR feature extraction and occasionally irrelevant image descriptions that can impair explanation accuracy.
