Zero-shot visual reasoning through probabilistic analogical mapping

Psychology

T. Webb, S. Fu, et al.

Discover how human-like reasoning in visual context can outperform current algorithms. The innovative visiPAM model, developed by Taylor Webb, Shuhao Fu, Trevor Bihl, Keith J. Holyoak, and Hongjing Lu, showcases remarkable performance on analogical mapping tasks, closely resembling human capabilities.
Introduction

The study investigates how humans can perform highly abstract visual analogies across disparate categories without direct task-specific training, and seeks computational principles to enable zero-shot analogical reasoning in machines. Traditional cognitive models of analogy posit structured representations with bindings between entities and relations, and mapping based on similarity, but typically rely on hand-crafted symbolic representations that do not explain how such structure is derived from raw visual inputs. Conversely, many deep learning approaches attempt to solve visual analogy tasks end-to-end from pixels, often requiring large training sets and showing limited generalization to novel content. The paper proposes a synthesis—visiPAM—that derives structured visual representations from naturalistic inputs and applies a probabilistic, similarity-based analogical mapping mechanism, aiming to achieve zero-shot visual reasoning comparable to human capabilities.

Literature Review

Two main traditions are reviewed. (1) Cognitive science on analogical reasoning emphasizes structured representations and similarity-based mapping (source–target correspondences) but generally assumes hand-designed, symbolic inputs; extracting relations from non-relational visual inputs remains a challenge. (2) Deep learning approaches to visual analogies train end-to-end on large datasets (sometimes >1M analogy problems), often lacking robust out-of-distribution generalization. Recent models reduce data needs and add relational bottlenecks, but still rely on task-specific training and do not achieve zero-shot reasoning. Prior work such as SSMN tackled part-matching with direct supervision, while other neurosymbolic and SME-based methods operate on simplified or hand-extracted symbols, limiting applicability to complex natural images. This context motivates combining learned visual embeddings with principled analogical mapping to enable zero-shot performance.

Methodology

Model: visiPAM consists of (1) a vision module that constructs structured, attributed graphs from visual inputs and (2) a reasoning module that performs Probabilistic Analogical Mapping (PAM) to find correspondences between source and target graphs. Nodes represent object parts with learned embeddings; directed edges represent spatial relations with learned or engineered edge attributes.

Visual representations: For 2D images, node attributes are extracted with iBOT (a self-supervised ViT-L/16 masked-image-modeling model pretrained on ImageNet), which produces 1024-dimensional patch embeddings interpolated at part coordinates. For 3D point clouds, node attributes are derived from a pretrained DGCNN (trained on ShapeNetPart part segmentation, covering man-made objects only). Approximately 2,000 points per object are embedded (64-dimensional features from the third EdgeConv layer); k-means++ clusters the points into eight part-like clusters, and node attributes are the cluster-mean embeddings.

Edge embeddings: Spatial relations are encoded with angular and relative-location features. In 2D, r_theta uses cosines of angle differences relative to the object centroid and inter-part directions, and r_d concatenates the coordinate range with the magnitude of the pairwise vector difference. For 3D, analogous 3D versions are used. Topological connectivity features can be added as an augmentation.

Reasoning module (PAM): Given a source graph G and a target graph G′, an N×N mapping matrix M is inferred via a Bayesian formulation that maximizes the similarity of mapped nodes and edges under a soft isomorphism prior. The log-likelihood sums cosine similarities over mapped nodes and edges, balanced by a parameter α that sets the relative contribution of node versus edge similarity (α = 0.9 in the main experiments; ablations use α = 1, nodes only, or α = 0, edges only). A prior favoring one-to-one correspondences is implemented via an entropy-like term with strength β.
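As a concrete illustration, the α-weighted similarity objective can be sketched as a small scoring function. This is a minimal reconstruction from the description above; the function name, array shapes, and the exact quadratic form of the edge term are our assumptions, not the authors' implementation.

```python
import numpy as np

def pam_log_likelihood(M, node_src, node_tgt, edge_src, edge_tgt, alpha=0.9):
    """Score a soft mapping M (n_src x n_tgt): alpha-weighted cosine similarity
    of mapped nodes plus (1 - alpha)-weighted cosine similarity of mapped edges.
    node_*: (n, d_node) part embeddings; edge_*: (n, n, d_edge) relation embeddings.
    Shapes and the quadratic edge term are illustrative assumptions."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    n_src, n_tgt = M.shape
    # Node term: similarity of each mapped source/target part pair.
    node_term = sum(M[i, j] * cos(node_src[i], node_tgt[j])
                    for i in range(n_src) for j in range(n_tgt))
    # Edge term: similarity of relations carried along by the mapping.
    edge_term = sum(M[i, j] * M[k, l] * cos(edge_src[i, k], edge_tgt[j, l])
                    for i in range(n_src) for k in range(n_src) if i != k
                    for j in range(n_tgt) for l in range(n_tgt) if j != l)
    return alpha * node_term + (1 - alpha) * edge_term
```

With α = 0.9 the node term dominates, matching the main-experiment setting; α = 1 or α = 0 recovers the nodes-only and edges-only ablations.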
Inference: PAM uses a graduated assignment algorithm: M is initialized uniformly, compatibilities based on node and edge similarities are updated iteratively with bistochastic normalization, and β is gradually increased so that the mapping approaches a one-to-one assignment (500 iterations for the 2D experiments; 200 for 3D).

Tasks and datasets: (1) 2D part matching: evaluation on the test-only PPM dataset (cats, horses) from Choi et al., with 10 parts per image, performed zero-shot (no training on part matching). Within-category vehicle (cars, planes) and between-category animal (cat→horse) subsets are also constructed. (2) 3D analogies: ShapeNetPart chairs and Unreal Engine animal models are used to create 192 image pairs spanning same versus different superordinate categories.

Human experiment: Participants move colored markers on target images to the locations analogous to source markers; variability and agreement are analyzed via distances and repeated-measures ANOVA.

Model-to-human comparison: visiPAM maps clusters between source and target point clouds, locates the target marker by minimizing a combined local/global/feature distance discrepancy, and projects the result to 2D for comparison with mean human placements.
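The graduated-assignment loop can be sketched as follows. The exponentiated-compatibility update, the inner Sinkhorn iteration count, and the annealing schedule for β are our assumptions about details the summary leaves open; treat this as a sketch, not the authors' code.

```python
import numpy as np

def graduated_assignment(node_sim, edge_sim_fn, n_iter=200,
                         beta0=0.1, beta_growth=1.05):
    """Sketch of graduated assignment: start from a uniform mapping,
    repeatedly exponentiate node + edge compatibilities, apply bistochastic
    (Sinkhorn) normalization, and anneal beta toward a one-to-one mapping.
    node_sim: (n, n) node cosine similarities; edge_sim_fn(M): (n, n) edge term."""
    n = node_sim.shape[0]
    M = np.full((n, n), 1.0 / n)                  # uniform initialization
    beta = beta0
    for _ in range(n_iter):
        Q = node_sim + edge_sim_fn(M)             # combined compatibility
        # Row-max subtracted for numerical stability at large beta.
        M = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
        for _ in range(10):                       # bistochastic normalization
            M = M / (M.sum(axis=1, keepdims=True) + 1e-12)
            M = M / (M.sum(axis=0, keepdims=True) + 1e-12)
        beta *= beta_growth                       # sharpen toward one-to-one
    return M
```

Rows and columns are normalized alternately so M stays approximately bistochastic; as β grows, the soft mapping hardens toward a permutation-like assignment.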

Key Findings

2D part matching: visiPAM outperforms SSMN (which was trained on 37,330 problems) despite operating zero-shot. Accuracy (chance-normalized in parentheses): within-category animals 63.2% (59.1%); within-category vehicles 69.5% (58.8%); between-category animals 67.9% (59.9%). SSMN on within-category animals: 46.6% (40.7%), so visiPAM achieves roughly a 30% relative reduction in error rate. Ablations: nodes-only (α = 1) and edges-only (α = 0) both reduce performance (e.g., animals nodes-only 55.5%, edges-only 47.8%), showing that node and edge similarity are both important. The angular and relative-location edge components both contribute, and adding topology further improves performance. Performance is unaffected by horizontal reflection of target images. Typical errors involve left-right confusions of lateralized parts; mapping planes is particularly challenging (lowest within-category accuracy, 42.5%).

3D analogies and human comparison: Human participants show low variance in same-category mappings and higher variance in different-category mappings. Repeated-measures ANOVA: main effect of category consistency, F(1, 40) = 625.37, p < 0.0001; interaction with target category (larger effect for animal targets), F(1, 40) = 19.29, p < 0.0001; no main effect of target category. visiPAM reproduces these qualitative patterns: larger deviation from the human mean in different- versus same-category conditions, with a larger effect when targets are animals. Average distance from the human mean: humans ≈ 20 px, visiPAM ≈ 25 px (object sizes roughly 213 px high by 135 px wide). The item-level correlation between visiPAM and human distances is r = 0.70; ablations reduce alignment with human behavior (nodes-only r = 0.61, edges-only r = 0.60).
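The reported ~30% relative reduction in error rate follows directly from the within-category animal accuracies (63.2% vs. 46.6%). A quick check, with a helper function name of our own choosing:

```python
def relative_error_reduction(acc_new, acc_baseline):
    """Relative reduction in error rate when moving from a baseline model's
    accuracy to a new model's accuracy (both given as fractions in [0, 1])."""
    err_new = 1.0 - acc_new
    err_base = 1.0 - acc_baseline
    return (err_base - err_new) / err_base

# Within-category animals: visiPAM 63.2% vs. SSMN 46.6% (raw accuracies above)
reduction = relative_error_reduction(0.632, 0.466)  # ~0.31, i.e. about 30%
```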

Discussion

Findings demonstrate that combining rich learned visual embeddings with a probabilistic, similarity-based analogical mapping mechanism enables zero-shot visual reasoning that can outperform task-trained deep models and align closely with human mapping behavior. VisiPAM leverages both object-level (node) and relational (edge) similarities; ablation results show both constraints are necessary, reflecting how human analogical reasoning benefits from multiple similarity cues. The model generalizes across within- and between-category analogies and captures human variability patterns in 3D cross-category mappings, indicating that the proposed synthesis addresses the challenge of mapping structured relations from naturalistic, high-dimensional inputs without direct training on the target task.

Conclusion

The paper introduces visiPAM, a framework that extracts structured visual graphs from 2D images or 3D point-clouds and performs probabilistic analogical mapping to enable zero-shot visual reasoning. Empirically, visiPAM surpasses a state-of-the-art end-to-end deep learning model on 2D part-matching without task-specific training and reproduces key qualitative and quantitative aspects of human analogical mappings in 3D. The work illustrates a promising pathway for integrating representation learning with cognitively inspired reasoning. Future directions include deriving 3D/topological structure directly from 2D inputs, incorporating semantic/functional knowledge (e.g., word or multimodal embeddings), adding top-down feedback to modulate perception during mapping, and extending the framework to multi-example schema induction.

Limitations

Limitations include: (1) reliance on 3D point-cloud inputs or pre-specified part coordinates; no current method to infer full 3D/topological structure directly from 2D natural images; (2) difficulties with objects exhibiting highly variable 3D pose/view (e.g., planes) when only 2D spatial relations are used; (3) common left–right confusions for lateralized parts; (4) dependence on pretrained visual encoders (iBOT, DGCNN) trained on datasets that may not cover all target domains (e.g., DGCNN trained only on man-made objects); and (5) current bottom-up pipeline lacks top-down reasoning feedback to refine visual representations during mapping.
