Psychology
Zero-shot visual reasoning through probabilistic analogical mapping
T. Webb, S. Fu, et al.
The study investigates how humans can perform highly abstract visual analogies across disparate categories without direct task-specific training, and seeks computational principles to enable zero-shot analogical reasoning in machines. Traditional cognitive models of analogy posit structured representations with bindings between entities and relations, and mapping based on similarity, but typically rely on hand-crafted symbolic representations that do not explain how such structure is derived from raw visual inputs. Conversely, many deep learning approaches attempt to solve visual analogy tasks end-to-end from pixels, often requiring large training sets and showing limited generalization to novel content. The paper proposes a synthesis, visiPAM, that derives structured visual representations from naturalistic inputs and applies a probabilistic, similarity-based analogical mapping mechanism, aiming to achieve zero-shot visual reasoning comparable to human capabilities.
Two main traditions are reviewed. (1) Cognitive science on analogical reasoning emphasizes structured representations and similarity-based mapping (source–target correspondences) but generally assumes hand-designed, symbolic inputs; extracting relations from non-relational visual inputs remains a challenge. (2) Deep learning approaches to visual analogies train end-to-end on large datasets (sometimes >1M analogy problems), often lacking robust out-of-distribution generalization. Recent models reduce data needs and add relational bottlenecks, but still rely on task-specific training and do not achieve zero-shot reasoning. Prior work such as SSMN tackled part-matching with direct supervision, while other neurosymbolic and SME-based methods operate on simplified or hand-extracted symbols, limiting applicability to complex natural images. This context motivates combining learned visual embeddings with principled analogical mapping to enable zero-shot performance.
Model: visiPAM consists of (1) a vision module that constructs structured, attributed graphs from visual inputs and (2) a reasoning module that performs Probabilistic Analogical Mapping (PAM) to find correspondences between source and target graphs. Nodes represent object parts with learned embeddings; directed edges represent spatial relations with learned or engineered edge attributes.

Visual representations: For 2D images, node attributes are extracted using iBOT (a self-supervised ViT-L/16 masked-image-modeling model pretrained on ImageNet), producing 1024-d patch embeddings that are interpolated at part coordinates. For 3D point-clouds, node attributes are derived from a pre-trained DGCNN (trained on ShapeNetPart part segmentation for man-made objects only). Approximately 2000 points per object are embedded (64-d, taken from the third EdgeConv layer); KMeans++ clusters the points into eight part-like clusters, and node attributes are the cluster-mean embeddings.

Edge embeddings: Spatial relations are encoded using angular and relative-location features. In 2D, r_theta uses cosines of angle differences relative to the object centroid and inter-part directions; r_d concatenates the coordinate range and the magnitude of the pairwise vector difference. For 3D, analogous 3D versions are used. Topological connectivity features can be added as an augmentation.

Reasoning module (PAM): Given a source graph G and a target graph G′, a mapping matrix M (N×N) is inferred via a Bayesian formulation that maximizes the similarity of mapped nodes and edges under a soft-isomorphism prior. The log-likelihood sums cosine similarities over mapped edges and nodes, balanced by a parameter α that controls the relative node vs. edge contributions (α = 0.9 in the main experiments; ablations set α = 1 or 0). A prior favoring one-to-one correspondences is implemented via an entropy-like term with strength β.
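The point-cloud node construction described above (embed ~2000 points with DGCNN, cluster the embeddings with KMeans++, take cluster means as node attributes) can be sketched as follows. This is a minimal, self-contained k-means stand-in, not the paper's code; the function names and iteration counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_pp_init(X, k, rng):
    # k-means++ seeding: pick each new center with probability
    # proportional to squared distance from the nearest chosen center.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.stack(centers)

def cluster_node_attributes(X, k=8, iters=20, rng=rng):
    """X: (num_points, dim) per-point embeddings (e.g. ~2000 x 64 DGCNN
    features). Returns (k, dim) cluster-mean embeddings used as the
    part-like node attributes. Minimal k-means sketch."""
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute centers as cluster means (these become node attributes)
        centers = np.stack([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers
```

In the actual pipeline the input would be DGCNN EdgeConv features rather than raw coordinates; the clustering step itself is the same idea.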
Inference uses a graduated assignment algorithm: initialize M uniformly, iteratively update compatibilities based on node and edge similarities with bistochastic normalization, and gradually increase β to approach a one-to-one mapping (500 iterations for the 2D experiments; 200 for 3D).

Tasks and datasets: (1) 2D part-matching: evaluated zero-shot (no training on part matching) on the test-only PPM dataset (cats, horses) from Choi et al., with 10 parts per image; within-category vehicle (cars, planes) and between-category animal (cat→horse) subsets are also constructed. (2) 3D analogies: ShapeNetPart chairs and Unreal Engine animal models are used to create 192 image pairs spanning same vs. different superordinate categories.

Human experiment: participants move colored markers on target images to the locations analogous to source markers; variability and agreement are analyzed via distances and repeated-measures ANOVA.

Model-to-human comparison: visiPAM maps clusters between the source and target point-clouds, locates the target marker by minimizing a combined local/global/feature distance discrepancy, and projects the result to 2D for comparison with mean human placements.
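The graduated assignment loop (uniform initialization, α-weighted node/edge compatibility, softassign with bistochastic normalization, annealed β) can be sketched as below. The annealing schedule, iteration counts, and the exact form of the compatibility term are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sinkhorn(M, iters=30):
    # Alternate row and column normalization toward a doubly stochastic matrix.
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

def graduated_assignment(node_sim, edge_sim, alpha=0.9,
                         beta0=0.1, beta_growth=1.075, steps=60):
    """node_sim: (N, N) node similarities; edge_sim[i, k, j, l]: similarity
    between source edge (i, k) and target edge (j, l). Returns a soft
    mapping matrix M. Parameter names/values here are illustrative."""
    N = node_sim.shape[0]
    M = np.full((N, N), 1.0 / N)          # uniform initialization
    beta = beta0
    for _ in range(steps):
        # Compatibility: alpha-weighted node term plus edge term under current M.
        Q = alpha * node_sim + (1 - alpha) * np.einsum('ikjl,kl->ij', edge_sim, M)
        # Softassign (exp of scaled compatibility), then bistochastic normalization.
        M = sinkhorn(np.exp(beta * (Q - Q.max())))
        beta *= beta_growth               # anneal toward a one-to-one mapping
    return M
```

As β grows, the softassign sharpens and M approaches a permutation-like (one-to-one) mapping, which is the role of the entropy-strength parameter described above.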
2D part-matching: VisiPAM outperforms SSMN (which was trained on 37,330 problems) despite operating zero-shot. Accuracy (chance-normalized accuracy in parentheses): within-category animals 63.2% (59.1%); within-category vehicles 69.5% (58.8%); between-category animals 67.9% (59.9%). SSMN on within-category animals: 46.6% (40.7%). This corresponds to roughly a 30% relative reduction in error rate vs. SSMN.

Ablations: nodes-only (α = 1) and edges-only (α = 0) variants reduce performance (e.g., animals: nodes-only 55.5%, edges-only 47.8%), showing that both node and edge similarity are important. Both edge-embedding components (angular and relative location) contribute, and adding topology further improves performance.

Robustness and errors: performance is unaffected by horizontal reflection of target images. Typical errors involve left–right confusions of lateralized parts; mapping planes is particularly challenging (lowest within-category accuracy, 42.5%).

3D analogies and human comparison: Human participants show low variance for same-category mappings and higher variance for different-category mappings. Repeated-measures ANOVA: main effect of category consistency, F(1, 40) = 625.37, p < 0.0001; interaction with target category (larger effect for animal targets), F(1, 40) = 19.29, p < 0.0001; no main effect of target category. VisiPAM reproduces these qualitative patterns: larger deviation from the human mean in different- vs. same-category conditions, with a larger effect when the targets are animals. Average distance from the human mean: humans ≈ 20 px; visiPAM ≈ 25 px (object sizes ≈ 213 px height, 135 px width). Item-level correlation between visiPAM and human distances: r = 0.70. Ablations reduce alignment with human behavior (nodes-only r = 0.61; edges-only r = 0.60).
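The "about 30% relative reduction in error rate" claim follows from the within-category animal accuracies reported above; a quick arithmetic check:

```python
# Within-category animals: visiPAM 63.2% vs SSMN 46.6% accuracy (from the text).
visipam_err = 1 - 0.632          # visiPAM error rate: 0.368
ssmn_err = 1 - 0.466             # SSMN error rate: 0.534
rel_reduction = (ssmn_err - visipam_err) / ssmn_err
print(round(rel_reduction, 3))   # prints 0.311, i.e. ~31% relative error reduction
```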
Findings demonstrate that combining rich learned visual embeddings with a probabilistic, similarity-based analogical mapping mechanism enables zero-shot visual reasoning that can outperform task-trained deep models and align closely with human mapping behavior. VisiPAM leverages both object-level (node) and relational (edge) similarities; ablation results show both constraints are necessary, reflecting how human analogical reasoning benefits from multiple similarity cues. The model generalizes across within- and between-category analogies and captures human variability patterns in 3D cross-category mappings, indicating that the proposed synthesis addresses the challenge of mapping structured relations from naturalistic, high-dimensional inputs without direct training on the target task.
The paper introduces visiPAM, a framework that extracts structured visual graphs from 2D images or 3D point-clouds and performs probabilistic analogical mapping to enable zero-shot visual reasoning. Empirically, visiPAM surpasses a state-of-the-art end-to-end deep learning model on 2D part-matching without task-specific training and reproduces key qualitative and quantitative aspects of human analogical mappings in 3D. The work illustrates a promising pathway for integrating representation learning with cognitively inspired reasoning. Future directions include deriving 3D/topological structure directly from 2D inputs, incorporating semantic/functional knowledge (e.g., word or multimodal embeddings), adding top-down feedback to modulate perception during mapping, and extending the framework to multi-example schema induction.
Limitations include: (1) reliance on 3D point-cloud inputs or pre-specified part coordinates; no current method to infer full 3D/topological structure directly from 2D natural images; (2) difficulties with objects exhibiting highly variable 3D pose/view (e.g., planes) when only 2D spatial relations are used; (3) common left–right confusions for lateralized parts; (4) dependence on pretrained visual encoders (iBOT, DGCNN) trained on datasets that may not cover all target domains (e.g., DGCNN trained only on man-made objects); and (5) current bottom-up pipeline lacks top-down reasoning feedback to refine visual representations during mapping.