On the visual analytic intelligence of neural networks

Computer Science

S. Woźniak, H. Jónsson, et al.

This research examines the visual analytic intelligence of neural networks, comparing a biologically inspired system with a traditional relational network architecture. Conducted by Stanisław Woźniak, Hlynur Jónsson, Giovanni Cherubini, Angeliki Pantazi, and Evangelos Eleftheriou, the study shows that the biologically inspired network not only achieves higher accuracy but also learns faster with fewer parameters.
Introduction

The study addresses whether biologically inspired neural architectures can match the visual analytic reasoning of conventional relation-network-based approaches while being more efficient. Abstract reasoning is a hallmark of human intelligence, yet recent deep learning successes often rely on large, energy-intensive models. The visual oddity task, originally used to probe core geometrical concepts in humans, serves as a challenging benchmark requiring inference of relations such as symmetry, topology, and transformations. Prior neural approaches (e.g., Relation Networks and WReN) and cognitive models have tackled related reasoning tasks, but they either operate with large, non-biological architectures or rely on predefined geometric abstractions. Motivated by the efficiency and temporal dynamics of the brain, particularly saccadic eye movements and neocortical neuron behavior, the authors aim to develop (1) an Oddity Relation Network (OReN) adapted to the oddity task, and (2) a biologically inspired saccadic neural network using neuron models with leaky integrate-and-fire-like dynamics. The purpose is to compare their reasoning capability, efficiency, and mechanisms, and to explore whether temporal dynamics combined with saccadic input can serve as a computational primitive for relational reasoning.

Literature Review

The paper situates its work within several strands: (a) Relation Networks (RN) that compute pairwise relations over embeddings and have excelled on CLEVR; WReN extends RN for Raven's Progressive Matrices by inferring relations among context and candidate panels. Recurrent RN modules have been used for multi-step relational problems like Sudoku, and theoretical analyses have linked neural computations to algorithmic reasoning. (b) The visual oddity task (Dehaene et al., 2006) demonstrated cross-cultural access to core geometric knowledge, with differences emerging for more complex transformations. (c) Prior computational models (Lovett et al.) relied on glyph-based representations and structure-mapping engines grounded in predefined geometric concepts to find commonality and identify oddities. (d) Saccadic eye movements have been modeled for predicting human scanpaths and for reducing input volume, but less as a computational primitive for reasoning. (e) Spiking Neural Networks (SNNs) and LIF neuron models offer efficient, temporally dynamic processing well-suited to sequential inputs. The authors leverage these insights to contrast a data-driven RN-based solution without prior geometric symbols against a biologically inspired saccadic, stateful architecture.

Methodology

Dataset: The authors procedurally generated a large visual oddity dataset corresponding to the 45 riddles from the original human study. Each sample comprises six 100×100 8-bit grayscale frames with randomized but controlled attributes; five frames exhibit a specific geometric concept and one violates it (the oddity). Background grayscale values are sampled in [235,255] and figure elements in [0,61]. Generated frames are enforced to be pixel-wise unique. They produced 45 separate datasets (size 3,840 each) and one combined dataset (size 108,000). Splits are 4:1:1 for train:validation:test. For multi-riddle datasets, the riddle index is sampled per sample; for single-riddle datasets, it is fixed.
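A minimal NumPy sketch of how one such sample could be assembled under the stated constraints (six 100×100 grayscale frames, backgrounds in [235,255], figure pixels in [0,61], one oddity, pixel-wise uniqueness). The "centred square" concept inside make_frame is purely illustrative; the actual figure geometry is riddle-specific and not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_frame(odd: bool) -> np.ndarray:
    # One 100x100 8-bit grayscale frame: background in [235, 255], figure in [0, 61].
    frame = np.full((100, 100), rng.integers(235, 256), dtype=np.uint8)
    # Purely illustrative "concept": a centred dark square; the oddity is shifted.
    offset = 10 if odd else 0
    frame[30 + offset:70 + offset, 30:70] = rng.integers(0, 62)
    return frame

def make_sample() -> tuple[np.ndarray, int]:
    # Six frames: five exhibit the concept, one (at a random position) violates it.
    odd_idx = int(rng.integers(6))
    frames, seen = [], set()
    for i in range(6):
        f = make_frame(i == odd_idx)
        while f.tobytes() in seen:   # enforce pixel-wise unique frames, as in the paper
            f = make_frame(i == odd_idx)
        seen.add(f.tobytes())
        frames.append(f)
    return np.stack(frames), odd_idx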

OReN architecture: Each of the six frames passes through a shared 5-layer CNN vision model producing a D-dimensional embedding per frame. For each frame k, ordered pairs are formed by concatenating its embedding with each of the six embeddings, yielding 36 pairs overall. Each pair is processed by g_θ, a four-layer MLP with N ReLU units per layer. For frame k, the six outputs corresponding to pairs with k as the first element are summed. The sum is input to f_ω, a three-layer MLP (two N-unit ReLU layers plus a linear output) producing a score q_k. A softmax over the six scores yields the probability that each frame is the oddity. The CNN vision model is identical between OReN and the saccadic network for fair comparison.
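A minimal PyTorch sketch of the OReN computation described above. The CNN depth, embedding dimension D, and hidden width N used here are stand-ins for illustration, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class OReN(nn.Module):
    def __init__(self, d: int = 64, n: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                      # shared vision model (simplified)
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, d),
        )
        self.g = nn.Sequential(                        # g_theta: four-layer ReLU MLP on pairs
            nn.Linear(2 * d, n), nn.ReLU(),
            nn.Linear(n, n), nn.ReLU(),
            nn.Linear(n, n), nn.ReLU(),
            nn.Linear(n, n), nn.ReLU(),
        )
        self.f = nn.Sequential(                        # f_omega: two ReLU layers + linear score
            nn.Linear(n, n), nn.ReLU(),
            nn.Linear(n, n), nn.ReLU(),
            nn.Linear(n, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 6, 1, 100, 100) -> per-frame embeddings (batch, 6, d)
        b = frames.shape[0]
        e = self.cnn(frames.flatten(0, 1)).view(b, 6, -1)
        # All 36 ordered pairs: embedding of frame k concatenated with each of the six embeddings.
        pairs = torch.cat(
            [e.unsqueeze(2).expand(-1, -1, 6, -1),     # first element of each pair (frame k)
             e.unsqueeze(1).expand(-1, 6, -1, -1)],    # second element of each pair
            dim=-1,
        )
        rel = self.g(pairs).sum(dim=2)                 # sum the six relations per frame k
        scores = self.f(rel).squeeze(-1)               # one score q_k per frame
        return scores.softmax(dim=-1)                  # probability that each frame is the oddity
```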

Saccadic neural network: Input is a temporal sequence simulating saccades over the six frames. To balance exposure, the sequence consists of six random permutations of the six frames (36 steps). At each time step t, the network outputs p(oddity|t) for the current frame. During testing, S_i = 18 initial saccades initialize temporal dynamics and S_e = 18 evaluation saccades are integrated for the final decision.
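A minimal sketch of the saccade-sequence construction and of the test-time integration of the S_e = 18 evaluation outputs. The per-frame averaging used in integrate_decision is one plausible readout rule; the paper's exact integration may differ.

```python
import torch

def saccade_sequence(frames: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # frames: (6, C, H, W). Six random permutations of the six frames -> 36 saccade steps,
    # so each frame is presented exactly six times (balanced exposure).
    order = torch.cat([torch.randperm(6) for _ in range(6)])
    return frames[order], order

def integrate_decision(probs: torch.Tensor, order: torch.Tensor, s_i: int = 18) -> int:
    # probs: (36,) network output p(oddity | t) for the frame shown at step t.
    # The first s_i saccades only initialize the temporal dynamics; the remaining
    # evaluation saccades are integrated (here: averaged per frame) for the decision.
    votes, counts = torch.zeros(6), torch.zeros(6)
    for t in range(s_i, len(probs)):
        votes[order[t]] += probs[t]
        counts[order[t]] += 1
    return int(torch.argmax(votes / counts))
```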

Recurrent units implement biologically inspired dynamics via Spiking Neural Units (SNUs), an abstraction of LIF neurons. With input x_t, internal state s_t (membrane potential), output y_t, input and recurrent weights W, H, leak λ, and bias b (threshold), the layer dynamics are: s_t = g(W x_t + H y_{t-1} + λ s_{t-1} ⊙ (1 − y_{t-1})); y_t = h(s_t + b). Three recurrent layers of N units are followed by a single sigmoid readout neuron. The formulation supports spike-based SNN mode (h as Heaviside), soft SNU mode (h as sigmoid), soft SNU with layer-wise recurrency (sSNU-R), and standard LSTM units for comparison. The same CNN generates D-dimensional embeddings fed to the recurrent stack, along with eye-position inputs.
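A minimal PyTorch sketch of one SNU layer implementing the dynamics above, with g taken as ReLU (an assumption consistent with LIF-like integration) and h switchable between a sigmoid (soft SNU) and a Heaviside step (spiking mode). The leak value and bias initialization are assumptions.

```python
import torch
import torch.nn as nn

class SNULayer(nn.Module):
    # s_t = g(W x_t + H y_{t-1} + lambda * s_{t-1} * (1 - y_{t-1}));  y_t = h(s_t + b)
    def __init__(self, n_in: int, n: int, leak: float = 0.8, soft: bool = True):
        super().__init__()
        self.W = nn.Linear(n_in, n, bias=False)        # input weights W
        self.H = nn.Linear(n, n, bias=False)           # recurrent weights H
        self.b = nn.Parameter(torch.full((n,), -1.0))  # bias b, acting as a firing threshold
        self.leak = leak                               # membrane leak lambda (assumed value)
        self.soft = soft                               # soft SNU (sigmoid h) vs spiking (step h)

    def forward(self, x_t, s_prev, y_prev):
        # Membrane potential: integrate input and recurrent drive; the (1 - y_{t-1})
        # factor resets the state of units that produced an output at the previous step.
        s_t = torch.relu(self.W(x_t) + self.H(y_prev) + self.leak * s_prev * (1.0 - y_prev))
        z = s_t + self.b
        y_t = torch.sigmoid(z) if self.soft else (z > 0).float()
        return s_t, y_t
```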

Training setups: Two regimes were used: (1) separate training—distinct models per riddle; (2) joint training—a single model across all 45 riddles. Model capacity N varied geometrically: separate setup N ∈ [16, 256]; joint setup N ∈ [64, 4096] depending on unit type. Performance was evaluated on the respective held-out test sets. The study also measured epochs required to reach average human-level accuracy under their training data conditions.
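A minimal sketch of the two regimes, reusing the hypothetical OReN class from the earlier sketch. The loaders, optimizer, and epoch counts are placeholders, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs):
    opt = torch.optim.Adam(model.parameters())
    for _ in range(epochs):
        for frames, odd_idx in loader:
            probs = model(frames)                                # (batch, 6) oddity probabilities
            loss = F.nll_loss(torch.log(probs + 1e-9), odd_idx)  # cross-entropy on the softmax output
            opt.zero_grad()
            loss.backward()
            opt.step()

# (1) Separate training: one model per riddle, trained on that riddle's dataset only, e.g.
#     for r in range(45): train(OReN(n=128), riddle_loader(r), epochs=10)
# (2) Joint training: a single, larger model over the combined 45-riddle dataset, e.g.
#     train(OReN(n=1024), combined_loader(), epochs=10)
# riddle_loader / combined_loader are hypothetical dataset loaders, not from the paper.
```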

Key Findings
  • Separate training (per-riddle):
      • OReN: 98.3% average test accuracy at N=128 (~1.0M parameters).
      • Saccadic network with LSTM units: 98.2% at N=256 (~4.7M parameters).
      • Biologically inspired saccadic networks performed best with fewer parameters: SNN 99.0% at N=128 (~0.5M); sSNU 99.0% at N=64 (~0.3M); sSNU-R 99.0% at N=32 (~0.2M).
      • Even the smallest sSNU (N=16, ~0.15M) reached 98.7%, surpassing the OReN and LSTM baselines with far fewer parameters.

  • Joint training (single model across all 45 riddles):
      • OReN best: 95.2% at N=1024 (~12M parameters).
      • sSNU-R best overall: 97.0% at N=2048 (~28M).
      • SNN: 95.3% at N=1024 (~5.5M); sSNU: 96.3% at N=256 (~1.0M); LSTM-based saccadic: 96.4% at N=512 (~12M).
      • Considering accuracy versus parameter count, sSNU-R and sSNU provided the most favorable trade-offs in both setups.

  • Human comparison and difficulty: Model accuracy showed no statistically significant correlation with the human difficulty index across riddles. Model accuracies exceed the average reported human performance (Munduruku ~66.8%, educated American adults ~84.8%) and a prior computational model (~86.7%), though under non-comparable training/testing conditions.

  • Sample efficiency during training: To reach human-level average accuracy in their setup, the SNN saccadic model needed ~3.5 epochs (~8,932 examples), while OReN (N=128) needed ~4.1 epochs (~10,524 examples).

  • Mechanism insight: Visualization shows OReN performs pairwise comparisons in space (simultaneous embedding pairs), while the saccadic network performs comparisons over time via membrane potential dynamics; both rely on repeated application of shared computations (parameter reuse) and accumulation—over space for OReN, over time for saccadic SNU layers.

Discussion

The findings demonstrate that a biologically inspired architecture combining simulated saccadic input sequences with stateful neuron dynamics (SNU/LIF-like) supports visual analytic reasoning, achieving higher accuracy with fewer parameters than an RN-based approach adapted to the task. This addresses the central question of whether temporal dynamics can substitute for spatial pairwise comparisons: the saccadic network implicitly stores context in neuronal states and evaluates relations sequentially, paralleling OReN's explicit spatial comparisons. The results suggest that taking cues from biological vision (saccades) and neural dynamics can yield more parameter-efficient reasoning systems, with potential benefits for low-power and edge applications. Although the models surpass reported human averages, a direct comparison is inappropriate because the experimental conditions differ (the models are trained on extensive data drawn from the same generative process). The mechanistic analyses further confirm that both architectures share a fundamental comparison-and-aggregation mechanism realized in different domains (time vs. space), providing a conceptual bridge between biologically inspired and conventional relational models.

Conclusion

The paper contributes a procedurally generated visual oddity dataset aligned with core geometric concepts, an Oddity Relation Network (OReN) tailored to the task, and a novel biologically inspired saccadic neural architecture using SNU-based dynamics. Empirically, saccadic models (especially sSNU and sSNU-R) outperformed OReN in accuracy-parameter efficiency across both separate and joint training regimes, and matched or exceeded LSTM-based versions with much smaller models. Mechanistic visualizations support the hypothesis that saccadic networks realize relational reasoning over time, while RNs do so over space. These results point to efficient, brain-inspired pathways for incorporating abstract reasoning into vision systems, potentially impactful for edge scenarios such as anomaly detection and object classification. Future work could explore more human-comparable learning setups (few-shot or no training on task-specific samples), direct energy measurements and neuromorphic implementations, robustness and generalization to out-of-distribution concepts, and learning of saccade policies rather than fixed permutations.

Limitations
  • The comparison with humans is not direct: models are trained and tested on many procedurally generated samples from the same distribution, whereas humans saw single instances without prior task-specific training.
  • The joint training setting remains more challenging; spiking SNN variants showed relatively larger accuracy drops compared to soft SNUs.
  • Energy efficiency is argued from architectural principles; no direct runtime or power measurements are reported.
  • Saccade sequences are synthetically fixed to balanced permutations rather than learned or measured from human behavior.
  • The dataset, while diverse, is procedurally defined around 45 known concepts; generalization to novel geometric concepts or real-world visuals is not evaluated.