Cognitive Science
Brain-optimized deep neural network models of human visual areas learn non-hierarchical representations
G. St-Yves, E. J. Allen, et al.
This study by Ghislain St-Yves and colleagues explores whether hierarchical representations are a must for predicting brain activity in the human visual system. Surprisingly, they find that a multi-branch deep neural network lacking hierarchical structure across areas matched its single-branch, hierarchical counterpart, challenging prevailing assumptions about brain-like DNN architectures. Discover how insights from human visual areas V1–V4 could reshape our understanding of neural representation!
~3 min • Beginner • English
Introduction
The study examines whether hierarchical representations are essential for accurately predicting human brain activity in early-to-intermediate visual areas (V1–V4) using DNN-based encoding models. Prior work shows that task-optimized DNNs align their layer depth to the cortical hierarchy and can predict brain responses to natural images. The authors note that hierarchy has multiple meanings and define three distinct representational forms: compositional (higher-level representations require more nonlinear compositions), entailment (lower-level representations are necessary preprocessing for higher-level ones), and integration (higher-level representations integrate over larger spatial regions/receptive fields). Given these distinctions, the research question is whether accurate encoding models must embody compositional and/or entailment hierarchies, or whether non-hierarchical architectures can achieve comparable accuracy. The authors test this by comparing task-optimized and brain-optimized DNNs with single-branch and multi-branch architectures and evaluating their learned hierarchies relative to fMRI responses in V1–V4.
Literature Review
- Hierarchical processing is a long-standing organizing principle in primate vision; some CNN architectures were inspired by this concept. Prior findings show alignment between DNN layer depth and cortical hierarchy across primate visual areas. Inactivation of V1 reduces activity in downstream areas (V2–V4), supporting certain anatomical or causal hierarchies.
- Integration hierarchy (increasing receptive field sizes) is a robust empirical phenomenon across visual areas.
- DNN-based encoding models have successfully predicted neural responses to natural images; many works have explored end-to-end system identification and improvements over classical models.
- Despite the recurrent anatomical connectivity of visual cortex, very deep feedforward DNNs can match recurrent models in predicting neural signals in some contexts, suggesting that representational hierarchies may not map directly onto architectural constraints.
- The Natural Scenes Dataset (NSD) enables large-scale modeling with high-resolution 7T fMRI and shared validation images, facilitating cross-subject and cross-model comparisons.
Methodology
- Data: Used the Natural Scenes Dataset (NSD) with 8 subjects, 7T fMRI (1.8-mm voxels), 22–30K trials per subject. Subjects viewed 9–10K natural scenes (mostly 3 repeats), with 1K images shared across subjects for validation. Voxels assigned to V1–V4 via independent pRF retinotopic mapping. An additional session with 284 artificial stimuli (noise, gratings, contrast/phase modulations, etc.) was used for out-of-distribution validation.
- Models: Encoding models consist of a feature extractor (a DNN) and voxelwise read-out heads. Read-out heads sample features from all layers, specify spatial pooling fields (Gaussian or flexible), apply compressive nonlinearities, and linearly weight the pooled features to predict voxel responses (see the read-out sketch after this list).
- Task-optimized model: AlexNet pretrained on ImageNet; feature extractor frozen; read-out parameters trained. Two read-out variants: Gaussian pooling fields (gpf) via grid search + ridge regression; flexible pooling fields (fpf) via gradient descent.
- Brain-optimized models (GNet): End-to-end optimization to predict brain activity. Two architectures:
1) Single-branch, jointly trained across all areas/subjects (GNet8j-fpf).
2) Multi-branch, with area-specific branches trained independently per ROI (GNet8r-fpf). Both used flexible pooling fields and a tanh(x)·log(1+|x|) compressive nonlinearity applied before/after pooling; some lower-layer feature maps were optionally left unconnected to the read-out, depending on hyperparameter selection.
- Training: GNet trained via ADAM, alternating updates between the feature extractor and the read-out heads and interleaving batches across subjects; global early stopping used a 10% held-out set; optional fine-tuning phases followed (second phase: feature extractor only; third phase: read-out heads only). AlexNet read-outs trained via ridge regression/grid search (gpf) or gradient descent (fpf). A Gabor baseline model was also included.
- Evaluation: Prediction accuracy measured voxelwise as the Pearson correlation between predicted and measured responses on held-out validation trials (the shared 1K natural images; the artificial-stimulus set). Win percentage computed as the fraction of voxels for which one model outperforms another. Noise ceilings estimated in order to compute the fraction of explainable variance explained (see the metric sketch after this list).
- Hierarchy analyses:
- Integration hierarchy: Derived spatial pooling fields; fitted elliptical Gaussians; examined size–eccentricity relationships and expansion across areas.
- Compositional hierarchy: Partitioned DNN layers into bottom vs top halves; computed specific and unique contributions of bottom layers to explained variance per area by masking features and comparing accuracies.
- Entailment hierarchy: Transfer learning. Trained area-specific reference models; froze feature extractors; retrained read-out heads to predict other areas’ responses; constructed 4×4 prediction-accuracy shift matrices Δ_ij = p^ref_ij − p^tr_ij. Compared these matrices to ideal patterns and summarized the comparison with an index α (α=1 strict entailment; α=0 no entailment; α<0 reverse). Transfer analysis applied to model outputs (AlexNet, GNet8j, GNet8r) and, separately, to measured brain activity (noting noise considerations); a hedged sketch of α follows this list.
- Output similarity: Correlated model outputs voxelwise across 1000 images for each pair of models.
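To make the read-out heads concrete, here is a minimal sketch of a voxelwise read-out with a Gaussian pooling field, assuming PyTorch. The function names, shapes, and the placement of GNet's tanh(x)·log(1+|x|) nonlinearity (applied here only after pooling, over features from a single layer) are illustrative simplifications, not the authors' exact implementation.

```python
import torch

def gaussian_pooling_field(height, width, mu_x, mu_y, sigma):
    """Normalized 2-D Gaussian over feature-map coordinates in [-1, 1]."""
    ys = torch.linspace(-1.0, 1.0, height)
    xs = torch.linspace(-1.0, 1.0, width)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    g = torch.exp(-((xx - mu_x) ** 2 + (yy - mu_y) ** 2) / (2 * sigma ** 2))
    return g / g.sum()  # weights sum to 1, so pooling is a weighted average

def readout(feature_maps, pool, weights, bias):
    """Pool each channel spatially, then linearly combine channels for one voxel.

    feature_maps: (channels, H, W) activations from one DNN layer
                  (the full model samples features from all layers)
    pool:         (H, W) pooling field for this voxel
    weights:      (channels,) voxelwise feature weights
    """
    pooled = (feature_maps * pool).sum(dim=(-2, -1))         # (channels,)
    pooled = torch.tanh(pooled) * torch.log1p(pooled.abs())  # compressive nonlinearity
    return weights @ pooled + bias                           # scalar voxel prediction
```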
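The evaluation metrics are equally simple to state. Below is a sketch in NumPy, assuming response matrices of shape (n_images, n_voxels). Expressing the fraction of explainable variance as r² divided by the squared noise-ceiling correlation follows a common convention and is our assumption, not necessarily the paper's exact estimator.

```python
import numpy as np

def voxelwise_pearson(pred, meas):
    """Pearson r between predicted and measured responses, per voxel."""
    p = pred - pred.mean(axis=0)
    m = meas - meas.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (m ** 2).sum(axis=0)) + 1e-12
    return (p * m).sum(axis=0) / denom

def win_percentage(r_model_a, r_model_b):
    """Fraction of voxels where model A predicts better than model B."""
    return float(np.mean(r_model_a > r_model_b))

def fraction_explainable_variance(r, noise_ceiling_r):
    """Variance explained relative to the explainable (noise-ceiling) variance."""
    return r ** 2 / noise_ceiling_r ** 2
```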
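Finally, the entailment index α can be illustrated as follows. The paper summarizes each 4×4 shift matrix by comparing it to ideal patterns; the least-squares projection and the sign convention of the ideal pattern below (under strict entailment, transferring a lower-area extractor upward loses accuracy, while downward transfer is lossless) are our assumptions about that comparison, not the published formula.

```python
import numpy as np

def shift_matrix(p_ref, p_tr):
    """Delta[i, j] = reference accuracy on target area j minus accuracy after
    transferring area i's frozen feature extractor to predict area j."""
    return p_ref - p_tr

def ideal_entailment_pattern(n_areas=4):
    """Assumed ideal: upward transfer (lower source i to higher target j)
    costs accuracy; downward or same-area transfer is lossless."""
    pattern = np.zeros((n_areas, n_areas))
    for i in range(n_areas):
        for j in range(n_areas):
            if i < j:  # lower-area extractor lacks higher-area compositions
                pattern[i, j] = 1.0
    return pattern

def entailment_index(delta):
    """Least-squares projection of the observed shift matrix onto the ideal:
    alpha = 1 strict entailment, 0 no entailment, < 0 reversed."""
    ideal = ideal_entailment_pattern(delta.shape[0])
    return float((delta * ideal).sum() / (ideal * ideal).sum())
```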
Key Findings
- Accuracy of brain-optimized vs task-optimized models:
- Across subjects, the single-branch GNet model (GNet8j-fpf) outperformed AlexNet-based models for a majority of voxels in V1–V4: more than 68% of voxels per subject for natural images; average win percentage ≈80% (p<1e-4, two-sided t-test).
- For artificial stimuli, GNet outperformed AlexNet for more than 76% of voxels per subject, and the average win percentage was significantly higher than the ≈80% observed for natural stimuli (p<1e-5). In V4, the win percentage improved from 62% (natural) to 74% (artificial) (p<0.01, paired t-test).
- Explainable variance: With fine-tuning, GNet models explained between ~37% and ~78% of explainable variance on average, depending on area and condition (area-level averages over voxels with ≥5% explainable variance); the continued gains from fine-tuning indicate the ceiling has not been reached.
- Single-branch vs multi-branch brain-optimized models:
- GNet8j (single-branch, joint) and GNet8r (multi-branch, ROI-wise) achieved nearly identical prediction accuracy overall. In V1–V3, win percentages were near parity; in V4, GNet8r had ~68% win. Capacity-matched controls found no clear advantage of one class (Fig. S1).
- Integration hierarchy:
- Both GNet8j and GNet8r recovered known retinotopic organization and showed increasing receptive field sizes with eccentricity and across areas V1→V4 (size–eccentricity relationships fanning out), indicating a robust integration hierarchy in both architectures.
- Compositional hierarchy:
- Single-branch models (AlexNet and GNet8j) showed monotonic decline in the unique contribution of bottom layers from V1 to V4 (~60%→~30%), with corresponding increase for top layers, consistent with compositional hierarchy (layer depth aligned to V1→V4).
- Multi-branch GNet8r showed little decline; bottom layers uniquely contributed ~75% in V1–V3 and >90% in V4, indicating no compositional hierarchy across branches (similar depth suffices across areas).
- Entailment hierarchy (transfer learning index α):
- AlexNet: α = 1.0 ± 0.2 (near strict entailment).
- GNet8j: α = 0.9 ± 0.2 (strong entailment).
- GNet8r: α = 0.3 ± 0.1 (closest to non-entailment).
- Applied directly to measured brain activity: α = 0.0 ± 0.1 (no entailment), though interpretation is complicated by noise.
- Output correlations:
- Outputs of GNet8r and GNet8j were highly correlated across areas, even in V4. Correlations to AlexNet were lower and decreased toward higher areas. Differences due to optimization objective (task vs brain) were larger than differences due to hierarchical vs non-hierarchical representations.
Discussion
The study directly tests whether hierarchical representations are required for accurate prediction of human visual cortex activity. Both task-optimized (AlexNet) and brain-optimized single-branch models exhibited compositional and entailment hierarchies and performed well, consistent with prior findings that layer depth aligns with cortical hierarchy. However, an alternative, brain-optimized multi-branch model (GNet8r) achieved comparable predictive accuracy without exhibiting compositional or entailment hierarchies. All models nonetheless expressed a robust integration hierarchy (increasing receptive field sizes from V1 to V4). These results demonstrate that compositional and entailment hierarchies are not necessary for accurate encoding of fMRI responses in V1–V4, and that DNN architecture does not uniquely determine the representational organization learned. The strong output correlations between hierarchical and non-hierarchical brain-optimized models highlight that subtle differences in output can underlie different hierarchical properties, and that architectural resemblance to cortical anatomy is not a prerequisite for predictive success. The findings argue for precise use of the term “hierarchy” and suggest that visual representation gradients (e.g., receptive field size, invariance, complexity) may not imply the presence of compositional or entailment hierarchies. The results also support the view that diverse visual representations across areas may subserve multiple, potentially independent functions rather than a strictly serial computation. Consequently, studying model outputs and their functional geometry, rather than enforcing architectural analogies, may be more fruitful for understanding brain-like visual representations.
Conclusion
This work introduces brain-optimized DNN encoding models that predict human V1–V4 activity with state-of-the-art accuracy and shows that non-hierarchical (multi-branch) architectures can match single-branch models while lacking compositional and entailment hierarchies. Only integration hierarchy (receptive field expansion) consistently emerged across model types. These findings decouple architectural hierarchy from representational hierarchy in successful encoding models, caution against equating DNN layer hierarchies with cortical area hierarchies, and motivate focusing on the structure of learned representations. Future directions include: applying these analyses to higher-temporal-resolution neural data (e.g., spikes) to probe time-varying hierarchies; extending transfer analyses across time points post-stimulus; and leveraging brain-optimized representations to assess their suitability across diverse tasks, potentially requiring multi-objective optimization for brain-like AI systems.
Limitations
- Measurement modality: fMRI’s low temporal resolution collapses dynamic responses into beta values, intermixing feedforward and feedback effects that may obscure time-dependent hierarchical organization.
- Noise: Transfer learning applied directly to measured brain activity is sensitive to noise, complicating interpretation of entailment indices.
- Incomplete variance explained: Even the best brain-optimized models explain only ~37–78% of explainable variance (area-wise averages for reasonably predictable voxels), indicating substantial room for improvement.
- Architectural capacity: Although capacity-matched controls suggest no clear advantage between single- and multi-branch models, differences in parameterization could still subtly affect learning dynamics.
- Anatomical inference: Comparable predictive accuracy from architecturally disparate models implies that encoding model success does not straightforwardly map to anatomical connectivity or causality, limiting anatomical conclusions.