Introduction
Deep neural networks (DNNs) trained on computer vision tasks produce representations that accurately predict human brain responses to visual stimuli. One hypothesis explaining this success is that DNNs implement a hierarchy of visual processing stages that mirrors the hierarchical processing believed to be a crucial organizing principle in primate vision. Support for this comes from the alignment of layer depth in task-optimized DNNs with the hierarchical progression of visual maps in primate cortex. This study investigated whether non-hierarchical models can achieve comparable predictive accuracy. The concept of "hierarchy" is multifaceted. It encompasses a compositional hierarchy (higher-level representations require more nonlinear processing steps), an entailment hierarchy (lower-level representations act as preprocessing stages for higher-level ones), and an integration hierarchy (spatial integration windows expand from lower to higher areas). The study tested whether each type of hierarchy is necessary for predicting brain activity with state-of-the-art (SOTA) accuracy.
Literature Review
Previous research has shown a correlation between the depth of layers in task-optimized DNNs and the hierarchical organization of visual areas in the primate brain. This has led to the hypothesis that hierarchical representations are crucial for accurately predicting brain activity. However, this study challenges this notion by investigating whether non-hierarchical models can achieve comparable predictive accuracy. Several studies have explored the use of DNNs to model brain activity, with some focusing on task-optimized networks like AlexNet and others on brain-optimized networks trained directly on fMRI data. The existing literature provides a foundation for comparing the performance of these different approaches and evaluating the role of hierarchical representations in the visual system. The study builds upon these previous findings by explicitly testing the three different types of hierarchies defined in the paper.
Methodology
Three distinct network-based encoding models of human visual areas V1–V4 were tested. One model used a task-optimized AlexNet pre-trained on ImageNet. The other two models used brain-optimized DNNs with different architectures: a single-branch DNN trained jointly on all four visual areas and a multi-branch DNN with each branch trained independently to predict activity in a single visual area. Each encoding model consisted of a DNN and a read-out head that transformed DNN activity into predictions of brain activity measured in individual voxels. The Natural Scenes Dataset (NSD), containing fMRI BOLD responses to thousands of natural scenes from eight subjects, was used for training and validation. Model prediction accuracy was assessed by correlating predicted activity with measured activity for held-out images. The study also included an experiment with artificial stimuli (gratings, noise, etc.) to validate and compare the models. To investigate the three types of hierarchies, analyses examined the contribution of different layers to prediction accuracy (compositional hierarchy), the transferability of representations across areas (entailment hierarchy), and the spatial pooling fields (integration hierarchy).
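The accuracy measure described above (correlating predicted with measured activity on held-out images) can be illustrated with a minimal sketch. It assumes accuracy is a plain Pearson correlation computed separately for each voxel. The function names `pearson` and `voxelwise_accuracy` and the toy numbers are hypothetical and are not taken from the study's code. It uses only the Python standard library to stay self-contained; real fMRI data would normally be handled with an array library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # correlation is undefined for a constant signal
    return sxy / math.sqrt(sxx * syy)

def voxelwise_accuracy(predicted, measured):
    """Return one correlation per voxel.

    Both arguments are lists of rows: one row per held-out image,
    one column per voxel (an assumed layout).
    """
    pred_cols = list(zip(*predicted))  # transpose to one tuple per voxel
    meas_cols = list(zip(*measured))
    if len(pred_cols) != len(meas_cols):
        raise ValueError("predicted and measured must have the same voxels")
    return [pearson(p, m) for p, m in zip(pred_cols, meas_cols)]

# Toy data: 3 held-out images, 2 voxels (made-up values).
measured = [[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]]
predicted = [[2.0, 0.0], [4.0, 1.0], [6.0, 2.0]]
accuracy = voxelwise_accuracy(predicted, measured)
```

In this toy case the first voxel is predicted up to a scale factor and the second is predicted with the wrong sign, so the two correlations sit at opposite ends of the range.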
Key Findings
Brain-optimized single-branch GNet models significantly outperformed the AlexNet model in predicting brain activity for both natural and artificial stimuli in V1–V4, exceeding the AlexNet model's accuracy for over 68% of voxels for natural stimuli and over 76% for artificial stimuli. The single-branch (GNet8j) and multi-branch (GNet8r) brain-optimized networks exhibited similar prediction accuracy. Both brain-optimized models showed an integration hierarchy, characterized by an expansion of receptive field size from V1 to V4. The single-branch models (AlexNet and GNet8j) showed evidence for compositional and entailment hierarchies: lower layers contributed more to predictions in V1, and representations in anterior areas were more transferable to posterior areas. In contrast, the multi-branch GNet8r model did not demonstrate these hierarchies; in particular, analysis of the prediction accuracy shift matrix showed no entailment hierarchy for GNet8r. Despite this difference, the outputs of models with and without hierarchical representations were strongly correlated.
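The "percentage of voxels" comparisons above can be read as a simple per-voxel tally. The following is a sketch only, assuming each model already has one accuracy value per voxel and that voxels with undefined accuracy are skipped. The function name `fraction_better` and the numbers are hypothetical, and the study may have applied additional statistical testing that is not shown here.

```python
import math

def fraction_better(acc_a, acc_b):
    """Fraction of voxels where model A's accuracy exceeds model B's.

    acc_a and acc_b are equal-length sequences of per-voxel accuracies.
    Voxels where either value is NaN are excluded from the tally.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("both models must cover the same voxels")
    pairs = [
        (a, b)
        for a, b in zip(acc_a, acc_b)
        if not (math.isnan(a) or math.isnan(b))
    ]
    if not pairs:
        raise ValueError("no voxels with defined accuracy in both models")
    wins = sum(1 for a, b in pairs if a > b)
    return wins / len(pairs)

# Made-up accuracies for four voxels under two models.
brain_optimized = [0.5, 0.6, 0.2, 0.4]
task_optimized = [0.4, 0.5, 0.3, 0.1]
share = fraction_better(brain_optimized, task_optimized)  # 0.75
```

A tally like this says how often one model wins but not by how much, so it is best read alongside the per-voxel accuracy values themselves.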
Discussion
The findings challenge the notion that hierarchical representations are essential for accurate prediction of brain activity in V1–V4. The high accuracy achieved by the multi-branch GNet8r model, which lacks compositional and entailment hierarchies, suggests that these two hierarchies are not necessary. The integration hierarchy, however, appears in all models and may reflect a gradient in visual representation. The study highlights the importance of precisely defining "hierarchy" when discussing visual representations. The results also suggest a more tenuous relationship between DNN architecture and cortical anatomy than previously assumed. While single-branch architectures may enforce correlations between different types of hierarchies, the success of multi-branch models indicates that these correlations are not essential for accurate brain activity prediction. This suggests shifting the focus from architecture to model outputs, treating them as approximations to brain activity that can be analyzed for the functional and geometric properties that improve predictive accuracy.
Conclusion
This study demonstrates that hierarchical representations are not necessary for accurate prediction of brain activity in human visual areas V1–V4. Brain-optimized networks with different architectures achieved comparable accuracy, highlighting the flexibility of DNNs in modeling brain activity. Future research should focus on analyzing the functional properties of model outputs to identify features that enhance predictive accuracy and explore the dynamic aspects of hierarchies using higher temporal resolution brain activity measures. The findings suggest the possibility that diverse representations in visual areas serve distinct, independent functions, implying that future AI systems might need to simultaneously optimize for multiple task objectives to better reflect brain-like visual representations.
Limitations
The study uses fMRI data, which has limited temporal resolution. As a result, the measured responses may intermix feedforward and feedback effects, which could influence the assessment of hierarchies. While the study addresses this limitation by analyzing the overall BOLD response, future research with higher temporal resolution data could provide a more detailed understanding of the dynamics of hierarchical processing. The study focused on three specific types of hierarchies; other types may exist and contribute to visual processing. The current analyses are limited to identifying spatial receptive fields from the trained model weights, and more detailed analysis methods should be considered.