Introduction
The natural world exhibits a high degree of object co-occurrence: certain objects frequently appear together in specific contexts. This contextual knowledge aids object recognition, visual search, and predictive behavior. Prior neuroimaging studies have explored the neural basis of contextual knowledge, typically operationalizing context as a single dimension (e.g., the strength of contextual association). Using univariate analyses, these studies have linked contextual associations to activity in scene-selective regions such as the parahippocampal place area (PPA) and retrosplenial complex (RSC). However, other object properties, such as real-world size and spatial stability, also modulate activity in these regions, leaving those findings open to alternative interpretations. The current study addresses these limitations with a multivariate approach that directly investigates contextual representations in the brain. The researchers hypothesize that neural representations reflect the multivariate statistical structure of object co-occurrence in the visual environment. To test this hypothesis, they apply a modified word2vec algorithm (object2vec) to a large corpus of real-world scenes, generating low-dimensional embeddings that capture object co-occurrence statistics. These embeddings are then mapped onto fMRI responses during object viewing to identify brain regions encoding contextual information.
Literature Review
Behavioral studies have consistently demonstrated the influence of context on object recognition, scene perception, and visual search: contextual knowledge supports faster and more accurate identification of objects in familiar settings. Theoretical frameworks suggest that this contextual facilitation reflects a predictive coding mechanism, in which the brain anticipates upcoming stimuli based on prior experience. Prior neuroimaging studies have sought to identify the neural mechanisms underlying contextual knowledge, typically through univariate analyses relating regional brain activity to the strength of contextual associations. Although these studies have linked contextual associations to activity in scene-selective areas such as the PPA and RSC, the findings remain ambiguous because other object properties, such as size and spatial stability, confound the results. Moreover, these studies did not directly investigate the multivariate structure of object context. The present study builds on this work by explicitly modeling and examining the multidimensional structure of object co-occurrence.
Methodology
The study employed fMRI to investigate the neural representations of object context. The researchers first developed object2vec, an adaptation of the word2vec algorithm that models object co-occurrence statistics. Object2vec was trained on ADE20K, a large corpus of densely labeled real-world scenes, and produced eight-dimensional embeddings capturing the statistical regularities of object co-occurrence. For comparison, they also used a language-based word2vec model trained on a large text corpus.

In the fMRI experiment, four participants viewed isolated images of 810 objects from 81 categories (10 tokens per category), presented on textured backgrounds to minimize low-level feature effects. A simple perceptual task (detecting warped objects) ensured sustained attention. Voxel-wise encoding models assessed whether fMRI responses could be predicted from the object2vec or word2vec embeddings, using a 9-fold cross-validation procedure in which entire semantic categories were held out, so that accurate prediction required generalization to new categories rather than merely to new instances within known categories.

Region-of-interest (ROI) analyses were performed on the PPA (divided into anterior and posterior segments), RSC, the occipital place area (OPA), the posterior fusiform sulcus (pFs), lateral occipital cortex (LO), and early visual cortex (EVC), complemented by whole-brain analyses. Finally, a behavioral rating task measured the real-world size and spatial stability of each object, providing additional regressors for encoding models to compare against object2vec.
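The core idea of object2vec can be illustrated with a simplified, count-based stand-in: treat each scene's annotated object labels as the model's "sentence," tally within-scene co-occurrences, and compress the resulting matrix into a low-dimensional embedding. This sketch uses a truncated SVD for the compression step; the actual object2vec model instead learns its eight-dimensional embeddings with a word2vec-style predictive objective, and the toy scene lists below are invented examples, not ADE20K data.

```python
import numpy as np

# Toy scene corpus: each "scene" is the list of objects annotated in it
# (in the study, this role is played by ADE20K scene annotations).
scenes = [["stove", "sink", "fridge"], ["stove", "fridge", "table"],
          ["car", "road", "streetlight"], ["car", "road", "sign"]]
labels = sorted({obj for scene in scenes for obj in scene})
index = {lab: i for i, lab in enumerate(labels)}

# Object-by-object co-occurrence counts within scenes
counts = np.zeros((len(labels), len(labels)))
for scene in scenes:
    present = [index[o] for o in set(scene)]
    for i in present:
        for j in present:
            if i != j:
                counts[i, j] += 1

# Compress the (log-scaled) co-occurrence matrix to a low-dimensional
# embedding via truncated SVD; object2vec learns such embeddings with a
# predictive word2vec-style objective rather than an explicit SVD.
u, s, _ = np.linalg.svd(np.log1p(counts))
emb = u[:, :2] * s[:2]  # one 2-d embedding per object label

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Objects sharing a context have more similar co-occurrence profiles
rows = np.log1p(counts)
same = cos(rows[index["stove"]], rows[index["fridge"]])
diff = cos(rows[index["stove"]], rows[index["car"]])
print(same > diff)  # True: stove is closer to fridge than to car
```

The key property, preserved by both the count-based sketch and the learned model, is that objects from the same context end up with similar representations even when they never appear in the same image.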
Key Findings
The ROI analyses revealed that both object2vec and word2vec significantly predicted fMRI responses across multiple brain regions. Object2vec showed the highest prediction accuracy in the anterior PPA, indicating a strong link between this region and the representation of visual object contexts; the difference in prediction accuracy between object2vec and word2vec was also larger in the anterior PPA than in the posterior PPA. Both models showed significant prediction accuracy in the other scene-selective regions (RSC, OPA) and in object-selective cortex (pFs).

Whole-brain analyses corroborated these findings: a cluster of high prediction accuracy for object2vec overlapped with the anterior PPA and extended into parahippocampal cortex, while a cluster of high prediction accuracy for word2vec appeared in a neighboring region overlapping pFs and the PPA. A preference map showed that object2vec yielded higher prediction accuracies in the anterior PPA, whereas word2vec performed better in lateral ventral visual cortex (pFs, LO).

An exploratory analysis found a positive correlation between voxel-wise differences in scene versus object selectivity and the differences in prediction accuracy between object2vec and word2vec, suggesting that voxels better explained by object co-occurrence statistics tend to show greater scene selectivity. Finally, comparison with a model based on object spatial properties revealed that object2vec explained more variance in the anterior PPA, extending beyond the PPA boundary into parahippocampal cortex.
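The voxel-wise encoding analysis behind these comparisons can be sketched as follows. The simulated data, the simple closed-form ridge regression, and the fold construction are illustrative stand-ins for the study's pipeline; the essential features reproduced here are that whole object categories are held out during cross-validation and that each voxel is scored by the correlation between predicted and observed responses, with a "preference" computed as the accuracy difference between two feature sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def encoding_accuracy(X, Y, categories, n_folds=9):
    """Cross-validated voxel-wise prediction accuracy: hold out whole
    object categories so the model must generalize to unseen categories,
    then score each voxel by the correlation between predicted and
    observed responses."""
    cats = np.unique(categories)
    folds = np.array_split(cats, n_folds)
    preds = np.zeros_like(Y)
    for fold in folds:
        test = np.isin(categories, fold)
        W = ridge_fit(X[~test], Y[~test])
        preds[test] = X[test] @ W
    # per-voxel Pearson r between predicted and observed responses
    pz = (preds - preds.mean(0)) / preds.std(0)
    yz = (Y - Y.mean(0)) / Y.std(0)
    return (pz * yz).mean(0)

# Synthetic example: 81 categories x 10 tokens, 8-d features, 50 voxels
categories = np.repeat(np.arange(81), 10)
X_obj = rng.normal(size=(810, 8))               # stand-in for object2vec features
W_true = rng.normal(size=(8, 50))
Y = X_obj @ W_true + rng.normal(size=(810, 50))  # simulated voxel responses
X_word = rng.normal(size=(810, 8))              # stand-in for word2vec features

r_obj = encoding_accuracy(X_obj, Y, categories)
r_word = encoding_accuracy(X_word, Y, categories)
preference = r_obj - r_word  # >0: voxel better explained by "object2vec"
```

By construction, the simulated responses are driven by `X_obj`, so `preference` comes out positive; in the real data, mapping this quantity across voxels is what produces the object2vec-versus-word2vec preference map.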
Discussion
These findings indicate that object representations in visual cortex reflect the statistical regularities of object co-occurrence in both visual scenes and language. The strong prediction accuracy of object2vec in the anterior PPA supports this region's role in linking objects to their visual contexts, while the superior performance of word2vec in object-selective regions suggests that those regions encode object properties tied to language-based co-occurrence statistics. This distinction may reflect the encoding of different types of semantic associations (thematic vs. taxonomic). The results are consistent with the efficient coding hypothesis, under which cortical regions are tuned to the natural statistics of sensory input, compressing contextual information into a low-dimensional representation. The overlapping variance explained by object2vec and the spatial-properties model indicates that object co-occurrence statistics are partly related to physical object characteristics.
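The notion of "overlapping explained variance" between two encoding models can be made concrete with a simple variance-partitioning scheme: fit each model alone and both together, then read off shared and unique portions of the explained variance. The sketch below uses in-sample ordinary least squares on simulated data for brevity; the study's estimates are instead cross-validated, and the feature sets here are invented stand-ins for the object2vec and spatial-property regressors.

```python
import numpy as np

rng = np.random.default_rng(1)

def r_squared(X, y):
    """Explained variance of an ordinary least-squares fit (in-sample;
    the study uses cross-validated estimates instead)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Simulated voxel response driven by two partially correlated feature sets
A = rng.normal(size=(810, 8))             # stand-in for object2vec features
B = 0.5 * A + rng.normal(size=(810, 8))   # stand-in for spatial-property ratings
y = A @ rng.normal(size=8) + 0.1 * rng.normal(size=810)

r2_a = r_squared(A, y)
r2_b = r_squared(B, y)
r2_ab = r_squared(np.hstack([A, B]), y)

shared = r2_a + r2_b - r2_ab   # variance both models can explain
unique_a = r2_ab - r2_b        # variance only model A explains
```

Because `B` is correlated with `A`, both `shared` and `unique_a` come out positive here, mirroring the paper's observation that the co-occurrence and spatial-property models overlap in the variance they explain while object2vec retains unique explanatory power in the anterior PPA.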
Conclusion
This study provides compelling evidence that the human visual system encodes the statistical regularities of object co-occurrence in both visual scenes and language. The anterior PPA plays a crucial role in representing visual object contexts. The findings support the efficient coding hypothesis in high-level visual processing. Future research could investigate the representation of object locations, the learning mechanisms underlying object co-occurrence representations, and the integration of visual and linguistic models to create richer models of object semantics.
Limitations
The study used a relatively small number of participants (four). The findings may be influenced by the specific datasets used for training the object2vec and word2vec models. The exploratory analysis on category selectivity requires further investigation with a larger sample size. The study focused on object co-occurrence independently of object location, which could be explored in future research.