Object representations in the human brain reflect the co-occurrence statistics of vision and language

Psychology


M. F. Bonner and R. A. Epstein

This study by Michael F. Bonner and Russell A. Epstein uncovers how the brain encodes object co-occurrence statistics derived from visual environments and from language. Using machine-learning embeddings and fMRI encoding models, they show that cortical responses to isolated objects reflect the contexts in which those objects typically appear, shedding new light on how contextual knowledge shapes visual perception.

Introduction
The study investigates whether and how the human visual cortex encodes the statistical regularities of object co-occurrence found in natural environments. Contextual knowledge about which objects tend to appear together (e.g., kettles with mugs in kitchens) facilitates recognition and search, suggesting predictive mechanisms in perception. Prior neuroimaging implicated scene-selective regions, the parahippocampal place area (PPA) and retrosplenial complex (RSC), in contextual processing, but earlier approaches were largely univariate and one-dimensional (measuring only the strength of contextual association), were confounded by other object properties (e.g., real-world size, spatial stability), and did not capture which objects are associated with which. The authors hypothesize that multivariate cortical responses to single objects will reflect the low-dimensional statistical structure of object co-occurrence and that such contextual representations will be elicited even when objects are presented in isolation. They further test whether visual co-occurrence statistics (from images) and linguistic co-occurrence statistics (from text) map onto distinct cortical regions.
Literature Review
Earlier work showed that contextual associations modulate activity in scene-selective regions (PPA, RSC), typically using one-dimensional ratings of contextual association strength. However, similar variance can be explained by real-world size and spatial stability, and some contextual effects have been inconsistent. Topic-model approaches applied to scenes suggest that statistically derived scene categories predict cortical responses (Stansbury et al.). Work comparing visual and linguistic co-occurrence found partial correlation but also divergences, with visual statistics capturing cross-category associations that are less evident in language (Sadeghi et al.). Other findings demonstrate that scene regions encode multiple high-level object properties and are sensitive to mid-level image features, complicating interpretations of their selectivity. This background motivates modeling the latent multivariate structure of object co-occurrence and testing its neural correlates with isolated object stimuli.
Methodology
Models: The authors developed object2vec by adapting the word2vec CBOW algorithm (as implemented in fastText) to image annotations, learning low-dimensional embeddings that capture object co-occurrence in scenes. The training corpus was ADE20K, with 22,210 annotated scenes and 2,959 cleaned unique object labels; each image contributed a bag of object labels (no repeats), with context defined as all other objects in the same image. Embeddings were initially 10-D, trained across 100 random initializations (1,000 epochs, negative sampling with 20 samples), and representational similarity analyses indicated high correspondence across dimensionalities. PCA showed that 8 components explained over 90% of the variance, so the final object2vec representation used 8 PCs derived across initializations. For language, 300-D word2vec embeddings (trained on Google News, ~100B words) were filtered to the WordNet vocabulary and reduced via PCA to 30 PCs (elbow criterion); for each of the 81 object categories, the embeddings of its associated names (singular/plural/synonyms from ADE20K labels) were averaged.

Stimuli and fMRI: 810 images of isolated objects (81 categories × 10 exemplars) were displayed on complex textured backgrounds optimized to reduce correlations with low-to-mid-level CNN features (AlexNet); backgrounds were DeepDream-based composites assigned via a stochastic optimization that minimized RSA correlations between CNN-layer RDMs and the object2vec/word2vec RDMs. Warped versions of the stimuli (diffeomorphic transforms) served as targets in a category-detection task, ensuring attention without requiring explicit context retrieval. fMRI was acquired on a Siemens 3T Prisma (2 mm isotropic voxels, multiband factor 3) using a mini-block design (5 images per block, 500 ms each with 500 ms ISI, 4.5 s blocks, plus inter-block intervals and null events). Four subjects participated across multiple sessions. Preprocessing in SPM12 comprised realignment, coregistration, MNI warping, and 6 mm smoothing; a GLM estimated voxel-wise betas per category, betas were z-scored within runs, and split-half reliability (odd vs. even runs) was computed, with a reliability mask applied (r ≥ 0.1841).

ROIs: Functionally defined scene-selective PPA, OPA, and RSC (scenes > objects), object-selective LO and pFs (objects > scrambled), and EVC (scrambled > scenes). Each ROI comprised the top 50 voxels per hemisphere within parcels passing the reliability threshold; PPA was additionally segmented into anterior and posterior thirds (25 voxels per hemisphere each).

Encoding models: Voxel-wise ordinary least squares regression mapped model features (object2vec 8 PCs, word2vec 30 PCs, or spatial-property ratings) to category responses, with 9-fold cross-validation aligned with stimulus assignment (distinct category folds per run); out-of-sample prediction accuracy was the Pearson correlation between predicted and actual responses across all categories. Nuisance regressors capturing low-level features were derived from AlexNet convolutional layers (layer 1–5 outputs averaged across exemplars per category, concatenated, and reduced via PCA to 20 PCs); these nuisance PCs were included during training to estimate weights but were not applied when generating predictions for test folds.

Statistics: Significance was assessed via permutation tests (5,000 permutations shuffling category labels within folds), both voxel-wise and ROI-wise, with FDR correction for voxel-wise maps; whole-brain group maps were averaged across subjects. Preference maps contrasted models by subtracting prediction accuracies (negative accuracies set to zero before differencing), with permutation-based significance and FDR correction. Minimal code sketches of these steps appear below.
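To make the object2vec pipeline concrete, here is a minimal sketch assuming gensim's Word2Vec as a stand-in for the fastText CBOW implementation the authors used; `scene_annotations` is a hypothetical placeholder for the cleaned ADE20K label lists, and the window size is set wide enough that the context for each object is all other objects in the same image.

```python
# Minimal sketch of object2vec training (assumptions: gensim stands in for
# fastText CBOW; scene_annotations is a placeholder for ADE20K labels).
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

scene_annotations = [
    ["kettle", "mug", "stove", "counter"],    # one "sentence" per image
    ["bench", "tree", "path", "streetlight"],
]

embeddings = []
for seed in range(100):                       # 100 random initializations
    model = Word2Vec(
        sentences=scene_annotations,
        vector_size=10,                       # 10-D embeddings
        window=max(len(s) for s in scene_annotations),  # whole-image context
        sg=0,                                 # CBOW
        negative=20,                          # negative sampling, 20 samples
        min_count=1,
        epochs=1000,
        seed=seed,
    )
    vocab = sorted(model.wv.key_to_index)     # same vocab across models
    embeddings.append(np.vstack([model.wv[w] for w in vocab]))

# Concatenate across initializations and keep the 8 PCs that explained
# over 90% of the variance in the paper.
object2vec = PCA(n_components=8).fit_transform(np.hstack(embeddings))
```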
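The reliability mask can be sketched in a few lines; `betas_odd` and `betas_even` are hypothetical placeholders for the (categories × voxels) arrays of z-scored betas averaged over odd and even runs.

```python
# Minimal sketch of the split-half reliability mask (placeholder inputs).
import numpy as np

def reliability_mask(betas_odd, betas_even, threshold=0.1841):
    """Keep voxels whose odd/even-run response profiles correlate at r >= threshold."""
    n_vox = betas_odd.shape[1]
    r = np.array([np.corrcoef(betas_odd[:, v], betas_even[:, v])[0, 1]
                  for v in range(n_vox)])
    return r >= threshold
```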
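A minimal sketch of the voxel-wise encoding analysis follows, assuming NumPy and scikit-learn; `KFold` is a simplification of the paper's run-aligned category folds, and `X`, `N`, and `Y` are hypothetical placeholders for the model features, AlexNet nuisance PCs, and category-level betas.

```python
# Minimal sketch of the cross-validated encoding model (KFold simplifies
# the paper's run-aligned folds; X, N, Y are placeholders).
import numpy as np
from sklearn.model_selection import KFold

def encoding_accuracy(X, N, Y, n_folds=9):
    """X: (n_categories, n_model_features), N: (n_categories, n_nuisance_pcs),
    Y: (n_categories, n_voxels). Returns per-voxel prediction accuracy."""
    preds = np.zeros_like(Y)
    for train, test in KFold(n_splits=n_folds).split(X):
        # Fit OLS with model features, nuisance PCs, and an intercept ...
        D = np.column_stack([X[train], N[train], np.ones(len(train))])
        B, *_ = np.linalg.lstsq(D, Y[train], rcond=None)
        # ... but predict from the model-feature weights only: nuisance PCs
        # are used in training and dropped at prediction time.
        W, b0 = B[:X.shape[1]], B[-1]
        preds[test] = X[test] @ W + b0
    # Accuracy = Pearson correlation of predicted vs. actual, per voxel.
    return np.array([np.corrcoef(preds[:, v], Y[:, v])[0, 1]
                     for v in range(Y.shape[1])])
```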
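Finally, the preference-map contrast reduces to zeroing negative accuracies before differencing, as in this sketch (`acc_a` and `acc_b` would be per-voxel accuracy arrays from the function above).

```python
# Minimal sketch of the model-preference contrast described above.
import numpy as np

def preference_map(acc_a, acc_b):
    """Positive values favor model A, negative values favor model B;
    negative accuracies are clipped to zero before differencing."""
    return np.clip(acc_a, 0, None) - np.clip(acc_b, 0, None)

# e.g., preference = preference_map(acc_object2vec, acc_word2vec)
```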
Spatial properties model: Behavioral ratings of real-world size (1–8 scale with reference objects) and spatial stability (1–5 scale for how frequently an object changes position) were collected on Amazon Mechanical Turk (with high inter-rater reliability) and used as regressors in separate encoding models. Exploratory analyses: Principal component analysis was applied to the voxel-wise regression weights of significantly predicted voxels to visualize tuning axes for object2vec and word2vec (see the sketch below).
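A minimal sketch of this exploratory tuning-axis analysis, assuming scikit-learn; the weight and feature matrices here are random placeholders standing in for the fitted regression weights of significant voxels and the object2vec category features.

```python
# Minimal sketch of the PCA-on-weights analysis (random placeholder data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
W_sig = rng.standard_normal((500, 8))   # placeholder: weights, 500 voxels x 8 PCs
X = rng.standard_normal((81, 8))        # placeholder: object2vec category features

pca = PCA(n_components=2).fit(W_sig)
axes = pca.components_                  # (2, 8) tuning axes in feature space
scores = X @ axes.T                     # (81, 2) category projections
# Plotting `scores` is how one would visualize structure such as the broad
# indoor-outdoor distinction reported in the paper.
```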
Key Findings
- Both object2vec (vision-based co-occurrence) and word2vec (language-based co-occurrence) significantly predicted fMRI responses to isolated objects in PPA, with the strongest effects for object2vec in anterior PPA. An interaction showed a greater advantage of object2vec over word2vec in anterior versus posterior PPA (permutation test p = 0.002). Exact ROI p-values (Fig. 4): object2vec, anterior PPA p = 2.0e-04, posterior PPA p = 2.4e-03; word2vec, anterior PPA p = 4.0e-04, posterior PPA p = 2.0e-04.
- Across ROIs (Fig. 5): object2vec significantly predicted responses in scene-selective OPA (p = 4.0e-04), PPA (p = 2.0e-04), and RSC (p = 2.0e-04), and in object-selective pFs (p = 8.0e-04); it was not significant in EVC (p = 3.8e-01) or LO (p = 4.0e-01). word2vec significantly predicted responses in all ROIs, including EVC (p = 2.4e-03), LO (p = 2.0e-04), pFs (p = 2.0e-04), OPA (p = 2.0e-04), PPA (p = 4.0e-04), and RSC (p = 2.0e-04), with relatively larger effects in LO and pFs.
- Whole-brain maps (Fig. 6) showed object2vec prediction accuracy peaking in right anterior PPA and adjacent parahippocampal cortex, with additional clusters in OPA and RSC. word2vec had prominent clusters adjacent to and overlapping the object2vec cluster but extending laterally into ventral temporal cortex (pFs) and lateral occipital cortex (LO), plus clusters in OPA and RSC.
- Direct comparison (preference map, Fig. 7A): object2vec > word2vec in a right anterior PPA cluster; word2vec > object2vec in more lateral ventral regions overlapping pFs and LO.
- Relationship to category selectivity (Fig. 7B): Voxel-wise differences in model accuracy (object2vec − word2vec) correlated with scene−object selectivity differences in 3 of 4 subjects (Pearson r = 0.51, 0.20, −0.17, 0.29), suggesting that voxels better predicted by object2vec tend to be more scene-selective, whereas voxels better predicted by word2vec tend to be more object-selective (exploratory; not consistent across all subjects).
- The spatial-properties model (real-world size, spatial stability) significantly predicted responses in scene-selective ROIs (PPA, OPA, RSC; all p = 2.0e-04) and in pFs (p = 2.0e-04), but was weaker and non-significant in EVC (p = 1.4e-01) and LO (p = 8.7e-01). Preference mapping (Fig. 9): spatial properties > object2vec across much of posterior PPA, OPA, and RSC, whereas object2vec > spatial properties in a cluster overlapping anterior PPA and extending into parahippocampal cortex beyond PPA.
- PCA of voxel tuning (Fig. 10): Principal components of regression weights revealed broad indoor–outdoor distinctions and finer thematic structure (e.g., electronics/appliances vs. other indoor items; natural vs. man-made outdoor elements), with mixed selectivity consistent with encoding diverse contextual associations in a low-dimensional space.
Discussion
Findings support the hypothesis that visual cortex encodes the latent statistical structure of object co-occurrence: viewing single, isolated objects elicited responses predictable from the statistical ensembles in which those objects typically appear. This contextual coding localized most strongly to anterior PPA and neighboring parahippocampal cortex, indicating a key role for these regions in linking objects to their visual contexts, beyond mere spatial properties. In contrast, linguistic co-occurrence (word2vec) better predicted responses in object-selective cortex (pFs, LO), suggesting that distributional structures from language map preferentially onto object-selective regions, potentially reflecting taxonomic/shape-related or abstract semantic information. Together, the results imply partially distinct but overlapping cortical mappings of visual and linguistic regularities in object processing. The results also align with efficient coding accounts in which high-level visual regions compress behaviorally relevant natural statistics into low-dimensional representations, explaining overlapping variance between contextual statistics and spatial properties (e.g., real-world size). Exploratory analyses indicate that voxels’ preference for visual vs linguistic co-occurrence relates to scene vs object selectivity, though not uniformly across subjects.
Conclusion
This work introduces object2vec, a low-dimensional representation of visual object context learned from annotated natural scenes, and shows that these visual co-occurrence statistics predict object-evoked responses in scene-selective cortex, particularly right anterior PPA and adjacent parahippocampal cortex. Language-based word2vec better predicts responses in object-selective regions, revealing a dissociation between how visual and linguistic regularities are reflected in cortical object representations. The study supports the view that high-level visual cortex encodes efficient, statistically grounded representations that integrate contextual associations. Future research directions proposed include: modeling natural statistics of object spatial locations and typical positions; investigating how contextual statistics are learned and potentially transferred from medial temporal memory systems to visual cortex; and integrating image-based and language-based embeddings to build richer multimodal models of object semantics.
Limitations
- Small sample size (n = 4), reflecting a focus on extensive within-subject data; generalizability to larger populations remains to be established.
- Some analyses were post hoc and exploratory (e.g., the voxel-wise correlation between model preference and category selectivity; PCA interpretations), with trends not consistent across all subjects.
- ROI p-values were reported as uncorrected in figures for some tests; whole-brain voxel-wise maps used FDR correction, but caution is warranted when interpreting multiple comparisons across analyses.
- The approach models co-occurrence at the category level using annotations (bag-of-objects) and does not incorporate spatial-arrangement statistics; effects of positional regularities were therefore not tested.
- The linguistic vs. visual dissociation cannot disentangle whether word2vec predictions in object-selective cortex reflect abstract semantics or perceptual correlates (e.g., shape similarities).