
Computational models of category-selective brain regions enable high-throughput tests of selectivity

N. A. R. Murty, P. Bashivan, et al.

Discover groundbreaking research by N. Apurva Ratan Murty, Pouya Bashivan, and colleagues as they unveil artificial neural network-based encoding models that predict brain responses to images with unprecedented accuracy, validating domain-specific theories in human cognition. This innovative approach enhances our understanding of how we perceive faces, places, and bodies, paving the way for future explorations in cognitive neuroscience.

Introduction

The study addresses whether classic category-selective regions in human ventral visual cortex—the fusiform face area (FFA), parahippocampal place area (PPA), and extrastriate body area (EBA)—are truly selective for faces, places, and bodies, respectively, when tested against the vast space of possible images. Traditional, word-based definitions of selectivity are not image-computable, lack quantitative precision, and remain susceptible to falsification if non-preferred images are found to maximally drive a region. Leveraging advances in deep convolutional artificial neural networks (ANNs) that approximate primate ventral stream processing, the authors aim to build image-computable encoding models that can quantitatively predict fMRI responses to novel images in FFA, PPA, and EBA, generalize across participants, and enable strong, high-throughput tests of category selectivity by screening and synthesizing images predicted to maximally activate each region. The work further evaluates whether ANN-based models surpass descriptive category models and human expert intuitions in predicting regional responses.

Literature Review

Prior work has established cortical regions selective for faces (FFA), places (PPA), and bodies (EBA), with distinct developmental trajectories and computational roles. Deep ANNs trained on object recognition have approached human-level performance and exhibit internal representations that align with the ventral visual stream hierarchy. Earlier studies have also shown that ANN activations can linearly predict neural responses in monkey and human visual cortex and that their representational similarity structure resembles that of cortical data. However, it remained unclear how these models relate to classic category-selectivity theories and whether they provide substantive advances beyond existing descriptive models. The paper situates itself within this literature by combining high-quality fMRI with integrative benchmarking of multiple ANNs, comparing against human novices and experts, and extending to large-scale stimulus screening and synthesis to test selectivity claims.

Methodology

Participants and fMRI: Four adult participants (N=4; 2 females) underwent five scanning sessions on a Siemens 3T Prisma with 32-channel head coil. Functional EPI parameters: TR=2000 ms, TE=30 ms, 2 mm isotropic voxels, 52 slices, flip angle 90°, echo-spacing 0.54 ms, partial Fourier 7/8, whole-brain coverage. High-resolution T1 multi-echo MPRAGE (1 mm isotropic) was acquired.

Localizer: A dynamic localizer identified FFA, PPA, and EBA in each subject via 18-s blocks of short videos from five categories (faces, bodies, scenes, objects, scrambled objects) in palindromic order across five runs.

Event-related main experiment: 185 naturalistic full-field, unsegmented color images (25 faces, 50 bodies, 50 scenes, 65 objects), mostly from THINGS, were presented at 8 degrees of visual angle (dva) for 300 ms with a jittered ISI of 3.7–11.7 s. Each image was repeated ≥20 times per participant across four sessions (~10 h per participant). Each session comprised ten runs, and 15 “normalizer” images repeated in every session supported across-session normalization.

Preprocessing and GLM: FreeSurfer was used for slice-timing correction, motion correction, alignment to anatomy, and 5 mm FWHM smoothing. GLMdenoise, which is optimized for event-related fMRI, estimated beta parameters per image and improved test–retest reliability. Responses were normalized per 100-image group using the session’s 15 normalizer images (subtract mean, divide by SD).
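
The session normalization described above amounts to a z-score whose mean and SD come from the normalizer images only. The following is a minimal sketch, assuming one session's per-image betas are held in a dictionary and that the sample standard deviation is used (the summary does not say which estimator); `normalize_session` and the image names are hypothetical.

```python
from statistics import mean, stdev


def normalize_session(betas, normalizer_ids):
    """Z-score one session's betas using only the normalizer images.

    betas: dict mapping image id -> beta estimate for this session.
    normalizer_ids: ids of the images repeated in every session.
    """
    reference = [betas[i] for i in normalizer_ids]
    mu = mean(reference)
    sd = stdev(reference)  # assumption: sample SD; the source only says "SD"
    return {image: (beta - mu) / sd for image, beta in betas.items()}


# Toy session: three normalizer images plus two session-specific images.
session = {"norm_a": 1.0, "norm_b": 2.0, "norm_c": 3.0, "cat": 4.0, "barn": 0.0}
normalized = normalize_session(session, ["norm_a", "norm_b", "norm_c"])
```

Because every session is rescaled against the same repeated images, responses from different sessions end up on a comparable scale.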

Encoding models: A screen of 60 computational models (including pixel, V1-like Gabor, and multiple deep ANNs such as AlexNet, ResNets, and CORnet variants) was conducted. For each model layer, ridge regression (λ=0.01) mapped model features to fROI mean responses with 5-fold cross-validation, and the base model was selected by highest cross-validated predictivity across six fROIs (left/right FFA, EBA, PPA). The best layer for each fROI was chosen using distinct neural data from the homologous contralateral fROI to avoid circularity.
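
The layer-screening step can be illustrated with a closed-form ridge fit and k-fold cross-validation. This is a sketch under stated assumptions rather than the authors' code: features are taken to be a precomputed (images × features) NumPy array, the intercept is handled by centering, and the function names are hypothetical. Only λ=0.01 and the 5 folds come from the description above.

```python
import numpy as np


def ridge_fit(X, y, lam=0.01):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc'Xc + lam * I) w = Xc'yc
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def cross_validated_predictivity(X, y, lam=0.01, n_folds=5, seed=0):
    """Pearson r between observed responses and held-out predictions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    predictions = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        w, b = ridge_fit(X[train_idx], y[train_idx], lam)
        predictions[test_idx] = X[test_idx] @ w + b
    return np.corrcoef(predictions, y)[0, 1]
```

Running `cross_validated_predictivity` once per candidate layer and keeping the highest score mirrors the screening logic; every image is predicted only by a model that never saw it.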

Final mapping used a two-stage linear function: (1) learn a spatial mask over feature map width/height to code “where”; (2) average spatially and apply learned channel weights to code “what.” Two regularization hyperparameters (spatial and channel) were grid-searched over 6 log-spaced values in [0.01, 100], with selection again based on homologous contralateral fROI data. Final performance used 10-fold cross-validation on held-out images.
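
The forward pass of such a two-stage mapping can be written in a few lines. The sketch below assumes a feature tensor of shape (images, channels, height, width); the function name, the bias term, and the way the two penalties are paired are illustrative assumptions, and the fitting procedure itself is not shown.

```python
import numpy as np


def factorized_readout(features, spatial_mask, channel_weights, bias=0.0):
    """Predict one response per image from a (N, C, H, W) feature tensor.

    Stage 1 ("where"): weight every location by spatial_mask, shape (H, W).
    Stage 2 ("what"): average over space, then mix channels with
    channel_weights, shape (C,).
    """
    h, w = features.shape[2:]
    pooled = np.einsum("nchw,hw->nc", features, spatial_mask) / (h * w)
    return pooled @ channel_weights + bias


# Six log-spaced candidates in [0.01, 100] for each of the two penalties.
penalty_grid = np.logspace(-2, 2, 6)
hyperparameter_pairs = [(s, c) for s in penalty_grid for c in penalty_grid]
```

Factorizing the readout this way needs H·W + C weights instead of C·H·W, which matters when only 185 images are available for fitting.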

Generalization tests: Models were evaluated on pooled fROI means across subjects, individual subjects, and across-subject generalization (train on N−1 subjects, test on the held-out subject). Voxel-wise predictivity and population representational dissimilarity matrix (RDM) analyses were also performed.
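
The across-subject test can be sketched as a leave-one-subject-out loop. This example assumes that the training target is the image-by-image mean of the remaining subjects, which the summary does not state explicitly; the function name and data layout are hypothetical.

```python
def leave_one_subject_out(responses):
    """Yield (held_out, train_mean, test) for every subject.

    responses: dict mapping subject id -> list of per-image responses,
    with the same image order for every subject.
    """
    for held_out, test in responses.items():
        training = [r for s, r in responses.items() if s != held_out]
        # Assumption: training subjects are averaged image by image.
        train_mean = [sum(vals) / len(vals) for vals in zip(*training)]
        yield held_out, train_mean, test
```

With four participants this yields four splits, each fitting on three subjects and evaluating on the one left out.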

Behavioral comparisons: Novices (Amazon Mechanical Turk) rated how face-, body-, or scene-like each image was via a drag–rate arrangement. Retaining only raters with test–retest reliability ≥0.85 left N=106 (face), 115 (body), and 60 (scene). Experts (10 senior researchers) predicted FFA/EBA/PPA response magnitudes for each image with high test–retest reliability (~0.96–0.97). Model-to-brain comparisons were cross-validated across both images and subjects for fairness.

High-throughput image screening: Using the encoding models, 3,450,194 images were screened from VGGFace2 (≈599,900 faces), ImageNet (1,281,167 images, 1000 categories), and Places2 (1,569,127 images, ~400 categories). Predicted response distributions were analyzed and the top predicted images were visually inspected: top 5,000 per fROI, and subsampled top-100,000 (2 per thousand; 200 total) per fROI.
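
The screening arithmetic and ranking can be checked with a short sketch. The dataset sizes are the ones reported above; `top_k` and `subsample_ranked` are hypothetical helpers, and evenly spaced subsampling is an assumption, since the summary only gives the rate of 2 per thousand.

```python
import heapq

# Dataset sizes as reported above; their sum is the screened pool.
dataset_sizes = {"VGGFace2": 599_900, "ImageNet": 1_281_167, "Places2": 1_569_127}
total_screened = sum(dataset_sizes.values())


def top_k(predictions, k):
    """Return the k (image_id, score) pairs with the highest predicted response."""
    return heapq.nlargest(k, predictions.items(), key=lambda item: item[1])


def subsample_ranked(ranked, per_thousand=2):
    """Keep per_thousand items out of every 1,000 ranked items, evenly spaced."""
    step = 1000 // per_thousand
    return ranked[::step]
```

Applied to a ranked list of 100,000 images, `subsample_ranked` keeps every 500th entry, which gives the 200 images per fROI that were inspected.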

Category removal control: For FFA and PPA, labeled face images (VGGFace2) and place images (Places2) were removed; models then reported new top-5,000 images from the remaining pools.

Control simulation: In AlexNet conv5, putative “face units” were localized using the dynamic localizer images; a ResNet-50-based computational model predicted responses of these units to the 185 images; top-predicted images were assessed for being faces.

Image synthesis: Following a differentiable pipeline (BigGAN prior → encoder ResNet-50 → fROI mapping), images were synthesized to maximize predicted fROI activation. BigGAN (256×256) latent z truncated at 0.5; softmax class variable initialized at aS(n) with a=0.05; 30,000 optimization steps with Adam (lr=0.001).
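
The optimization loop behind this kind of synthesis is gradient ascent on a latent vector. The sketch below is a toy illustration only: a small tanh generator and a linear readout stand in for BigGAN, the ResNet-50 encoder, and the fROI mapping, and the gradient is computed analytically instead of by automatic differentiation. Only the Adam optimizer, the learning rate of 0.001, and the 30,000 steps are taken from the description above; the Adam decay constants are the common defaults and are an assumption here.

```python
import numpy as np


def synthesize_latent(A, w, z0, n_steps=30_000, lr=0.001,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """Gradient ascent with Adam on a latent z to maximise a predicted response.

    Toy stand-ins (assumptions, not the authors' networks):
      generator  g(z) = tanh(A @ z)   plays the role of the image prior,
      response   r(z) = w @ g(z)      plays the role of encoder + fROI mapping.
    """
    z = z0.astype(float).copy()
    m = np.zeros_like(z)
    v = np.zeros_like(z)
    for t in range(1, n_steps + 1):
        g = np.tanh(A @ z)
        grad = A.T @ (w * (1.0 - g ** 2))  # analytic gradient of r w.r.t. z
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        z += lr * m_hat / (np.sqrt(v_hat) + eps)  # "+" because we maximise
    return z


def response(A, w, z):
    """Predicted response of the toy model for latent z."""
    return float(w @ np.tanh(A @ z))
```

In a real pipeline the analytic gradient would be replaced by backpropagation through the generator, encoder, and mapping, but the update rule on the latent is the same.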

Feature attribution: A RISE-like occlusion method applied 2,000 random masks per image; predicted responses to masked images were combined into importance maps to infer stimulus features driving each fROI’s responses.
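
A RISE-style importance map is a weighted sum of random masks, where each mask is weighted by the model's response to the masked image. The following sketch makes several simplifying assumptions: a single-channel image, coarse binary masks upsampled by block replication rather than the smoothed, shifted masks of the original RISE method, and a hypothetical `predict` callable standing in for the encoding model.

```python
import numpy as np


def occlusion_importance(image, predict, n_masks=2000, grid=4,
                         keep_prob=0.5, seed=0):
    """Estimate a per-pixel importance map from responses to masked images.

    image: 2-D array (H, W) with H and W divisible by grid.
    predict: callable mapping an image to a scalar predicted response.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    cell = np.ones((h // grid, w // grid))
    importance = np.zeros((h, w))
    for _ in range(n_masks):
        coarse = (rng.random((grid, grid)) < keep_prob).astype(float)
        mask = np.kron(coarse, cell)  # nearest-neighbour upsampling to (H, W)
        importance += predict(image * mask) * mask
    # Normalise by the expected number of times each pixel was left visible.
    return importance / (n_masks * keep_prob)
```

Pixels that tend to be visible whenever the predicted response is high accumulate large values, so the map highlights the image regions the model relies on.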

Evaluation metrics: Neural predictivity was Pearson correlation between observed and cross-validated predicted responses across images. Noise ceilings were estimated via split-half Spearman–Brown corrected reliabilities. RSA used Spearman correlation between observed and predicted Euclidean-distance RDMs.
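
Two of these metrics are simple enough to spell out. The sketch below implements the Pearson correlation and the Spearman–Brown correction, which projects a split-half correlation r to full-data reliability as 2r / (1 + r). The function names are hypothetical, and how the halves are formed (for example, odd versus even repetitions) is not specified in the summary.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Project a split-half correlation to the reliability of the full data."""
    return 2 * r_half / (1 + r_half)


def noise_ceiling(half_a, half_b):
    """Split-half reliability across images, Spearman-Brown corrected."""
    return spearman_brown(pearson(half_a, half_b))
```

The correction is needed because each half contains only half of the repetitions and is therefore noisier than the full dataset that the models are asked to predict.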

Key Findings

ANN encoding models accurately predict image-level responses in FFA, EBA, and PPA. Several deep ANNs achieve high cross-validated neural predictivity (>0.8), surpassing pixel and V1-like models; deeper/recurrent models perform better than shallower ones.
Models trained on broad natural image datasets (ImageNet/Places) outperform models trained on narrow domains (e.g., faces) when architecture is controlled (e.g., ResNet-50 variants).
Trained networks far outperform randomly initialized ones.
ResNet-50-based models yielded strong correlations between predicted and observed responses for each fROI and hemisphere. Within-category predictive power was significant (mean ± s.e.m across fROIs and categories: 0.56 ± 0.03; P=1.19×10^-7), exceeding within-category shuffled controls (P=0.03).
Generalization across subjects: Training on three subjects and testing on the held-out subject achieved average correlations >0.78 for all fROIs (mean ± s.e.m 0.82 ± 0.01; each P<0.00005). Models trained and tested within a single subject also performed highly (mean ± s.e.m 0.83 ± 0.01; each R>0.78). Models trained on one subject generalized to others (mean ± s.e.m 0.79 ± 0.01; each R>0.76; P<0.00005).
Voxel-wise and population RDM analyses: Model predictivity at the voxel level and RDM similarity strongly matched fROI mean-response results (Spearman R=0.99 across models), with fROI means showing higher absolute predictivity due to averaging/SNR.
Models outperform humans: ANN models (mean ± s.e.m 0.82 ± 0.01) exceeded novices (0.64 ± 0.04; P=0.03) and experts (0.77 ± 0.01; P=0.03) at predicting fROI responses. Within-category image-level predictions: ANN 0.39 ± 0.03 vs experts 0.23 ± 0.02 vs novices 0.12 ± 0.03 (P=2.1×10^-5 ANN>experts; P=1.8×10^-5 ANN>novices; P=4.4×10^-5 experts>novices). For hypothesized preferred categories: ANN 0.50 ± 0.04 vs experts 0.17 ± 0.07 vs novices 0.03 ± 0.10 (all P=0.03 pairwise).
Strong tests of selectivity via screening 3.45M images: For each region (FFA/EBA/PPA), all top-predicted 5,000 images were unambiguous members of the hypothesized preferred category; likewise for 200 images sampled from the top 100,000. After removing labeled faces (VGGFace2) and places (Places2), FFA and PPA models still selected only faces and places respectively among new top-5,000.
Control simulation on ANN “face units” (AlexNet conv5): Despite localizer-based selection, 85% of top-predicted images were not faces, demonstrating the method can falsify selectivity in principle and highlighting stronger selectivity in human FFA than these ANN units.
GAN-based synthesis produced naturalistic images predicted to maximally activate each fROI; synthesized images were recognizable as faces (FFA), bodies (EBA), and scenes (PPA).
Feature attribution (RISE): Importance maps implicated eyes and nose for FFA; hands and torsos (not faces) for EBA; side walls and perspective cues for PPA. In complex scenes, models localized expected regions (faces/bodies/background structure) driving each fROI.

Discussion

The results show that deep ANN-based encoding models provide accurate, image-computable predictions of responses in category-selective regions (FFA, EBA, PPA), generalize across individuals, and exceed both descriptive category-based accounts and human experts’ intuitions, especially at image-level granularity within categories. These models enable high-throughput, computationally precise tests of category selectivity over millions of images and via synthesis, substantially strengthening evidence that FFA, PPA, and EBA are indeed selectively driven by faces, places, and bodies, respectively. The ability to outperform experts suggests the models capture nuanced, fine-grained regularities about these regions not captured by word-level theories. Importantly, ANN models complement rather than replace word models, providing image-computable instantiations that interface with broader cognitive, developmental, and evolutionary theories. The close correspondence between fROI mean and voxel-wise/RDM metrics supports the robustness of the approach across representational grains. The work highlights the value of integrative benchmarking across models and fROIs and establishes a framework for testing and refining neurally mechanistic models that can be shared and evaluated across independent datasets and participant populations.

Conclusion

This study delivers high-accuracy, image-computable encoding models for FFA, EBA, and PPA that generalize across participants and outperform human experts and descriptive models. Leveraging these models for large-scale image screening and GAN-based synthesis provided the strongest tests to date of category selectivity, with top predicted images consistently matching each region’s hypothesized preferred category. The approach also enables interpretable insights into stimulus features driving each region. Future work should improve within-category predictions, incorporate biologically inspired constraints and training regimens, extend to diverse and abstract stimulus domains, and model the full sequence of computations across the ventral stream, supported by community-wide integrative benchmarking platforms. These advances will further bridge computational models with cognitive-level theories and ultimately elucidate how visual representations are read out for behavior.

Limitations

Within-category prediction gaps indicate room for model improvement.
Models trained on natural images may not generalize well to abstract or symbolic stimuli (e.g., line drawings, cartoons, contextually defined faces) and are vulnerable to adversarial perturbations.
Potential non-independence between homologous hemispheric data could slightly inflate predictivity during model/layer selection, though cross-subject validations mitigate this for human comparisons.
Small fMRI sample size (N=4) limits generalizability, despite strong cross-subject replication.
Focus on three fROIs; broader coverage of ventral stream and other modalities would strengthen conclusions.
ANN models provide predictive mappings but not yet full mechanistic accounts across all processing stages.
High-throughput screening relies on available datasets; while synthesis expands the space, empirical validation of synthesized counterexamples was not possible because none were found.
