Psychology
High-level visual prediction errors in early visual cortex
D. Richter, T. C. Kietzmann, et al.
Predictive processing proposes that the brain continuously generates top-down predictions and compares them to sensory inputs, with mismatches yielding prediction errors that guide perceptual inference. Prior neuroimaging and electrophysiological work shows larger responses to unexpected than to expected stimuli (expectation suppression), but it is unknown what kind of surprise (low-level vs high-level visual features) prediction errors encode across the visual hierarchy. The central question is whether (1) visual prediction errors reflect feature tuning in a local, area-specific manner (e.g., low-level features in V1, high-level features in higher visual cortex), or (2) prediction errors are computed at higher levels and their surprise signal is broadcast to earlier visual areas, implying top-down inheritance of high-level feature tuning. To address this, the study combines fMRI with representational dissimilarities derived from a visual deep neural network (DNN) to quantify low-level (early layers) and high-level (late layers) visual feature surprise, and tests how prediction error magnitudes scale with each across the ventral visual stream.
- Expectation suppression (reduced activity for expected vs unexpected inputs) has been reported throughout the ventral visual stream and across species and modalities, often interpreted as larger prediction errors for unexpected stimuli within a predictive processing framework.
- Feature-specific prediction error accounts suggest local tuning: V1 for orientation/edges/contrast; higher visual cortex (HVC) for complex object features and categories. Indirect evidence exists for tuning-specific modulations, but direct tests of the feature content of prediction errors are scarce.
- Top-down inheritance accounts propose that predictions generated in higher areas are fed back to earlier areas, potentially causing prediction error tuning in early areas to reflect higher-level features. In macaques, identity-level surprise modulates responses in lower-level face areas, implicating feedback from IT.
- DNNs provide layer-wise feature spaces that map onto the ventral stream: early layers capture low-level features; late layers capture high-level, object-like representations. Prior RSA work shows alignment between DNN layers and cortical gradients.
- Open issues include whether prediction errors in humans primarily reflect low-level or high-level visual surprise, and whether any such effects generalize beyond faces and single-domain stimuli.
Participants: 33 healthy, right-handed adults (21 female; mean age 23.8 ± 4.5 years). Data from 7 of the 40 recruited participants were excluded (incomplete data, MRI quality issues, behavioral outliers).
Design and task: On each trial, a letter cue (500 ms) probabilistically predicted a specific image (500 ms), followed by a variable ITI (~5 s; 3–12 s). Each of the 8 images per participant was associated with one letter cue; given its cue, the expected image was 7× more likely than any single unexpected image (transition probability matrix, TPM: 7:1 against each of the 7 other images). Participants classified each image as animate or inanimate at image onset (maximum RT 1,500 ms). To promote learning and ensure attention to the cues, vowels served as no-go cues (no response required); no-go trials did not enter the analyses.
Stimuli: From a database of 233 natural color images, outliers were removed via hierarchical clustering in DNN layers; 213 remained. For each participant, 8 images (4 animate, 4 inanimate) were selected to maximize within-layer RDM variance (layers 2 and 8 of AlexNet trained on ecoset) and minimize between-layer RDM correlation. Letter–image mappings were randomized per participant.
Procedure: Two sessions (8 fMRI runs total; 4 per session). Each fMRI run had 128 trials: 56 expected (8 pairs × 7 reps), 56 unexpected (8×7 unique cue–unexpected pairs), and 16 no-go. Additional behavioral blocks outside the scanner (longer exposure to regularities) were included. A prediction-free functional localizer (two runs total: start of day 1, end of day 2) presented each image in 12 s miniblocks (500 ms on/300 ms off), including phase-scrambled controls; task: detect a brief brightness increase.
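The per-run trial counts follow directly from the cue–image contingencies described above. The following is a minimal sketch of that arithmetic, not the authors' stimulus code: the function name `build_run`, the convention that cue i predicts image i, the image assignment on no-go trials, and the simple shuffle are all assumptions made for illustration.

```python
import random
from collections import Counter

N_PAIRS = 8          # cue-image pairs per participant (from the text)
EXPECTED_REPS = 7    # repetitions of each expected pair per run (from the text)
N_NOGO = 16          # no-go (vowel cue) trials per run (from the text)


def build_run(seed=0):
    """Return a shuffled list of (cue, image, condition) tuples for one run.

    Hypothetical convention: cue i predicts image i; no-go trials use cue None.
    """
    trials = []
    for cue in range(N_PAIRS):
        # Expected trials: the cue is followed by its associated image.
        trials += [(cue, cue, "expected")] * EXPECTED_REPS
        # Unexpected trials: each of the 7 other images follows the cue once.
        trials += [(cue, img, "unexpected") for img in range(N_PAIRS) if img != cue]
    # No-go trials: which image is shown here is an arbitrary assumption.
    trials += [(None, i % N_PAIRS, "nogo") for i in range(N_NOGO)]
    random.Random(seed).shuffle(trials)
    return trials


run = build_run()
counts = Counter(cond for _, _, cond in run)
print(len(run), dict(counts))
```

Within each cue, the expected image appears 7 times and each of the 7 other images once, which reproduces the 7:1 ratio and the 56 + 56 + 16 = 128 trials per run.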
MRI acquisition: Siemens 3T Prisma/PrismaFit, 32-channel head coil. Functional T2*-weighted multiband-6 EPI (TR/TE = 1,000/34 ms, 66 slices, 2 mm isotropic voxels, A/P phase encoding, flip angle 60°). T1-weighted MP-RAGE (1 mm isotropic). Preprocessing used fMRIPrep 22.1.0 (motion correction, slice timing correction, coregistration, normalization to MNI152NLin2009cAsym, CompCor nuisance regressors), followed by high-pass filtering (128 s) and spatial smoothing (5 mm FWHM).
DNN feature models: AlexNet trained on ecoset. Representational dissimilarities (1 − correlation) were computed for all layers and averaged across 10 network instances. Layer 2 represented low-level visual features; layer 8 (pre-softmax) represented high-level features. Controls: layer 8 of an untrained (random) AlexNet; animacy category; semantic word category dissimilarity via word2vec embeddings (Google News 300-d) of category labels.
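The dissimilarity measure described above (1 − correlation, averaged across network instances) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each image is represented by a flattened activation vector, the helper names `pearson`, `rdm`, and `average_rdms` are hypothetical, and the toy vectors are invented values rather than AlexNet activations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def rdm(activations):
    """Representational dissimilarity matrix: 1 - Pearson r for each image pair."""
    n = len(activations)
    return [
        [0.0 if i == j else 1.0 - pearson(activations[i], activations[j])
         for j in range(n)]
        for i in range(n)
    ]


def average_rdms(rdms):
    """Element-wise mean over RDMs (e.g., one RDM per network instance)."""
    n = len(rdms[0])
    return [[sum(r[i][j] for r in rdms) / len(rdms) for j in range(n)]
            for i in range(n)]


# Toy activations for 3 images in two hypothetical network instances.
instance_a = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
instance_b = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 4]]
mean_rdm = average_rdms([rdm(instance_a), rdm(instance_b)])
```

In the toy data, images 0 and 1 are perfectly correlated in one instance (dissimilarity 0) and perfectly anti-correlated in the other (dissimilarity 2), so their averaged dissimilarity is 1.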
Localizer RSA: Searchlight RSA assessed the alignment of cortical RDMs with DNN layer RDMs during prediction-free blocks. The best-explaining DNN layer per voxel was identified (Kendall’s tau; threshold z > 3.1). Expected gradient: early visual cortex aligned to early DNN layers, higher regions to later layers.
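For a single searchlight, the comparison described above amounts to rank-correlating the off-diagonal entries of the neural RDM with those of each layer RDM and keeping the best layer. The sketch below assumes Kendall's tau-a (the text does not say which tau variant was used), uses hypothetical helper names, and runs on invented 3 × 3 RDMs; it omits the searchlight loop and the statistical thresholding.

```python
from itertools import combinations


def upper_triangle(rdm):
    """Off-diagonal upper-triangle entries of a symmetric RDM, as a flat list."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def best_layer(neural_rdm, layer_rdms):
    """Return (best layer, per-layer tau) for one searchlight's neural RDM."""
    neural = upper_triangle(neural_rdm)
    scores = {layer: kendall_tau_a(neural, upper_triangle(r))
              for layer, r in layer_rdms.items()}
    return max(scores, key=scores.get), scores


# Toy RDMs for 3 images (invented values).
neural = [[0, .2, .9], [.2, 0, .5], [.9, .5, 0]]
layers = {
    2: [[0, .1, .8], [.1, 0, .4], [.8, .4, 0]],   # same rank order as neural
    8: [[0, .7, .3], [.7, 0, .6], [.3, .6, 0]],   # reversed rank order
}
layer, taus = best_layer(neural, layers)
```

Tau-a is a common choice in RSA because it does not reward models that predict tied dissimilarities; other variants would change the handling of ties but not the overall logic.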
Univariate fMRI GLM (main task): Event-related GLMs modeled expected and unexpected image onsets (500 ms). Parametric modulators were added only to unexpected trials: the z-scored dissimilarity of the seen unexpected image relative to the trial’s expected image in DNN layer 2 (low-level) and layer 8 (high-level). Control modulators: animacy category (same/different vs expected), word2vec semantic distance, and untrained random layer 8. Nuisance regressors included motion parameters, framewise displacement (FD), and cerebrospinal fluid (CSF) and white matter (WM) signals. Cluster correction used Gaussian random field (GRF) theory with cluster-forming z ≥ 3.29 and cluster p < 0.05.
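The surprise modulator described above can be illustrated as a lookup in a layer RDM followed by z-scoring across unexpected trials. This is a sketch under assumptions: the function names are hypothetical, the RDM values are invented, and the population standard deviation is used for z-scoring (the source does not specify which estimator was applied).

```python
from statistics import mean, pstdev


def zscore(values):
    """Z-score using the population standard deviation (an assumption)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def surprise_modulator(trials, layer_rdm):
    """Z-scored dissimilarity between expected and seen image per unexpected trial.

    trials: list of (expected_image, seen_image) index pairs, unexpected trials only.
    layer_rdm: square dissimilarity matrix for one DNN layer.
    """
    for expected, seen in trials:
        if expected == seen:
            raise ValueError("modulators are defined for unexpected trials only")
    raw = [layer_rdm[expected][seen] for expected, seen in trials]
    return zscore(raw)


# Toy layer-8 RDM for 3 images and four unexpected trials (invented values).
layer8_rdm = [[0, .2, .9], [.2, 0, .5], [.9, .5, 0]]
trials = [(0, 1), (0, 2), (1, 2), (2, 0)]
modulator = surprise_modulator(trials, layer8_rdm)
```

Trials in which the seen image is far from the expected image in the chosen feature space receive large positive modulator values; the same function applied to a layer 2 RDM would yield the low-level modulator.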
ROI definitions and analyses: Three a priori ROIs—V1 (primary visual cortex), LOC (object-selective lateral occipital complex), and HVC (higher visual cortex; occipito-temporal/fusiform)—were defined anatomically and refined functionally. LOC was constrained by intact > phase-scrambled contrast from the localizer. Within each ROI, the 200 most stimulus-informative voxels (searchlight SVM decoding during the localizer) were selected; control analyses varied ROI size (100–500 voxels) and used alternative stimulus-driven masks. Parameter estimates for modulators were extracted and tested against zero and pairwise contrasted (FDR-corrected across ROIs/modulators). Variance inflation factors were computed to assess collinearity.
Whole-brain regression across DNN layers: Single-trial beta estimates (least squares separate) for unexpected trials were regressed on dissimilarity from each DNN layer (1–8) to map which layer best explained prediction error scaling (color-coded by highest explained variance; liberal threshold z ≥ 1.96).
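For one voxel, the layer-wise mapping described above can be approximated by regressing single-trial betas on each layer's dissimilarity values and keeping the layer with the highest explained variance. The sketch below is illustrative only: it assumes a simple one-predictor regression per layer (where explained variance equals the squared Pearson correlation), uses hypothetical function names and invented data, and ignores the sign of the relationship and any thresholding.

```python
def r_squared(x, y):
    """Explained variance of a one-predictor linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return (cov * cov) / (vx * vy)


def best_explaining_layer(betas, dissim_by_layer):
    """Return (layer with highest R^2, per-layer R^2) for one voxel.

    betas: single-trial response estimates for unexpected trials.
    dissim_by_layer: dict mapping layer number -> per-trial dissimilarity.
    """
    scores = {layer: r_squared(d, betas) for layer, d in dissim_by_layer.items()}
    return max(scores, key=scores.get), scores


# Toy data for one voxel and five unexpected trials (invented values).
betas = [1.0, 2.1, 2.9, 4.2, 5.0]
dissim = {
    2: [0.5, 0.1, 0.9, 0.2, 0.6],   # weakly related to the betas
    8: [0.1, 0.2, 0.3, 0.4, 0.5],   # strongly related to the betas
}
layer, scores = best_explaining_layer(betas, dissim)
```

Repeating this per voxel and coloring each voxel by its winning layer gives the kind of map the text describes.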
Decoding as a function of surprise: A multi-class SVM was trained on localizer single-trial betas to decode the 8 object identities and tested on main-task unexpected trials to obtain per-trial true-class probabilities; these probabilities were then regressed on layer 8 dissimilarity within each ROI.
Behavioral analyses: One-way repeated-measures ANOVAs and post hoc tests compared RTs and accuracies across conditions (expected, unexpected-same response, unexpected-different response), with Holm–Bonferroni corrections. Outlier exclusion criteria were applied to behavioral and MRI quality metrics.
Behavior
- Participants used predictive regularities, showing behavioral facilitation. RT ANOVA: F(1.4,43.3) = 23.5, p < 0.001, η² = 0.42; Accuracy ANOVA: F(1.4,46.3) = 7.8, p = 0.003, η² = 0.20.
- RTs: expected (501 ms) < unexpected-same (509 ms), t(32) = 2.41, p = 0.019, d = 0.14; expected (501 ms) < unexpected-different (524 ms), t(32) = 6.77, p < 0.001, d = 0.39; unexpected-same < unexpected-different, t(32) = 4.36, p < 0.001, d = 0.25.
- Accuracy: expected > unexpected-different, t(32) = 3.75, p = 0.001, d = 0.61; expected vs unexpected-same not significant; unexpected-same > unexpected-different, t(32) = 2.97, p = 0.008, d = 0.48. Overall accuracy >95%.
Localizer RSA (prediction-free)
- Clear ventral-stream gradient: early visual cortex best explained by early DNN layers; fusiform/HVC by later layers, validating layer 2 (low-level) and layer 8 (high-level) feature models for these stimuli.
Prediction error scaling (main imaging results)
- Whole-brain parametric modulation: Prediction error magnitudes (unexpected trials) scaled positively with high-level visual dissimilarity (layer 8) across visual cortex (EVC through HVC; example cluster size 779 voxels; 6,232 mm³). No significant modulation by low-level dissimilarity (layer 2) anywhere in visual cortex.
- ROI parametric modulation:
  - V1: Layer 8, t(32) = 6.79, p < 0.001, d = 1.18; Layer 2, W = 190, p = 0.159; Layer 8 > Layer 2: t(32) = 8.05, p < 0.001, d = 1.40.
  - LOC: Layer 8, W = 126, p = 0.017, d = 0.55; Layer 2, t(32) = −0.68, p = 0.504; Layer 8 > Layer 2: t(32) = 4.24, p < 0.001, d = 0.74.
  - HVC: Layer 8, t(32) = 2.59, p = 0.029, d = 0.45; Layer 2, t(32) = −0.70, p = 0.586; Layer 8 > Layer 2: t(32) = 2.85, p = 0.008, d = 0.50.
- Across DNN layers (whole-brain single-trial regression): Most voxels in EVC, LOC, and HVC showed largest effects for late layers (7–8), indicating preferential scaling of prediction errors with high-level visual surprise. Minor clusters reflected intermediate/early layers.
- Shape of scaling (ROI regression of BOLD on layer 8 dissimilarity): positive monotonic relationships in all ROIs: V1 t(32) = 9.35, p < 0.001 (mean r = 0.09); LOC t(32) = 3.27, p = 0.004 (mean r = 0.03); HVC t(32) = 2.05, p = 0.049 (mean r = 0.02).
Decoding as a function of surprise
- True class probability increased with high-level (layer 8) dissimilarity in V1, t(32) = 5.50, p < 0.001, d = 0.96 (mean r = 0.08), and HVC, t(32) = 3.89, p < 0.001, d = 0.68 (mean r = 0.05); not significant in LOC, t(32) = 0.96, p = 0.342.
Controls and robustness
- The high-level (layer 8) model outperformed low-level (layer 2), animacy category, word2vec semantic distance, and untrained random layer 8 in whole-brain contrasts and ROI analyses (strongest in V1; also evident in LOC/HVC). Negative modulations were observed for word2vec in V1 (p = 0.020, d = −0.60) and for animacy in LOC (p = 0.039, d = −0.51).
- Collinearity was low (VIFs < 5): layer 2 VIF = 1.42; layer 8 VIF = 1.88, arguing against variance partitioning artifacts.
- Control ROIs with stimulus-uninformative voxels showed no modulation by any model, indicating specificity to stimulus-selective voxels.
- Findings were robust across ROI definitions and sizes.
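The collinearity check mentioned among the controls rests on the variance inflation factor, VIF_j = 1 / (1 − R_j²), where R_j² is the variance in regressor j explained by the other regressors. The sketch below covers only the simplest case of two regressors, where R_j² reduces to the squared Pearson correlation and both regressors share the same VIF. Because the reported VIFs differ (1.42 vs 1.88), the actual computation presumably involved additional regressors, so this is an illustration of the formula and not a reconstruction of the authors' analysis. The function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Toy modulators for five trials (invented values, correlated at r = 0.8).
layer2_mod = [1, 2, 3, 4, 5]
layer8_mod = [2, 1, 4, 3, 5]
vif = vif_two_predictors(layer2_mod, layer8_mod)
low_collinearity = vif < 5   # threshold quoted in the text
```

Even a correlation of 0.8 between two modulators stays below the VIF < 5 criterion cited above, which shows how lenient that threshold is compared with the reported values below 2.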
Expectation suppression (complementary)
- Unexpected > expected showed canonical prediction-error-related activations in visual cortex and additional regions (e.g., anterior insula, inferior frontal gyrus), consistent with prior literature.
The study resolves a key ambiguity in predictive processing accounts by demonstrating that the magnitude of visual prediction errors scales with high-level visual feature surprise, not low-level feature mismatch, across the ventral stream, including V1. Despite V1’s feedforward tuning for low-level features (confirmed in the prediction-free localizer), its prediction-error-related modulation followed high-level feature distances, supporting a top-down inheritance account in which high-level predictions or surprise signals computed in later visual areas are broadcast to earlier areas. This provides converging human fMRI evidence consistent with macaque studies and indicates that predictive modulations can arise rapidly after brief statistical learning. Functionally, greater high-level surprise not only increased BOLD responses but also enhanced decodability of object identity in V1 and HVC, suggesting that larger mismatches trigger stronger and sharper representational states, potentially reflecting extended recurrent processing to resolve recognition under violated predictions. Control analyses argued against confounds from task-relevant animacy, semantic (word-level) surprise, and DNN architectural biases, and against global arousal/attention explanations for the observed scaling profile. The absence of parametric surprise scaling outside visual cortex in the main contrasts further emphasizes a perceptual inference locus. The results indicate that predictive top-down signals shape early sensory processing by constraining interpretations based on high-level expectations. While flexible tuning is plausible (e.g., different tasks/stimuli emphasizing low-level features might elicit low-level prediction-error scaling), for naturalistic stimuli with hierarchical structure, high-level predictions appear to dominate.
These findings reinforce hierarchical predictive processing frameworks by identifying the feature level of surprise encoded in prediction errors and by dissociating feedforward tuning from predictive modulation in early visual areas.
This work shows that visual prediction errors across the ventral stream, including primary visual cortex, scale monotonically with high-level visual feature dissimilarity between expected and observed inputs. In prediction-free contexts, cortical representations follow a classic low-to-high feature gradient; under predictive contexts, modulation of responses reflects high-level surprise even in early areas, indicating top-down inheritance/broadcast of surprise signals. These findings strengthen hierarchical predictive processing theories by specifying that high-level predictions constrain early sensory processing to support perceptual inference. Future directions include: (1) testing the flexibility of prediction-error tuning under tasks emphasizing low-level features or different stimulus classes; (2) employing improved low-level and semantic models to probe alternative feature spaces; (3) examining temporal dynamics with methods offering higher temporal resolution; (4) identifying sources and mechanisms of feedback (e.g., higher visual areas, perirhinal cortex) and their roles in learning and attentional gain control; and (5) exploring individual differences and neuromodulatory influences on predictive signaling.
- Feature model choice: Although early DNN layers explained EVC responses in the localizer, a different or more refined low-level model or stimulus set might have revealed low-level prediction-error scaling. However, neither whole-brain nor ROI analyses showed subthreshold positive effects for low-level surprise.
- DNN layer correlations: Neighboring layers share variance, limiting simultaneous inclusion of many layers in a single GLM; primary analyses focused on layers 2 (low-level) and 8 (high-level). Control analyses with adjacent layers yielded qualitatively similar results.
- Semantic model limits: word2vec may not be state-of-the-art; superior semantic feature models or tasks could reveal semantic surprise effects, though such effects are less expected in early visual cortex.
- Interpretability of DNN features: Precise features are difficult to specify; visualization/decoding suggest layer 2 (orientation-like) vs layer 8 (texture/category-like), but handcrafted feature models might further refine interpretations.
- ROI differences: Larger effects in V1 than LOC/HVC could reflect functional sensitivity or neurovascular/SNR differences; causality cannot be inferred.
- Attention/arousal: While analyses mitigate attention-based accounts (unexpected-only comparisons, control models), generic arousal contributions cannot be entirely ruled out; attention likely gates predictive effects.
- Generalizability: Findings pertain to naturalistic stimuli with hierarchical structure; tasks prioritizing low-level features may yield different tuning profiles.