Psychology
High-level visual prediction errors in early visual cortex
D. Richter, T. C. Kietzmann, et al.
Predictive processing theories posit that the brain minimizes prediction errors by comparing top-down expectations to bottom-up inputs across a cortical hierarchy. A central open question is what kind of visual features—low-level (e.g., edges, orientation) versus high-level (e.g., object identity, parts, textures)—are reflected in prediction error signals at different stages, particularly in early visual cortex (V1). Two hypotheses are contrasted: (1) prediction errors are locally tuned and mirror the feature preferences of each area; (2) prediction errors are computed at higher levels and broadcast downward, leading earlier areas to reflect high-level surprise. The study investigates whether visual prediction errors scale with low-level or high-level visual feature dissimilarity after statistical learning, and whether early visual cortex inherits high-level tuning via feedback.
Prior work shows expectation suppression—attenuated responses to expected stimuli—across the ventral stream and modalities, often interpreted as reduced prediction error (e.g., Egner et al., Kok et al., Richter & de Lange). Feature-specific modulation has been observed, but direct evidence on which surprise features are encoded is sparse. Studies in macaques indicate top-down inheritance, with low-level areas showing sensitivity to high-level changes (e.g., identity-related surprise in ML; Schwiedrzik & Freiwald). DNNs have demonstrated explanatory power for representational gradients in the ventral stream, suggesting alignment of early layers with low-level features and later layers with high-level features. Recent findings in V1 also link responses to high-level predictability. Collectively, these motivate testing whether prediction errors scale with high-level versus low-level visual feature dissimilarity and whether this extends to human early visual cortex.
- Participants: 33 healthy, right-handed adults (21 female; mean age 23.8 ± 4.5 years). Exclusions: 2 for incomplete datasets, 2 for MRI quality, 3 for behavioral performance. Ethics approval: CMO Arnhem-Nijmegen (2014/288).
- Task and stimuli: naturalistic full-color images (8 per participant: 4 animate, 4 inanimate) drawn from a curated set of 213 images. Each trial: a predictive letter cue (500 ms) followed by an image (500 ms). A transitional probability matrix (TPM) associated each cue with one image, making the expected image 7× more likely than any other (per run: 56 expected, 56 unexpected, 16 no-go trials). Participants categorized each image as animate or inanimate; responses were withheld on vowel cues (no-go). Intertrial interval: mean ~5,000 ms (range 3,000–12,000 ms). Eight fMRI runs across two sessions, plus additional behavioral blocks (longer TPM exposure) to aid learning.
- Functional localizer: block design flashing a single image (500 ms on, 300 ms off) for 12,000 ms per miniblock; provided independent data for RSA, ROI refinement (object-selective LOC via intact > scrambled), and decoding voxel selection.
- MRI acquisition: Siemens 3T Prisma/PrismaFit, 32-channel head coil; T2*-weighted multiband-6 EPI (TR = 1,000 ms, TE = 34 ms, 66 slices, 2 mm isotropic, A/P phase encoding); anatomical T1w MP-RAGE (1 mm isotropic).
- Preprocessing: fMRIPrep 22.1.0 (N4 bias correction, skull-stripping, segmentation, surface reconstruction, MNI152NLin2009cAsym normalization; motion and slice-time correction; CompCor; confounds including FD/DVARS; Lanczos resampling), followed by high-pass filtering (128 s) and spatial smoothing (5 mm FWHM).
- DNN feature models: AlexNet trained on ecoset; representational dissimilarity matrices (RDMs) from layers 1–8 using correlation distance, averaged across 10 network instances. Primary models: layer 2 (low-level, Gabor-like orientation tuning) and layer 8 (pre-softmax, high-level texture/category tuning). Controls: layer 8 of an untrained (random) network, word2vec semantic dissimilarity (Google News, 300 dimensions), and animacy category dissimilarity (same vs. different).
- RSA: searchlight RSA (6 mm radius) on localizer data to confirm DNN–fMRI alignment in a prediction-free context; Kendall's tau correlations between neural and DNN layer RDMs; the best-fitting layer determined per voxel.
- Univariate fMRI analysis: event-related GLMs (double-gamma HRF) with separate regressors for expected and unexpected images, plus z-scored parametric modulators capturing the dissimilarity of each unexpected image to the expected stimulus under layer 2, layer 8, and control models (animacy, word2vec, random layer 8); a sketch of the RDM, RSA, and modulator computations follows this list. No-go trials were modeled as nuisance; first-order temporal derivatives were included. Nuisance regressors: 6 motion parameters, FD, CSF, WM. Second level: fixed effects across runs, mixed effects across participants, whole-brain GRF cluster correction (z ≥ 3.29, cluster p < 0.05).
- Regression analyses: least-squares-separate single-trial estimates; BOLD amplitudes and decoding metrics regressed onto layer 8 dissimilarity within ROIs.
- ROI definition: anatomical masks (FreeSurfer) for V1, LOC (lateral occipital gyri), and HVC (fusiform and occipitotemporal sulci), refined by the localizer (intact > scrambled) and by selecting the 200 most informative voxels via searchlight decoding; robustness checks across ROI sizes (100–500 voxels) and alternative stimulus-driven masks.
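As a concrete illustration of the RDM, RSA, and parametric-modulator computations described above, here is a minimal Python sketch. The activation matrices, their dimensionalities, and the trial indexing are synthetic stand-ins, not the authors' code; only the operations mirror the described pipeline (correlation-distance RDMs per layer, Kendall's tau RSA fits, and z-scored dissimilarity of each unexpected image relative to the expected one).

```python
# Minimal sketch: layer-wise RDMs from DNN activations, an RSA fit, and the
# per-trial parametric modulator. All inputs below are synthetic stand-ins.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau, zscore

def layer_rdm(activations):
    """Correlation-distance RDM (n_images x n_images) for one layer."""
    return squareform(pdist(activations, metric="correlation"))

def rsa_fit(neural_rdm, model_rdm):
    """Kendall's tau between the lower triangles of two RDMs."""
    idx = np.tril_indices_from(neural_rdm, k=-1)
    tau, _ = kendalltau(neural_rdm[idx], model_rdm[idx])
    return tau

def parametric_modulator(rdm, expected_idx, unexpected_idx):
    """z-scored dissimilarity of each unexpected image to the expected one."""
    return zscore(rdm[expected_idx, unexpected_idx])

# Synthetic stand-ins for per-layer activations of the 8 images:
rng = np.random.default_rng(0)
acts = {"layer2": rng.normal(size=(8, 4096)),   # early conv features (stand-in)
        "layer8": rng.normal(size=(8, 565))}    # pre-softmax outputs (stand-in)
rdms = {name: layer_rdm(a) for name, a in acts.items()}

# Modulator for trials where the cue predicted image 0 but images 1..7 appeared:
pm_l8 = parametric_modulator(rdms["layer8"], 0, np.arange(1, 8))
```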
- Behavioral: Prediction facilitated performance (RT ANOVA: F(1.4,43.3)=23.5, p<0.001, η²=0.42; accuracy ANOVA: F(1.4,46.3)=7.8, p=0.003, η²=0.20). Expected images were categorized faster (501 ms) than unexpected images of the same animacy category (509 ms; t(32)=2.41, p=0.019, dz=0.14) and of the different category (524 ms; t(32)=6.77, p<0.001, dz=0.39); unexpected same vs. different: t(32)=4.36, p<0.001, dz=0.25. Accuracy dropped for unexpected-different relative to expected images (t(32)=3.75, p=0.001, dz=0.61), with no difference between expected and unexpected-same (p=0.43); unexpected same vs. different: t(32)=2.97, p=0.008, dz=0.48.
- Prediction-free localizer RSA: Confirmed ventral stream gradient—early cortex aligned with early DNN layers (low-level), higher areas with late layers (high-level).
- Main fMRI: Prediction error magnitudes scaled with high-level visual dissimilarity (layer 8) across visual cortex, including V1; no significant modulation by low-level dissimilarity (layer 2). Whole-brain cluster for layer 8 in visual cortex: 779 voxels (6,232 mm³).
- ROI parametric modulation (layer 8 vs. layer 2):
  - V1: layer 8 t(32)=6.79, p<0.001, d≈1.18; layer 2 W=190, p=0.159, BF10=0.58; layer 8 > layer 2: t(32)=8.05, p<0.001, dz=1.40.
  - LOC: layer 8 W=126, p=0.017, d≈0.55; layer 2 t(32)=-0.68, p=0.504; layer 8 > layer 2: t(32)=4.24, p<0.001, dz=0.74.
  - HVC: layer 8 t(32)=2.59, p=0.029, dz=0.45; layer 2 t(32)=-0.70, p=0.586; layer 8 > layer 2: t(32)=2.85, p=0.008, dz=0.50.
- Layer-wise mapping of predictive modulation: High-level DNN layers (7–8) explained most variance in prediction error scaling across EVC, LOC, and HVC; minor clusters reflected intermediate/low-level layers.
- Monotonic scaling: Regression of BOLD onto layer 8 dissimilarity showed positive monotonic relationships (see the sketch after this list): V1 t(32)=9.35, p<0.001 (mean r=0.09); LOC t(32)=3.27, p=0.004 (mean r=0.03); HVC t(32)=2.05, p=0.049 (mean r=0.02).
- Decoding fidelity: True class probability increased with layer 8 dissimilarity in V1 t(32)=5.50, p<0.001 (mean r=0.08) and HVC t(32)=3.89, p<0.001 (mean r=0.05); not significant in LOC t(32)=0.96, p=0.342.
- Controls: High-level visual dissimilarity (layer 8) outperformed animacy category, semantic word2vec dissimilarity, and random (untrained) layer 8. ROI contrasts (FDR-corrected): V1: layer 8 > all controls (all p<0.001, d>0.99); LOC: layer 8 > layer 2 (p=0.001, dz=0.74), > animacy (p=0.010, dz=0.61), trend vs. word2vec (p=0.060), trend vs. random (p=0.078); HVC: layer 8 > layer 2 (p=0.028, dz=0.50), > word2vec (p=0.016, d=0.54), trend vs. animacy (p=0.081), n.s. vs. random (p=0.189). Negative modulations: word2vec in V1 (p=0.020, dz=-0.60); animacy in LOC (p=0.039, dz=-0.51).
- Expectation suppression (unexpected > expected): Observed in visual cortex (including fusiform), anterior insula, inferior frontal gyrus, superior parietal lobule, paracingulate gyrus, supplementary motor cortex, consistent with prior prediction error signatures.
- VIFs indicated low collinearity (layer 2 VIF=1.42; layer 8 VIF=1.88; all <5), supporting distinct variance attribution.
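To make the regression and collinearity analyses in the bullets above concrete, here is a hedged Python sketch. All variable names, trial counts, and data are illustrative assumptions rather than the study's code; it mimics the per-participant trialwise regression of BOLD onto layer 8 dissimilarity, a group-level one-sample test on the resulting correlations, and the VIF check for two modulators entered in one GLM.

```python
# Sketch of the trialwise scaling analysis and VIF check (synthetic data).
import numpy as np
from scipy.stats import pearsonr, ttest_1samp
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n_subjects, n_trials = 33, 448  # e.g., 8 runs x 56 unexpected trials (stand-in)

# Per-participant trialwise correlation of BOLD with layer 8 dissimilarity:
r_per_subject = []
for _ in range(n_subjects):
    dissim_l8 = rng.normal(size=n_trials)                # z-scored modulator
    bold = 0.1 * dissim_l8 + rng.normal(size=n_trials)   # synthetic scaling
    r_per_subject.append(pearsonr(dissim_l8, bold)[0])

# Group-level test for a positive monotonic relationship:
print(ttest_1samp(r_per_subject, 0.0, alternative="greater"))

# Collinearity between the two modulators entered in one GLM, via VIFs:
dissim_l2 = rng.normal(size=n_trials)
dissim_l8 = rng.normal(size=n_trials)
X = np.column_stack([np.ones(n_trials), dissim_l2, dissim_l8])
print([variance_inflation_factor(X, i) for i in (1, 2)])  # expect values < 5
```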
Findings demonstrate that visual prediction errors scale with high-level visual feature surprise across the ventral stream, including in V1, diverging from the feedforward tuning observed in prediction-free contexts. This supports a hierarchical predictive processing account in which high-level predictions and their error signals are computed in later visual areas and broadcast as feedback to earlier areas, constraining lower-level processing. The monotonic increase in both BOLD amplitude and decodable stimulus information with high-level dissimilarity suggests that unexpected inputs trigger stronger, and potentially more sustained, recurrent processing to resolve recognition under violated predictions. Control analyses indicate that the observed scaling is specific to visual features rather than driven by task-relevant animacy, word-level semantic dissimilarity, or idiosyncrasies of the DNN architecture. The whole-brain results localize this modulation to visual cortex, implicating perceptual inference rather than decision- or motor-related systems, although the generic prediction error contrast (unexpected > expected) replicates the broader networks reported previously. Together, the data argue for top-down inheritance of high-level feature tuning in prediction contexts, providing evidence that predictions centrally shape perception and that early areas can reflect high-level surprise via feedback.
The study shows that neural signatures of visual prediction errors scale predominantly with high-level visual surprise, even in early visual cortex, revealing a dissociation between feedforward tuning and predictive modulation. This suggests that high-level predictions constrain processing in earlier stages via feedback, supporting hierarchical predictive processing as a general principle of visual perception. Future work should probe the flexibility and boundary conditions of prediction error tuning—e.g., tasks demanding low-level feature predictions, richer stimulus sets, improved low-level and semantic feature models, and temporally resolved methods to map the dynamics of feedback—and clarify the roles of attention, arousal, and neuromodulatory systems in gating and amplifying predictive signals.
Low-level surprise might be detectable with better feature models or stimulus sets tailored to evoke low-level predictions; however, early DNN layers robustly explained EVC feedforward responses in the localizer, arguing against model inadequacy. Neighboring DNN layers are highly correlated, which limits how many layers can be entered in the same GLM; analyses therefore focused on representative early (layer 2) and late (layer 8) layers, with similar results for layers 1 and 3 and for layer 7 (see the sketch below). The semantic model (word2vec) may be suboptimal relative to recent language or scene models; improved semantic modeling or tasks could yet reveal semantic modulations. Stronger modulations in V1 than in LOC/HVC may reflect genuine functional differences or neurovascular SNR disparities. Adaptation cannot explain the results, as stimulus frequencies were matched across conditions. Attention and arousal cannot be entirely ruled out but are unlikely to be primary drivers, given the specificity of the modulation to high-level visual surprise and analyses restricted to unexpected trials.
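The layer-collinearity point can be illustrated with a short sketch (all activations below are synthetic stand-ins): vectorized RDMs of neighboring DNN layers correlate strongly, which is why entering many layers into a single GLM is problematic, whereas distant layers decorrelate.

```python
# Sketch: quantifying correlations between layer RDMs (synthetic stand-ins).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm_correlation(rdv_a, rdv_b):
    """Spearman correlation between two condensed (vectorized) RDMs."""
    return spearmanr(rdv_a, rdv_b)[0]

rng = np.random.default_rng(2)
acts_l1 = rng.normal(size=(8, 200))                  # stand-in activations
acts_l2 = acts_l1 + 0.3 * rng.normal(size=(8, 200))  # neighboring layer: similar
acts_l8 = rng.normal(size=(8, 200))                  # distant layer: unrelated
rdv = {name: pdist(a, metric="correlation")
       for name, a in {"l1": acts_l1, "l2": acts_l2, "l8": acts_l8}.items()}
print(rdm_correlation(rdv["l1"], rdv["l2"]))  # high: collinear in one GLM
print(rdm_correlation(rdv["l2"], rdv["l8"]))  # near zero for these stand-ins
```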