Computational reconstruction of mental representations using human behavior


L. Caplette and N. B. Turk-Browne

This research from Laurent Caplette and Nicholas B. Turk-Browne explores a novel method for reconstructing mental representations of visual concepts. By analyzing participants' responses to images generated from deep neural networks, the study reveals how people associate semantic features with visual data, paving the way for insights into human perception and behavior.
Introduction

The study seeks to uncover the contents of human mental representations for complex, natural visual categories using behavior alone. Traditional reverse correlation with pixel noise requires many trials and is limited to simple, fixed pixel patterns. Natural categories are defined by abstract, transformation-invariant features. Deep convolutional neural networks (CNNs) provide mid-level features that align better with human vision and brain activity than raw pixels, though they have limitations and potential biases. The authors propose generalizing reverse correlation by sampling pseudo-random mid-level CNN features to create stimuli and collecting open-ended semantic labels, then mapping between visual (CNN) and semantic (word embedding) spaces. The goals are to reconstruct mental representations across many concepts, validate them behaviorally, assess generalization to new tasks and stimuli, compare human representations to those of a CNN, and recover individual observers’ representations.

Literature Review

Reverse correlation and classification image methods can reveal internal representations but are typically limited to simple targets and require thousands of trials. CNNs trained on natural images learn abstract features predictive of high-level visual cortex responses and have been used to reconstruct perceived images from brain data and to synthesize superstimuli, though reconstructing internally generated content (e.g., imagery) is harder. CNN features may be more behaviorally relevant than pixels, yet standard CNNs can be brittle under experimenter manipulations; adversarially robust CNNs produce more human-like features. Alternatives include parameters of 3D generative models (effective but category-limited, e.g., faces). Prior work typically targets a small set of predefined stimuli; mapping a continuous semantic space of labels to a continuous visual feature space could enable large-scale reconstruction, leveraging the relatively low dimensionality of both semantic and visual spaces.

Methodology

Overview: The method relates a high-dimensional CNN visual feature space to a high-dimensional semantic feature space derived from participants’ labels, enabling reconstruction of mental representations for many concepts.

Participants (main): 100 adults (aged 18–35) recruited from Prolific, after exclusions, with normal or corrected-to-normal vision. Compensation was $5. Participants were excluded for incomplete participation or if more than 25% of their words were low in concreteness.

Stimuli (CNN-noise): An adversarially robust ResNet-50 trained on ImageNet (L2 adversarial loss, eps=3.0) provided the feature space. The primary sampled layer was layer 37 (the last of stage 4); analyses used concatenated channel activations from layers 37 and 43. Channel activations were averaged spatially, standardized, and decorrelated (ZCA/Mahalanobis whitening), and a univariate density was estimated for each whitened feature. For each stimulus, random feature values were sampled, unwhitened (inverse transform), and unstandardized to obtain target CNN feature values. Images were optimized from random Fourier coefficients using activation-maximization-like gradient descent to minimize the MSE between current and target layer activations, with frequency normalization (1/f), color decorrelation (Cholesky), and the Adam optimizer (lr=0.05, β1=0.9, β2=0.999, weight decay=0.1) for 1500 iterations. The median R^2 between target and achieved CNN features was 0.93. All experimental stimuli were unique across participants.

Task: Each trial consisted of 200 ms of gray screen, a CNN-noise image shown for 5 s, and then response entry. Participants typed 1–3 concise noun labels per image (auto-suggestions were available but rarely accepted, median 9.3%). Each participant completed 5 practice and 100 experimental trials online via PsychoPy/Pavlovia. Stimulus size was calibrated to ~6° of visual angle.

Label processing and semantic features: Stopwords, 1-character items, and numbers were removed. Spelling was corrected with SymSpell (max edit distance 2), prioritizing visual words (Visual Genome list). Unrecognized words were removed. Remaining words were mapped to 300-d GloVe embeddings; multi-word entries were split unless recognized as a unit. For each stimulus, word vectors were averaged across its 1–3 responses to yield a trial-level semantic vector.

Visual feature processing: For each trial's stimulus image, CNN activations (layers 37 and 43) were collected, averaged spatially, and concatenated across channels to form the visual feature vector.

Dimensionality reduction and mapping: PCA with whitening was applied separately to visual and semantic features across trials, retaining the components that explain 90% of the variance (213 visual PCs and 127 semantic PCs in the main analysis). Linear associations were inferred via the outer product (equivalent to multivariate multiple regression with decorrelated variables) to obtain a visual-semantic matrix (visual PCs × semantic PCs). Significance was assessed with 1000 trial-label permutations to create a null distribution, controlling family-wise error via the maximum statistic.

Reconstruction of mental representations: For any concept label, its GloVe semantic vector is projected into semantic PC space and multiplied by the visual-semantic matrix to predict visual PC values; the PC transform is then inverted to return to CNN feature space. An image is optimized to align with the predicted CNN features using the caricature objective p = arg max_p (y · φ(p))^α / (||y|| ||φ(p)||), with α=4, 2000 iterations, and lr=0.05, where y is the predicted CNN feature vector and φ(p) is the CNN feature vector of the image parameterized by p. Small random transformations (rotation/translation/scale) are applied at each step to improve robustness and reduce artifacts. Bootstrap CIs on CNN feature vectors were also computed by resampling trials to visualize uncertainty (2.5% and 97.5% bounds).

Control analyses: (a) Null reconstructions, obtained by permuting semantic vectors across trials before mapping. (b) Seed variability, assessed by reconstructing with different random seeds. (c) Leave-name-out: the mapping is recomputed excluding responses that contain the target concept’s name. (d) No-embedding reconstruction for frequent words, using binary indicators per trial (concept named or not) to derive CNN features, which are then correlated with the main-analysis CNN features.

Prediction/generalization tests: (1) Predict the semantic content of held-out practice stimuli via the learned mapping (comparing predicted and actual semantic vectors by cosine similarity). (2) Predict which stimuli depict a concept using correlations between stimulus CNN features and concept CNN features; threshold to match counts and compute Dice coefficients. (3) Predict behavioral similarity judgments (word arrangement task with 60 concepts) by correlating the representational dissimilarity matrix (RDM) of reconstructed visual CNN features with the behavioral RDM, and compare against the RDM from semantic embeddings.

Comparison to DNN representations: The same stimuli were run through the DNN to obtain its top-3 ImageNet class labels (mapped via WordNet if needed), which were processed like the human data to obtain a DNN visual-semantic matrix and reconstructions. Within-group and between-group correlations of representations were compared across halves after projecting into a common concept-defined semantic space, with significance assessed by permutation.

Individual representations: Eight Yale participants (5 women, 3 men) each completed 6 sessions (750 unique stimuli in total; same stimuli across participants). Per-individual visual-semantic matrices were built (150 visual PCs; 55–120 semantic PCs), top concepts were reconstructed and visualized with t-SNE, and uniqueness was assessed via within-individual vs between-individual correlations in the common semantic space.
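The feature-sampling step for CNN-noise stimuli (standardize, ZCA-whiten, sample each whitened feature independently, then invert both transforms) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `fit_zca` and `sample_target_features` are hypothetical, and the per-feature density estimate is replaced here by simple resampling of the empirical whitened values, which is an assumption made to keep the example short.

```python
import numpy as np

def fit_zca(features, eps=1e-8):
    """Fit standardization and ZCA whitening on (n_samples, n_channels) activations."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + eps
    z = (features - mean) / std
    # Eigendecomposition of the covariance of the standardized features.
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    eigvals = np.clip(eigvals, eps, None)
    whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    unwhiten = eigvecs @ np.diag(eigvals ** 0.5) @ eigvecs.T
    return mean, std, whiten, unwhiten

def sample_target_features(features, n_stimuli, rng):
    """Draw target feature vectors whose whitened coordinates are sampled independently."""
    mean, std, whiten, unwhiten = fit_zca(features)
    white = ((features - mean) / std) @ whiten
    n, d = white.shape
    # ASSUMPTION: stand-in for the univariate density estimate; each whitened
    # feature is resampled independently from its own empirical values.
    idx = rng.integers(0, n, size=(n_stimuli, d))
    sampled = white[idx, np.arange(d)]
    # Undo whitening, then undo standardization.
    return (sampled @ unwhiten) * std + mean
```

Calling `sample_target_features(acts, 100, np.random.default_rng(0))` on an array `acts` of spatially averaged channel activations would return 100 target vectors in the original feature units. These would then serve as optimization targets for image synthesis, which this sketch does not cover.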
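The mapping and reconstruction steps can be sketched in the same spirit: whiten both spaces with PCA, take the outer product across trials, threshold it with a maximum-statistic permutation test, and push a concept's word vector through the matrix back into CNN feature space. The names below (`pca_whiten`, `visual_semantic_matrix`, `max_stat_threshold`, `reconstruct_features`) are hypothetical, and the inputs are assumed to be plain arrays of trial-level visual and semantic features rather than the study's actual data.

```python
import numpy as np

def pca_whiten(data, var_kept=0.90):
    """PCA with whitening; returns unit-variance PC scores and the fitted transform."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    # Smallest number of components whose cumulative variance reaches var_kept.
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept) + 1)
    scale = s[:k] / np.sqrt(len(data))  # standard deviation of each PC score
    scores = centered @ vt[:k].T / scale
    return scores, mean, vt[:k], scale

def visual_semantic_matrix(vis_scores, sem_scores):
    """Outer product over trials: rows are visual PCs, columns are semantic PCs."""
    return vis_scores.T @ sem_scores / len(vis_scores)

def max_stat_threshold(vis_scores, sem_scores, n_perm, rng, alpha=0.05):
    """FWER-controlling threshold from the null distribution of the largest |entry|."""
    null_max = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = sem_scores[rng.permutation(len(sem_scores))]
        null_max[i] = np.abs(visual_semantic_matrix(vis_scores, shuffled)).max()
    return np.quantile(null_max, 1 - alpha)

def reconstruct_features(word_vec, sem_model, vis_model, mapping):
    """Map a concept's embedding to predicted CNN features via the linear mapping."""
    _, sem_mean, sem_axes, sem_scale = sem_model
    _, vis_mean, vis_axes, vis_scale = vis_model
    sem_pc = (word_vec - sem_mean) @ sem_axes.T / sem_scale  # into semantic PC space
    vis_pc = mapping @ sem_pc                                # predicted visual PCs
    return (vis_pc * vis_scale) @ vis_axes + vis_mean        # back to CNN features
```

Because both sets of PC scores are decorrelated with unit variance, the outer product coincides with the least-squares regression coefficients, which is why no matrix inversion appears. Entries of the matrix whose absolute value exceeds the threshold from `max_stat_threshold` would be the ones reported as significant under this scheme.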

Key Findings
  • Visual–semantic mapping: Identified 67 significant associations between CNN PCs and semantic PCs (FWER-corrected, p<0.05). Nature-related semantic PCs aligned with grass- and water-like visual PCs; human/animal semantic PCs aligned with skin/fur/face-like visual PCs.
  • Reconstruction validation (2AFC): For 350 concepts (250 most-named + 100 frequent Visual Genome concepts named <10 times), mean accuracies were 88% (most-named), 84% (all validated VG concepts), and 74% (VG frequent but <10 named); all p<0.001. 270/350 concepts individually above chance (accuracy >75%; p<0.05 FWER-corrected). Best: bird, building, people (100%); worst: jeans 38%, white 30%, feet 25%.
  • Open-ended labeling (top 100 concepts): Exact correct label was most common for 37/100 concepts (p<0.001; 95% CI 28–47). Considering semantic proximity, 85/100 concepts showed more frequent semantically close responses (inverse relation of response frequency with semantic distance; all t(150)<-3.68, p<0.05 FWER-corrected).
  • Control: Null-permuted reconstructions looked markedly different; in a label task with 45 concepts, real reconstructions beat all three nulls for 78% of concepts (p<0.001). Semantic proximity of responses favored real over null for 98% of concepts; response entropy lower for 80%; semantic variability lower for 89% (all p<0.001).
  • Robustness to seed and uncertainty: Different seeds mainly changed spatial placement, not content; bootstrap CI reconstructions bounded plausible variability.
  • Embedding controls: Leave-name-out reconstructions remained similar (top-10 words feature correlations r=0.63–0.92; all p<0.001). No-embedding binary analysis for frequent words yielded very high correspondence with main analysis (top-10 r=0.91–0.99; all p<0.001); correlation declined with word frequency (log frequency explained 64% of variance in Fisher z), showing the embedding enables inferring less frequent and unseen concepts.
  • Generalization to new stimuli: Predicting semantic content of held-out practice images yielded mean cosine similarity 0.30 (95% CI 0.17–0.41; Z=20.13; p<0.001; range 0.09–0.49).
  • Predicting which stimuli depict a concept: For 10 most-named concepts, Dice coefficients 0.19–0.64; 9/10 significant (all except eyes; p from <0.001 to 0.15, FWER-corrected). Many “false positives” visually contained the concept, suggesting noisy ground truth.
  • Predicting behavioral similarity judgments: Visual RDM vs behavioral RDM Spearman ρ=0.56 (95% CI 0.458–0.647; p<0.001), exceeding semantic-embedding RDM vs behavioral ρ=0.48 (difference significant, Z=2.41; p=0.024).
  • Human vs DNN representations: Within-group representation correlations exceeded between-group (mean r_within=0.70 vs r_between=0.48; 95% CIs 0.701–0.705 vs 0.477–0.483; Z=8.73; p<0.002), indicating substantial differences. Qualitatively, DNN reconstructions appeared less clear/identifiable for many concepts.
  • Individual differences: Within-individual representation similarity exceeded between-individual (r=0.22 vs 0.11; 95% CIs 0.195–0.243 vs 0.104–0.124; Z=15.00; p<0.002), showing individually unique and stable conceptual representations.
  • Efficiency: On average, ~37 trials (~80 responses) sufficed to reconstruct a concept recognizable above chance in 2AFC, a large improvement over classic pixel-based reverse correlation (e.g., ~20,000 trials for a letter).
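
Two of the measures reported above, the Dice coefficient for predicting which stimuli depict a concept and the Spearman correlation between RDMs, can be computed as in the following sketch. The function names are hypothetical, the RDM is assumed to use correlation distance (1 minus Pearson r), which the summary does not specify, and the Spearman helper omits tie correction for brevity.

```python
import numpy as np

def dice_coefficient(predicted, actual):
    """Dice overlap between two boolean masks over stimuli."""
    predicted = np.asarray(predicted, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    total = predicted.sum() + actual.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(predicted, actual).sum() / total)

def predict_depicted(stimulus_feats, concept_feats, n_positive):
    """Flag the n_positive stimuli whose features correlate most with the concept."""
    s = stimulus_feats - stimulus_feats.mean(axis=1, keepdims=True)
    c = concept_feats - concept_feats.mean()
    r = (s @ c) / (np.linalg.norm(s, axis=1) * np.linalg.norm(c))
    flagged = np.zeros(len(r), dtype=bool)
    # Thresholding "to match counts": keep as many stimuli as were actually labeled.
    flagged[np.argsort(r)[::-1][:n_positive]] = True
    return flagged

def rdm_upper(features):
    """Upper triangle of an RDM; ASSUMPTION: distance is 1 - Pearson r between rows."""
    rdm = 1.0 - np.corrcoef(features)
    return rdm[np.triu_indices(len(features), k=1)]

def spearman(a, b):
    """Spearman rho as the Pearson correlation of ranks (no tie correction)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(a), rank(b))[0, 1])
```

For the similarity analysis, `spearman(rdm_upper(concept_feats), behavioral_upper)` would compare a model RDM with the upper triangle of a behavioral RDM, where `behavioral_upper` is a placeholder for data not included here.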
Discussion

The findings demonstrate that behavioral responses to images synthesized from pseudo-random mid-level CNN features can be used to recover linear associations between visual and semantic spaces and to reconstruct approximate visualizations of mental representations for many natural concepts. Reconstructions were recognizable to naïve participants, generalized to new stimuli, and predicted similarity judgments better than semantic co-occurrence embeddings. The approach distinguishes human from CNN internal representations and reveals stable, idiosyncratic individual differences, addressing the core question of what content is represented for concepts in the human mind. By leveraging dimensionality reduction and shared structure across concepts in both semantic and visual domains, the framework efficiently uses relatively few trials per concept while exploiting all trials to inform all concepts. This supports using such mappings to model and predict behavior beyond the specific task, and to interrogate representational structure across human observers and artificial networks.

Conclusion

The study introduces a scalable behavioral framework for reconstructing concept representations by mapping between semantic word embeddings and mid-level CNN features sampled via synthesized stimuli. It reconstructs hundreds of concepts with high recognition, generalizes to novel stimuli and tasks, differentiates human from DNN representations, and reveals individual-specific representations. Contributions include: a generalized reverse-correlation paradigm in abstract feature space; a visual–semantic mapping enabling large-scale reconstruction and prediction; and quantitative comparisons across humans, individuals, and models. Future directions include: exploring other feature spaces (layers, architectures, scene or generative models, 3D models), improving semantic modeling (visual or human-derived embeddings, sentence embeddings, non-linear mappings), extending to phrases/compositions, probing developmental/cultural/expertise differences, and linking representational idiosyncrasies to behavioral performance.

Limitations
  • Feature space constraints: Using mid-level CNN features restricts the space of possible images and may introduce biases related to network architecture, training data (e.g., ImageNet category biases), and chosen layers; features may differ from those used by humans in altered stimuli contexts.
  • Approximate reconstructions: Outputs are projections onto image space that highlight features strongly associated with concepts, not exhaustive depictions of all valid instances; magnitude of feature vectors is arbitrary; results can depend on optimization details.
  • Semantic embedding dependence: Word embeddings trained on text corpora may misrepresent human semantic structure for some concepts, biasing reconstructions—especially for infrequent labels; limited to single-word labels in this implementation.
  • Linearity assumption: The mapping between semantic and visual features was linear; non-linear relations may not be captured.
  • Behavioral labeling noise: Open-ended label data can be sparse or noisy for some stimuli/concepts; ground truth for stimulus content is estimated from limited responses.
  • Generalizability scope: Validations covered subsets of all possible concepts; additional concepts require separate validation; reconstructed plural/singular and closely related concepts need further testing.
  • Stimulus synthesis choices: Adversarial robustness, layer selection, and objective function affect stimulus and reconstruction appearance; small seed-dependent spatial variations remain.
  • Incompleteness of inferred features: Reported features may reflect salient perceived aspects rather than full mental representations; reconstructions could omit rarely reported but relevant features.