Efficient neural codes naturally emerge through gradient descent learning

Computer Science

A. S. Benjamin, L. Zhang, et al.

This research by Ari S. Benjamin, Ling-Qi Zhang, Cheng Qiu, Alan A. Stocker, and Konrad P. Kording explores how artificial neural networks, when trained for object recognition, naturally develop heightened sensitivity to common environmental features, much like human sensory systems. The study shows that efficient codes result from gradient-like learning processes, an insight at the intersection of neuroscience and artificial intelligence.
Introduction

The study investigates why perceptual sensitivity in biological systems is non-uniform across stimulus dimensions and whether similar patterns can arise in artificial systems without explicit efficient-coding constraints. Classical findings such as the oblique effect show higher sensitivity near cardinal orientations, which is often explained by the efficient coding hypothesis: sensory systems allocate more resources to frequent features of the environment. Developmental evidence indicates that perceptual sensitivities mature with experience, implying learning-driven changes in neural representations. The central hypothesis is that general, task-oriented learning (specifically gradient descent) implicitly biases neural representations toward common, high-variance input features, producing efficient-code-like patterns even without resource constraints (e.g., noise, bottlenecks) or explicit efficient-coding objectives. The work tests this hypothesis empirically in deep networks and explains it analytically in linear models, linking learning dynamics to input statistics and developmental changes in sensitivity.

Literature Review

Prior psychophysical work demonstrates non-uniform sensitivity (e.g., the oblique effect) and links perceptual sensitivity to environmental statistics under the efficient coding framework. Efficient coding has been justified normatively via resource constraints (noise, neuron number, metabolic costs) and implemented in models via objectives such as information maximization, sparse coding, and ICA. In deep learning, implicit biases of gradient descent are recognized as shaping what networks learn, beyond architecture alone, with theories describing spectral/feature learning order in linear and kernel regimes. Previous reports show DNNs share similarities with ventral stream representations and can memorize noise, indicating large capacity and suggesting implicit constraints govern learned representations. Preliminary and concurrent findings have linked orientation sensitivity in ANNs to image statistics and architecture. The current work builds on deep linear network learning dynamics (e.g., Saxe et al.) and implicit regularization theories to provide a mechanistic account of efficient coding emerging from gradient-based learning.

Methodology
  • Sensitivity definition: For a layer with activations y in a network, sensitivity to a sensory variable θ is D(y; θ) = Σ_i (∂y_i/∂θ)^2, interpretable as Fisher Information under unit-variance Gaussian noise and one stimulus per θ.
  • Stimuli: Differentiable generators in PyTorch produced Gabor gratings (orientation θ; fixed spatial frequency and contrast; phase marginalized by averaging over 10 phases), sinusoidal gratings (spatial frequency varied; phase marginalized over 64 phases), and uniform color images (HSV hue varied; S,V marginalized over [0.5,1]).
  • Deep network analyses: Pretrained ImageNet models (ResNet18, VGG16 with batch norm, Vision Transformer) were evaluated layer-wise for sensitivity to orientation and hue via backpropagated gradients. Controls included untrained networks (default init), post-training intra-tensor weight shuffling, and retraining on modified ImageNet (all images rotated by 45° or hue-shifted by 90°). Training used SGD (lr=0.1 with decays every 30 epochs, momentum 0.9, weight decay 1e-5, batch size 256, 90 epochs) with standard augmentations (random horizontal flips, random crops).
  • Image statistics: Hue histograms computed over all ImageNet training pixels (hue binned at 1°, S,V in [0.5,1]).
  • Linear network experiments: Constructed >100k 32×32 grayscale natural image patches from ImageNet; PCA performed to obtain PCs and variances (power-law decay). Linear autoencoder (single or multilayer, no bottleneck; hidden dimension equal to input) trained with SGD (learning rate 1.0, batch size 16384) from small initialization (scaled by 1/100) to ensure rich regime. Sensitivity to PCs and spatial frequency measured during training. Analytical framework used SVD of weight product W=USV^T; prior results imply right singular vectors align with data PCs. For reconstruction, singular value growth rate dσ_i/dt ∝ λ_i σ_i (1−σ_i), where λ_i is the standard deviation of the ith PC, yielding sigmoidal growth and ordered learning by variance. Sensitivity to PC i equals σ_i^2; thus sensitivity tracks input variance at finite times.
  • Supervised experiment (Fig. 6): Trained a 3-layer ReLU network with a cross-entropy loss to classify the orientation of sinusoidal gratings into 60 bins. Orientation was sampled either uniformly or from 2cos(2x); random phase ensured unique images. Label information was controlled by injecting Bernoulli dropout noise on the one-hot labels at an orientation-dependent rate σ(θ), calibrated so that the total dataset information per orientation was uniform, i.e., p(θ)·I[y_θ] constant with I[y_θ] = 1 − H2(σ(θ)), where H2 is the binary entropy. First-layer sensitivity to orientation was tracked over training under three conditions: varying label noise at uniform orientation frequency, varying orientation frequency without label noise, and orientation-dependent noise that balances frequency against label information so that information is equal across orientations.
  • Architectural analysis: Examined role of spatial pooling and overlapping receptive fields (e.g., AlexNet-style oversampling of rectangular grids) on residual cardinal sensitivity independent of training data rotations.
  • Human comparison: Mapped model sensitivity to spatial frequency “detection thresholds” via inverse squared sensitivity, compared to psychophysical data (e.g., Caelli et al., spatial frequency thresholds linear with frequency) and developmental acuity trajectories (linear increase with age).
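The sensitivity measure defined above, D(y; θ) = Σ_i (∂y_i/∂θ)², can be illustrated with a toy example. The sketch below is not the paper's code: the grating generator and the random linear layer `W` are stand-ins, and the derivatives are estimated by central finite differences rather than by backpropagation through a differentiable stimulus generator.

```python
import numpy as np

def grating(theta, size=16):
    """Toy 2D sinusoidal grating at orientation theta (radians)."""
    ys, xs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return np.sin(2 * np.pi * (np.cos(theta) * xs + np.sin(theta) * ys) / size)

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 16 * 16))  # stand-in for one trained layer

def sensitivity(theta, eps=1e-4):
    """Estimate D(y; theta) = sum_i (dy_i/dtheta)^2 by central differences."""
    dy = (W @ grating(theta + eps).ravel()
          - W @ grating(theta - eps).ravel()) / (2 * eps)
    return float(np.sum(dy ** 2))

s = sensitivity(np.pi / 4)  # sensitivity at an oblique orientation
```

In the paper's actual analyses the gradients ∂y_i/∂θ are computed exactly, by backpropagating through differentiable stimulus generators in PyTorch; finite differences are used here only to keep the sketch dependency-free.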
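The information-balancing construction for the supervised experiment can also be sketched. The orientation distribution p(θ) below is hypothetical (chosen only to be non-uniform over 60 bins); the calibration itself follows the identity from the methods, choosing σ(θ) so that p(θ)·(1 − H2(σ(θ))) is constant across orientations:

```python
import numpy as np

def H2(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def inv_H2(h):
    """Invert H2 on [0, 0.5] by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if H2(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical non-uniform orientation distribution over 60 bins
theta = np.linspace(0.0, np.pi, 60, endpoint=False)
p = 2.0 + np.cos(2.0 * theta)
p /= p.sum()

k = p.min()  # target per-orientation information p(theta) * I[y_theta]
noise = np.array([inv_H2(1.0 - k / p_i) for p_i in p])  # rate sigma(theta)
info = 1.0 - H2(noise)   # label information I[y_theta] after dropout noise
total = p * info         # approximately constant (= k) across orientations
```

By construction the most frequent orientations receive the most label noise, so any residual sensitivity advantage for frequent orientations must come from frequency itself rather than from total label information.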

Key Findings
  • Deep networks trained on ImageNet exhibit higher sensitivity to changes near cardinal orientations, paralleling human oblique effect and matching natural image statistics. Sensitivity increases deeper in networks and is robust to stimulus parameterization.
  • Sensitivity patterns depend on learning and input statistics: untrained networks lack the pattern; weight shuffling post-training abolishes it; training on images rotated by 45° shifts sensitivity toward oblique orientations. However, architectural factors (e.g., overlapping pooling on rectangular grids) contribute a learning-independent bias toward cardinals, so rotation does not fully rotate the pattern.
  • Hue sensitivity in ResNet18 tracks the empirical hue frequency in ImageNet; peaks roughly align with human hue sensitivity in perceptually uniform color space. Training on hue-rotated images shifts network hue sensitivity accordingly, indicating causality from input statistics.
  • Linear autoencoder analysis shows that gradient descent causes the weight singular values to grow at rates proportional to data PC standard deviations, leading networks to learn high-variance (common) components first. Consequently, at finite training time, output sensitivity is higher for earlier PCs, despite the existence of an unconstrained exact solution (W=I).
  • The linear model reproduces human-like spatial frequency sensitivity: inferred detection thresholds increase approximately linearly with frequency above a cutoff and saturate below it; model “acuity” (the highest frequency whose inferred threshold remains below a fixed criterion) increases nearly linearly with training steps, mirroring developmental increases in human spatial acuity.
  • Supervised tasks disentangling frequency and label information show that gradient descent prefers more frequent features even when the total information per label is experimentally equalized via orientation-dependent label noise. Features uninformative about labels are not learned despite being frequent, confirming dependence on task relevance plus an additional frequency-driven bias.
  • Overall, a frequency/sensitivity correspondence emerges from gradient descent learning dynamics across unsupervised and supervised settings, even in overparameterized, noiseless networks, providing a mechanistic account of efficient coding signatures.
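The singular-value dynamics behind these findings can be reproduced with a few lines of numerical integration. The sketch below (illustrative, not the paper's code) integrates dσ_i/dt ∝ λ_i σ_i (1 − σ_i) with Euler steps from a small initialization; the per-PC scales λ_i are hypothetical power-law values standing in for natural image statistics.

```python
import numpy as np

# Illustrative per-PC scales (power-law decay, as for natural image patches)
lam = 1.0 / np.arange(1, 6)

sigma = np.full_like(lam, 1e-3)  # small initialization ("rich" regime)
dt, steps = 0.05, 200
for _ in range(steps):
    sigma += dt * lam * sigma * (1 - sigma)  # d(sigma_i)/dt = lam_i * sigma_i * (1 - sigma_i)

pc_sensitivity = sigma ** 2  # sensitivity to PC i equals sigma_i^2
```

Because each σ_i follows a logistic curve whose growth rate is set by λ_i, high-variance components saturate first: at any finite training time the singular values, and hence the sensitivities σ_i², are ordered by input variance, even though the fully converged solution (W = I) would be uniformly sensitive.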

Discussion

The findings support the hypothesis that gradient-based learning inherently biases neural representations toward common, high-variance input features, yielding efficient-coding-like sensitivity profiles without explicit resource constraints. Artificial networks trained on natural images exhibit orientation and hue sensitivities aligned with dataset statistics, and linear theory explains this via singular value dynamics: components with larger variance are learned faster, so finite learning leads to elevated sensitivity for common features. This mechanistic account complements normative efficient coding (which assumes explicit constraints like noise or limited neurons), offering an alternative pathway by which similar sensitivity patterns can arise. The approach also accounts for developmental phenomena: as learning progresses, sensitivity and acuity improve predictably, without requiring reductions in internal noise. While nonlinearity and architectural biases complicate direct generalization from linear models, the qualitative alignment between theory, deep ANN behavior, and human psychophysics suggests that learning dynamics are a significant driver of observed perceptual sensitivity patterns. The work distinguishes between implicit limitations due to finite learning and explicit resource constraints, arguing both can yield similar signatures. It also situates the results within rich vs. lazy learning regimes and highlights possible complementary influences from training stochasticity (e.g., batch noise, dropout).

Conclusion

This study demonstrates that efficient-coding-like neural sensitivities emerge naturally from gradient descent learning. Empirically, deep networks trained on ImageNet show sensitivity patterns for orientation and hue that track dataset statistics and shift when statistics are altered. Analytically, in linear networks, singular values grow at rates set by input variances, causing preferential learning of frequent features and elevated sensitivity at finite training times. The mechanism extends to supervised settings, where frequency biases sensitivity even when label information is equalized. These results provide a mechanistic account connecting learning dynamics to efficient coding signatures and offer explanations for developmental changes in perceptual acuity. Future research directions include quantitatively modeling human learning rates as functions of input statistics and task importance, experimental manipulations that independently vary frequency and label information, exploring effects of batch size and stochastic training, extending analyses to nonlinear networks and kernel regimes, and investigating adaptive phenomena (e.g., sensory adaptation) under task-driven continual learning.

Limitations
  • The core mathematical analysis relies on deep linear networks with small initializations; extensions to nonlinear networks and biological systems are nontrivial due to feature interactions and lack of exact decompositions.
  • Architectural factors (e.g., convolutional sampling on rectangular grids, overlapping pooling) can impose sensitivity biases independent of learned statistics, complicating attribution.
  • Networks studied are overparameterized and largely noiseless at evaluation; while demonstrating the sufficiency of learning dynamics, this setting differs from biological constraints.
  • Empirical findings are primarily qualitative; direct quantitative fits to human psychophysical data are limited in scope.
  • The supervised frequency–information dissociation uses synthetic stimuli and engineered label noise; generalization to complex real-world tasks and labels may involve additional factors.
  • Analyses focus on variance as the key statistic; frequency–variance relationships can differ for non-binary or strongly dependent features.