Computer Science
Face detection in untrained deep neural networks
S. Baek, M. Song, et al.
The study investigates whether neuronal face-selectivity can emerge innately or requires visual experience, and whether face-selectivity is special compared to selectivity for other objects. Prior work presents conflicting evidence: experience-dependent development (e.g., lack of normal face domains in face-deprived monkeys) versus observations of primitive face-selectivity without visual experience (e.g., infants’ preference for face-like stimuli, category-selective domains in congenitally blind adults). There is also debate on whether face-selectivity is a specialized function distinct from general object recognition, or whether object selectivity for non-face categories can arise similarly. Experimental constraints in controlling visual experience motivate a modeling approach. The authors use a biologically inspired deep neural network (DNN) mimicking the ventral visual stream to test if face-selective units can arise in completely untrained, randomly initialized networks and whether such selectivity enables face detection and extends to non-face objects.
The paper reviews: (1) evidence supporting experience-dependent development of face-selectivity in primates, including delayed emergence of robust tuning in IT, effects of deprivation, and experience-shaped preferred features; (2) evidence for innate or experience-independent selectivity, including infants’ behavioral preference for faces, face-selective cortical areas in congenitally blind adults, and early face-selective neurons in infant animals; (3) the debate on whether the fusiform face area (FFA) is face-specialized or can develop expertise for other objects (e.g., cars, birds); (4) modeling literature showing DNNs predict IT responses, random hierarchical networks can support cognitive functions without training (e.g., image classification, deep image prior), and provide a priori information about natural image statistics. These lines of work motivate testing for spontaneous emergence of object selectivity, including faces, in untrained hierarchical networks.
Model: AlexNet was used as a biologically inspired hierarchical CNN of the ventral visual stream, focusing on the five convolutional layers (feature extraction). Classification layers were discarded when probing unit selectivity. Initialization: Networks were completely untrained. Convolutional weights were randomly initialized using standardized schemes (Gaussian or uniform, zero mean; variance scaled as in efficient initialization) and varied across a wide range (5–200% of baseline standard deviation) to test robustness. Stimuli: A low-level feature-controlled dataset (derived from fLoc) with six classes (face, scrambled face, and four non-face objects: hand, horn, flower, chair) was used to assess selectivity while controlling luminance, contrast, size, position, and intra-/inter-class similarity. Novel face sets included: 16 images from Tsao et al.; 50 VGGFace2 images; 50 FaceGen synthetic faces (color and grayscale); and held-out images from the original dataset. Additional datasets tested invariance to size, position, rotation; viewpoint variation (-90°, -45°, 0°, 45°, 90°). For broader object selectivity, 1000 ImageNet classes were used; faces from VGGFace2 were added as a separate class in some analyses. Unit definition and selectivity: A "unit" is each spatial position in a Conv layer channel (e.g., Conv5 has 13×13×256 units). Responses were z-scored per unit across stimulus images. Face-selective units were defined by significantly higher mean responses to faces than to any non-face category (P < 0.001, two-sided rank-sum test). A face-selectivity index (FSI) quantified tuning strength in d′ form: FSI = (μ_face − μ_non-face) / sqrt((σ²_face + σ²_non-face)/2), where μ and σ² are the mean and variance of a unit's responses to face and non-face images. Control FSIs were computed from shuffled responses. Control images for part-based sensitivity: Scrambled faces (local patches permuted) and texform faces (global structure disrupted, local texture preserved) assessed sensitivity to global configuration versus local parts. 
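The unit-selection criterion and FSI above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes responses arrive as a (units × images) NumPy array, and the function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import ranksums

def face_selective_units(responses, face_idx, nonface_idx_by_class, alpha=1e-3):
    """responses: (n_units, n_images) array of unit activations.
    A unit counts as face-selective if its mean response to faces is
    higher than to EVERY non-face category, with significance assessed
    by a two-sided rank-sum test (P < alpha), per the criterion above."""
    # z-score each unit's responses across all stimulus images
    z = (responses - responses.mean(axis=1, keepdims=True)) \
        / responses.std(axis=1, keepdims=True)
    selective = []
    for u in range(z.shape[0]):
        face_r = z[u, face_idx]
        passes = all(
            face_r.mean() > z[u, idx].mean()
            and ranksums(face_r, z[u, idx]).pvalue < alpha
            for idx in nonface_idx_by_class
        )
        if passes:
            selective.append(u)
    return selective

def fsi(face_r, nonface_r):
    """d'-style face-selectivity index: difference of mean responses
    normalized by the pooled standard deviation."""
    return (face_r.mean() - nonface_r.mean()) / np.sqrt(
        (face_r.var() + nonface_r.var()) / 2.0
    )
```

A unit with strongly elevated face responses passes the per-category test and yields a large positive FSI, while shuffled responses give FSIs near zero.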
Analogous controls were used for non-face objects (e.g., gazania). Preferred feature images (PFIs): Two approaches estimated preferred inputs. (1) Reverse-correlation (RC) with iterative refinement: responses to 2500 random 2D Gaussian blobs were used to compute a weighted sum image; this PFI was iteratively refined by adding new random blobs plus the current PFI for 100 iterations. (2) X-Dream: a pre-trained GAN generated images whose latent codes were optimized via a genetic algorithm over 100 iterations to maximize the target unit’s response. A face-configuration index (FCI) quantified pixel-wise correlation between PFIs and 200 face stimuli. Invariance analyses: Responses of face-selective units were measured across variations in size, position, rotation (including inversion), viewpoint, and mirror-symmetric viewpoint specificity. Invariance indices and ANOVAs assessed viewpoint invariance and tuning symmetry. Face detection task: A linear SVM was trained on Conv5 responses to classify face vs non-face. To avoid double-dipping, 200 images per class were used for unit selection, with the remaining 60 per class split into 40 train and 20 test images per class. Performance was compared using (a) single face-selective units vs non-selective units and shuffled responses; (b) multiple units, varying counts from 1 to the total number of face units (n=465), versus randomly sampled non-selective units; and (c) all Conv5 units (n=43,264). Generalization was tested to faces with size/position/rotation variations after training on fixed conditions. Training effects: To examine how experience affects innate selectivity, AlexNet was trained for image classification on three datasets: (1) Face-reduced ImageNet (500 classes curated to exclude recognizable faces from ILSVRC 2010), (2) Original ImageNet (1000 classes), and (3) ImageNet with added face class (1001 classes including the study’s face images). 
Training used SGD (batch 128, momentum 0.9, weight decay 0.0005, 90 epochs, LR 0.01 decayed 10× every 30 epochs). Post-training, face-selective units, FSIs, counts, PFIs, and SVM detection performance were re-evaluated. Latent space clustering and object selectivity: For each of 1000 ImageNet classes and faces, Conv5 responses were analyzed using PCA; clustering consistency was quantified by the silhouette index (SI) using all PCs. The number of selective units per class was related to SI to assess whether simpler, more separable classes induce more innate selectivity. Statistics: Rank-sum tests predominated; Kolmogorov–Smirnov tests, one-way ANOVAs with Bonferroni corrections, and Pearson correlations were used where appropriate. Exact P-values, effect sizes, and Ns are reported in text and figures.
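The linear-SVM read-out used in the face-detection task can be sketched as follows. This is a schematic with scikit-learn, assuming responses are stored as a (samples × units) matrix; the function and variable names are illustrative, and unit selection must use a disjoint image set to avoid the double-dipping the authors guard against.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def face_detection_readout(train_resp, train_labels,
                           test_resp, test_labels, unit_idx):
    """Train a linear SVM on the responses of a chosen subset of units
    (e.g., face-selective Conv5 units) and return test accuracy.
    train_resp/test_resp: (n_samples, n_units) response matrices;
    labels: 1 for face, 0 for non-face."""
    clf = LinearSVC(max_iter=10000)
    clf.fit(train_resp[:, unit_idx], train_labels)
    pred = clf.predict(test_resp[:, unit_idx])
    return accuracy_score(test_labels, pred)
```

Comparing accuracy when `unit_idx` selects face-selective units versus an equal number of non-selective units reproduces the study's read-out comparison at a schematic level.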
- Spontaneous face-selectivity in untrained CNNs: A substantial number of face-selective units emerged in randomly initialized AlexNet without any training (mean 250 ± 63 units in Conv5 across 100 random networks, P < 0.001). Layer-wise frequencies: Conv1 0.008 ± 0.002%, Conv2 0.047 ± 0.009%, Conv3 0.491 ± 0.089%, Conv4 0.534 ± 0.103%, Conv5 0.579 ± 0.146% of units.
- Tuning strength comparable to primate IT: The distribution of FSI in untrained Conv5 face units was comparable in magnitude to monkey IT face-selective neurons and significantly above shuffled controls (various FSI definitions tested).
- Global configuration selectivity: Face units responded significantly more to intact faces than to scrambled or texform faces (Face vs Scrambled: P=1.71×10^-52; Face vs Texform: P=4.12×10^-30); responses to control images were not greater than to non-face images, indicating selectivity to holistic face configuration.
- Generalization to novel faces: Face units showed significantly higher responses to multiple novel face sets (Tsao et al., VGGFace2, FaceGen synthetic faces; P ≤ 1.47×10^-11) than to non-face images.
- Robustness to initialization: The number and FSI of face units were largely unchanged across 5–200% variations in weight scale and across Gaussian vs uniform initializations.
- Preferred feature images: PFIs derived via RC and X-Dream increased unit responses beyond those to face stimuli, confirming effective optimization. Face units’ PFIs exhibited face-like configurations; non-face selective units’ PFIs reflected their object classes. Face-configuration index (pixel-wise correlation to face images) was significantly higher for face-unit PFIs than for non-face PFIs (RC and X-Dream; P ≤ 5.63×10^-4 and P ≤ 1.93×10^-5, respectively).
- Invariances and behavioral hallmarks: Single face units maintained selectivity across size, position, and rotation variations; exhibited an inversion effect (lower responses to upside-down faces); included viewpoint-invariant units with increasing prevalence across layers and mirror-symmetric viewpoint-specific units.
- Face detection via read-out: An SVM trained on a single face-selective unit outperformed shuffled controls (P=2.97×10^-121), whereas non-selective units did not. Using multiple units, face-selective units yielded substantially higher performance than the same number of non-selective units across 1–465 units (P ≤ 1.45×10^-33). Performance using 465 face units nearly matched using all Conv5 units (43,264).
- Training modulates innate selectivity: Training on a face-including dataset increased FSI and improved face detection performance relative to untrained networks (P=1.04×10^-3), despite a reduction in the number of face units, suggesting sharpening and pruning of tuning. Training on face-reduced or original ImageNet (without explicit face label) decreased FSI and degraded face detection (face-reduced vs untrained P=1.06×10^-22). PFIs after face-including training showed clearer face configurations; face-reduced training disrupted face PFIs.
- Non-face object selectivity: In the 1000-class ImageNet stimulus set, 39 classes exhibited selective units in untrained networks. Controls (scrambled/texform) indicated global object configuration selectivity (e.g., gazania: intact vs scrambled P=2.67×10^-12; intact vs texform P=2.08×10^-14). Classes with simpler, more separable representations (higher silhouette index in Conv5 latent space) had more selective units (Pearson r=0.585, P=7.48×10^-5); no selective units were observed when SI < 0.036, suggesting a clustering threshold for emergent selectivity.
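The latent-space clustering analysis behind the silhouette-index result can be sketched as below; a minimal illustration with scikit-learn, assuming the Conv5 response matrix and class labels are given, with hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples

def class_silhouette(conv5_resp, labels, target_class):
    """Project responses into PCA space, then measure how tightly the
    target class clusters via its mean silhouette value. Per the result
    above, classes with higher SI tend to yield more innately selective
    units, with no selectivity below a threshold SI."""
    pcs = PCA().fit_transform(conv5_resp)  # all principal components
    s = silhouette_samples(pcs, labels)
    return s[labels == target_class].mean()
```

Because PCA is a rotation, silhouette values computed over all PCs match those in the raw response space; the projection mainly serves visualization and dimensionality bookkeeping.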
The findings demonstrate that hierarchical, randomly initialized feedforward architectures are sufficient to produce face-selective units with tuning properties reminiscent of primate IT, including holistic sensitivity, invariances, and the inversion effect. This supports the hypothesis that primitive face-detection capabilities can arise innately from the statistical structure of random feedforward circuitry before visual experience. The work reconciles conflicting empirical observations by showing that experience can subsequently sharpen or weaken such innate selectivity: explicit exposure and labeling can enhance tuning (higher FSI, better detection with fewer units via pruning), while deprivation or training without explicit face categorization can diminish it. The emergence of non-face object-selective units and their relation to latent-space clustering suggest that faces are not a special case but one of several categories with simple, separable statistics that promote selectivity. Conceptually, the results align with ideas from reservoir computing and the lottery ticket hypothesis, where random networks provide rich features usable by trained read-outs; here, however, single-unit functional tuning arises without any training, and a simple read-out (SVM) can leverage these representations for behaviorally relevant tasks.
The study shows that face-selective units emerge robustly in completely untrained, randomly initialized hierarchical neural networks and that these units enable face detection. Their tuning matches several hallmarks observed in primate IT. Training with appropriate data sharpens and prunes this innate selectivity, improving task performance even with fewer face units, while certain training regimes can weaken it. Moreover, selective units for non-face objects also arise, with their prevalence predicted by the separability (clustering) of class representations in latent space. These results suggest that random feedforward wiring provides a proto-organization for visual selectivity that experience then refines. Future work could test other architectures and biological constraints (e.g., more biologically plausible connectivity and learning rules), probe developmental trajectories and gene-guided regionalization, and investigate how top-down influences interact with innate selectivity.
- Biological plausibility: Convolution, weight sharing, and other CNN components are not strictly biologically realistic, limiting direct translation to neural circuitry.
- Spontaneity vs genetic programming: While selectivity arises from random weights, the development of such wiring in biology may require gene-driven processes; thus, “spontaneous” innateness remains to be fully established.
- Developmental refinement: The innate selectivity likely represents a proto-template that requires extensive refinement by visual experience and top-down processes; adult-like tuning is not implied.
- Regional localization: The model does not explain why face-selective neurons localize to specific cortical patches; genetic and developmental mechanisms are not modeled.
- Dataset and architecture specificity: Results are shown primarily in AlexNet; generalization to diverse architectures and more naturalistic early visual inputs warrants further validation.