Computer Science
Comparing feedforward and recurrent neural network architectures with human behavior in artificial grammar learning
A. Alamia, V. Gauducheau, et al.
This fascinating study by Andrea Alamia, Victor Gauducheau, Dimitri Paisios, and Rufin VanRullen compares how well feedforward and recurrent neural networks mimic human behavior during artificial grammar learning. Discover how recurrent networks outperform their feedforward counterparts, especially on simpler grammar tasks, highlighting their potential for modeling explicit learning processes.
~3 min • Beginner • English
Introduction
The study asks which neural network architecture—feedforward (FF) or recurrent (RR)—better captures human behavior in artificial grammar learning (AGL) across different levels of the Chomsky hierarchy (regular, context-free, context-sensitive). Formal Language Theory defines languages via grammars of varying computational complexity, but prior human AGL results suggest that cognitive complexity does not align with formal complexity. Competing cognitive accounts propose implicit learning of rules, acquisition of microrules/chunks, or sequence learning via recurrent-like mechanisms. AGL also distinguishes implicit from explicit learning; simpler grammars tend to be processed explicitly, whereas more complex grammars rely on implicit processes constrained by working memory limits. The authors hypothesize that, when trained on a number of examples comparable to human exposure, RR networks will more closely match human learning—particularly for simpler, more explicitly learned grammars—while FF architectures may align more with implicit learning dynamics.
Literature Review
Prior AGL work shows humans perform above chance after viewing a few dozen examples, often without explicit rule knowledge. Theories include Reber’s implicit learning hypothesis, microrule/chunking accounts, and recurrent sequence-learning models. The implicit vs. explicit learning literature points to dual systems, with implicit processes being automatic and explicit processes hypothesis-driven; complex grammars are more likely to engage implicit processing due to working memory limits. Neural network studies have historically emphasized recurrent models for AGL (including SRNs and LSTMs) and often used large training datasets (thousands to millions of sequences), limiting fair comparison with human learning. Some studies suggest RR networks can represent finite-state automaton (FSA) states and learn CF/CS languages by counting, but typically only after extensive training. The visual cognition literature relates FF processing to rapid, unconscious vision and RR processing to conscious, top-down processes, suggesting a computational parallel for AGL.
Methodology
Datasets/grammars: Four grammars spanning Chomsky hierarchy levels: two regular grammars (A, B), one context-free (CF, Type II), and one context-sensitive (CS, Type I). Regular A used a 4-letter vocabulary with sequence lengths 2–12; Regular B used 5 letters with lengths 3–12; sequences longer than 12 were discarded. CF and CS grammars used 10 letters arranged in 5 pairs, with lengths 4, 6, or 8; sequences were generated via symmetry: for CF, the second half was the mirror image of the first half with each letter replaced by its pair; for CS, the second half was a pair-wise translation of the first half in the same order. Incorrect sequences contained a single violation (for CF/CS, two non-identical letters swapped within one half; for regular grammars, a single letter inconsistent with the grammar). Generation details are illustrated in Fig. 1A; a minimal generation sketch follows below.
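As a minimal illustration of the CF/CS generation and violation scheme described above (the letter pairs below are placeholders, not the study's actual alphabet, and this is not the authors' generation code):

```python
import random

# Hypothetical letter pairing: 10 letters arranged in 5 pairs (placeholder alphabet).
PAIRS = {"A": "F", "B": "G", "C": "H", "D": "I", "E": "J"}

def make_sequence(grammar, half_length):
    """Generate a grammatical CF or CS sequence of length 2 * half_length."""
    first_half = [random.choice(list(PAIRS)) for _ in range(half_length)]
    while len(set(first_half)) < 2:                     # keep at least two distinct letters
        first_half = [random.choice(list(PAIRS)) for _ in range(half_length)]
    paired = [PAIRS[letter] for letter in first_half]
    if grammar == "CF":        # context-free: mirrored second half (nested dependencies)
        second_half = paired[::-1]
    elif grammar == "CS":      # context-sensitive: translated second half (crossed dependencies)
        second_half = paired
    else:
        raise ValueError(grammar)
    return first_half + second_half

def make_violation(sequence):
    """Create an incorrect sequence by swapping two non-identical letters within one half."""
    seq = list(sequence)
    half = len(seq) // 2
    start = 0 if random.random() < 0.5 else half        # choose which half to corrupt
    i, j = random.sample(range(start, start + half), 2)
    while seq[i] == seq[j]:                             # ensure the swap changes the string
        i, j = random.sample(range(start, start + half), 2)
    seq[i], seq[j] = seq[j], seq[i]
    return seq

print(make_sequence("CF", 3))   # e.g. ['B', 'E', 'A', 'F', 'J', 'G']
print(make_sequence("CS", 3))   # e.g. ['B', 'E', 'A', 'G', 'J', 'F']
```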
Human participants: N=56 (31 female, mean age 25.4 ± 4.7 years). Grammar-specific Ns: Regular A (n=15), Regular B (n=15), CF (n=15), CS (n=11). Ethics approval was obtained and informed consent given.
Human experimental design: Each trial: fixation 500 ms, then a letter string displayed until response; participants classified sequences as belonging/not belonging via two keys; immediate visual feedback for 2 s (green correct, red incorrect). No time constraints; accuracy and RTs recorded. Each participant completed 10 blocks in a ~1 h session. Blocks 1–8 (implicit) had 60 trials each (480 total); participants were not told about underlying rules. After block 8, a questionnaire assessed explicit rule knowledge: for Regular A/B, 7 multiple-choice questions; for CF/CS, indicating the wrong letter in 7 novel incorrect strings; confidence rated 0–100; participants also described rules in free report. Block 9 (explicit, 20 trials): participants were given a printed grammar scheme (Fig. 1A) and could consult it before each response. Block 10 (memory, 20 trials): same task without access to the scheme.
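Purely as an illustration of this trial structure, here is a console-style sketch (the actual experiment presumably used dedicated stimulus-presentation software; the key mapping and text display are assumptions):

```python
import time

def run_trial(sequence, is_grammatical):
    """Mock-up of one trial: 500 ms fixation, self-paced classification, 2 s feedback."""
    print("+")                                          # fixation cross
    time.sleep(0.5)                                     # 500 ms fixation
    t0 = time.perf_counter()
    answer = input(f"{sequence} -- belongs to the grammar? [y/n] ").strip().lower()
    rt = time.perf_counter() - t0                       # response time (no time limit)
    correct = (answer == "y") == is_grammatical
    print("correct" if correct else "incorrect")        # stands in for green/red feedback
    time.sleep(2.0)                                     # 2 s feedback
    return correct, rt

# One implicit block comprises 60 such trials; eight blocks precede the questionnaire.
```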
Neural networks: Implemented in Keras/TensorFlow. Goal: train FF and RR networks on the same datasets and on a number of trials comparable to human exposure.
Parameter search: Explored 4 parameter spaces with approximately 1,400; 7,900; 31,000; and 122,000 parameters. Two axes: number of layers (2, 3, 4, 5, 7, 10) and learning rate (20 values; distinct ranges for FF vs RR). For each grammar and parameter space, 120 networks were each trained 20 times with random initialization. Training set: 500 sequences (comparable to human 480); validation: 100; test: 200. All layers within a network had the same neuron count (except single-neuron output). For each grammar and parameter space, selected the network whose test accuracy after training on 500 examples had the smallest absolute difference from the average human accuracy in the last implicit block. Learning curves were then estimated for the selected models by varying training size from 100 to 500 in steps of 100 (20 random initializations each), using the same 100/200 validation/test sizes.
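A minimal sketch of this selection criterion, assuming hypothetical `build_network` and `train_eval` helpers and an illustrative learning-rate grid (the paper used distinct ranges for FF and RR):

```python
import itertools
import numpy as np

DEPTHS = [2, 3, 4, 5, 7, 10]                  # number of layers explored
LEARNING_RATES = np.logspace(-4, -1, 20)      # 20 values; actual ranges differed for FF vs RR

def select_model(build_network, train_eval, human_accuracy, n_init=20):
    """Pick the configuration whose mean test accuracy is closest to human accuracy.

    `build_network(depth, lr)` and `train_eval(model)` are assumed helpers that
    construct a model and return its test accuracy after training on 500 sequences.
    """
    best_cfg, best_gap = None, np.inf
    for depth, lr in itertools.product(DEPTHS, LEARNING_RATES):   # 120 configurations
        accs = [train_eval(build_network(depth, lr)) for _ in range(n_init)]
        gap = abs(np.mean(accs) - human_accuracy)                 # distance to human performance
        if gap < best_gap:
            best_cfg, best_gap = (depth, lr), gap
    return best_cfg
```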
Feedforward architecture: Fully connected dense layers. Input: one-hot encoding of up to 12 positions by K letters (12×K inputs), with right zero-padding for shorter sequences; K=4 (A), 5 (B), 10 (CF/CS). Hidden activations: ReLU; output: sigmoid. Loss: binary cross-entropy. Optimizer: SGD with Nesterov momentum 0.9, decay 1e-6. Training: 1 epoch (500 trials), batch size 15 in both the parameter search and the learning-curve estimation.
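A rough Keras sketch of this setup (layer width and learning rate are placeholders; the reported decay of 1e-6 is omitted here because the `decay` argument on SGD is only available in older Keras versions):

```python
import tensorflow as tf

def build_ff(seq_len=12, vocab=4, n_layers=2, n_units=32, lr=0.01):
    """Fully connected classifier over a flattened, right zero-padded one-hot sequence."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(seq_len * vocab,)))          # 12 x K input units
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(n_units, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))    # grammatical vs. not
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9, nesterov=True),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Training matched to human exposure: one pass over ~500 sequences, batch size 15.
# build_ff().fit(x_train, y_train, epochs=1, batch_size=15, validation_data=(x_val, y_val))
```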
Recurrent architecture: Fully connected recurrent layers; inputs presented one letter at a time (one-hot vector of alphabet size). Output: sigmoid classification after the final letter. Loss: binary cross-entropy; learning via backpropagation through time. Optimizer: RMSprop (rho 0.9, epsilon 1e-8, decay 0). Training: 1 epoch (500 samples), batch size 15 in both phases.
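A comparable Keras sketch for the recurrent model, assuming right-padded inputs for batching (layer width and learning rate are again placeholders):

```python
import tensorflow as tf

def build_rr(max_len=12, vocab=4, n_layers=2, n_units=32, lr=1e-3):
    """Stacked simple recurrent layers; letters are fed one at a time as one-hot vectors."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(max_len, vocab)))            # one letter (one-hot) per time step
    for i in range(n_layers):
        last_layer = (i == n_layers - 1)
        model.add(tf.keras.layers.SimpleRNN(n_units, return_sequences=not last_layer))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))    # decision after the final letter
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr, rho=0.9, epsilon=1e-8),
        loss="binary_crossentropy",                              # trained via backprop through time
        metrics=["accuracy"],
    )
    return model

# build_rr().fit(x_train, y_train, epochs=1, batch_size=15)      # one pass over ~500 sequences
```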
Data analysis: Human accuracy and RTs were analyzed with Bayesian ANOVA (JASP), reporting Bayes Factors (BF) and their estimation error. BF>3 indicates substantial evidence; >20 strong; >100 very strong; BF<0.3 suggests lack of an effect. For human–network comparisons, Bayesian ANOVA with factors TRIAL (training size: 100–500), AGENT (Human, FF, RR), and GRAMMAR (A, B, CF, CS) was used. Additional analyses examined performance by sequence length and ROC comparisons.
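For the ROC comparison, a minimal scikit-learn sketch with placeholder responses (the Bayesian ANOVAs themselves were run in JASP, not in Python):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_auc(y_true, scores):
    """Area under the ROC curve for one agent's graded responses on the test items."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return auc(fpr, tpr)

# Placeholder data: 1 = grammatical string, 0 = violation; scores are sigmoid outputs
# (networks) or per-item endorsement rates (humans).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
human_scores = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
rr_scores = np.array([0.8, 0.3, 0.9, 0.7, 0.2, 0.2, 0.6, 0.4])

print("human AUC:", roc_auc(y_true, human_scores))
print("RR AUC:   ", roc_auc(y_true, rr_scores))
```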
Key Findings
Human learning: Across all grammars, accuracy improved over implicit blocks (BLOCK effect BF >> 100), indicating above-chance learning; RTs showed no significant BLOCK effects (remaining ~2–3 s). Comparing the 8th implicit block with the explicit and memory blocks (CONDITION: implicit-8, explicit, memory) revealed significant differences for all grammars (BF >> 10), with accuracy increases in the explicit block; memory and explicit blocks were similar (BF << 3) for all grammars except Grammar B (BF=7272), suggesting participants could recall the rules. Questionnaire: For regular grammars, sensitivity for rule knowledge was higher for Grammar A than Grammar B (BF>100), with no difference in specificity (BF=0.647). For CF/CS, only 3 participants per grammar correctly reported the scheme and consistently identified the wrong letters; the others failed both tasks, justifying an ‘implicit’ subgroup.
Network parameter search: FF networks performed best with the fewest layers (2), while RR networks performed best with low learning rates. Networks closest to human accuracy were not always those with the highest raw accuracy (notably for RR in the regular grammars). Averaged across grammars, this pattern persisted: FF was best at shallow depth; RR at low learning rates.
Human–network comparison over training size: Bayesian ANOVA showed strong main effects of TRIAL, AGENT, and GRAMMAR (all BF >> 3e+15; error < 0.01%) and a strong AGENT×GRAMMAR interaction (BF=2.3e+14; error < 0.01%). Post-hoc per grammar: For CF and CS, Humans vs FF and RR vs FF differed (all BF>1000), while Humans vs RR did not (BF<1). For Regular A, all pairs differed (Human–FF, RR–FF BF>1000; Human–RR BF=10.53). For Regular B, no differences between agents (BF=0.79). ROC analyses corroborated that RR and Humans shared similar performance patterns, whereas FF underperformed (except Grammar B).
Performance by sequence length: Strong LENGTH effects in all grammars (all BF >> 1000), with better performance on shorter sequences. AGENT differences significant for all but Grammar B (BF=0.38); post-hoc tests again showed RR closer to Humans than FF.
Regular grammar generalization (10 grammars): RR outperformed FF across all 10 regular grammars. The RR–FF performance gap negatively correlated with multiple grammar complexity metrics (#letters, #states, #rules, #bigrams, minimum sequence length; all BF>5), indicating RR’s advantage is larger for simpler grammars. Two new grammars at opposite complexity extremes confirmed predictions: a large RR–FF difference (~0.11) for the simplest and a small difference (~0.01) for the hardest.
Additional observation: With larger training sets (~10,000 sequences; not shown), RR still exceeded FF on regular grammars (FF ~0.85, RR ~0.95) but differences diminished on CF/CS (both architectures ~0.75).
Discussion
Findings address the core question by demonstrating that recurrent architectures better match human AGL behavior than feedforward networks across grammar complexity levels and training dynamics, particularly when training examples are limited to human-like exposure. The temporal learning curves and sequence-length effects of RR networks mirror human patterns, suggesting that recursion and temporal dependency modeling are central to human grammar learning. The exception (Regular Grammar B) aligns with the idea that more complex or less explicit grammars may be learned implicitly, where FF architectures can approximate human performance. The extended analysis over 10 regular grammars shows RR advantages are most pronounced for simpler, more explicitly learnable grammars, supporting a mapping between RR and explicit processes, and FF and implicit processes, paralleling feedforward vs recurrent distinctions in visual cognition. The study emphasizes the importance of fair, matched training regimes for human–model comparisons and highlights learning rate as a critical parameter for RR performance.
Conclusion
Recurrent neural networks more closely reproduce human behavior in artificial grammar learning than feedforward networks, regardless of the grammar’s level in the Chomsky hierarchy, when trained on human-comparable numbers of examples. Across ten regular grammars, RR superiority is greatest for simpler grammars—those more likely to be learned explicitly—supporting the hypothesis that explicit knowledge is best modeled by recurrent architectures, while FF may capture implicit dynamics. The results endorse recursion as a key component in cognitive language processes. Future work should extend to more ecological grammars, explore neuroimaging to link computational and neural dynamics, and examine broader architectures (e.g., gated RNNs) with carefully controlled training to further validate these conclusions.
Limitations
- Training regimes were limited to ~500 examples and a single epoch (batch size 15), which, while matching human exposure, may undertrain certain architectures (e.g., LSTMs).
- Architectures were restricted to fully connected FF and simple fully recurrent layers; gated or convolutional/residual variants might mitigate issues like vanishing gradients and handle longer dependencies better.
- Parameter searches, while systematic (layers and learning rates across four parameter-count regimes), did not exhaustively explore all hyperparameters (e.g., neuron counts per layer, regularization, optimizers beyond SGD/RMSprop).
- The Chomsky hierarchy does not fully capture cognitive complexity; results may not generalize to natural languages and more ecological grammars.
- Sample sizes per grammar (n=11–15) are modest; explicit/memory blocks had only 20 trials each.
- RTs were not modeled by the networks; only accuracy was compared.
- The human design combined training and testing, which facilitates learning-curve comparison but differs from classic two-phase AGL designs.