
Neuroscience
Emergence of syntax and word prediction in an artificial neural circuit of the cerebellum
K. Ohmae and S. Ohmae
Explore how Keiko Ohmae and Shogo Ohmae unravel the cerebellar role in language processing with their innovative biologically constrained artificial neural network. Their findings reveal a surprising unity in the cerebellar pathways for word prediction and syntax recognition, reshaping our understanding of cognition and motor functions.
Introduction
Language comprehension relies on interactions between the left neocortical language areas and the right lateral cerebellum, yet the circuit-level computations performed by the cerebellum remain unclear. Prior work implicates two cerebellar language functions: next-word prediction, which aids comprehension by placing inputs in a predictive context, and grammatical processing, especially recognition of subject-verb-object (S-V-O) syntax. Given the cerebellum’s uniform cytoarchitecture, a key question is how such distinct functions could arise within a single circuit. The authors develop a biologically constrained cerebellar ANN (cANN) incorporating a recurrent pathway to test whether one cerebellar-like circuit, trained solely for next-word prediction with minimal external structure in the inputs, can also spontaneously develop syntactic processing. The study aims to elucidate whether a single recurrent cerebellar computation underlies both prediction and rule extraction from sequences, thereby informing broader cerebellar contributions to cognition and suggesting therapeutic avenues in which training prediction could improve syntax comprehension.
Literature Review
Evidence links the right lateral cerebellum to language, with stronger causal support for cerebellar involvement in next-word prediction and reports of cerebellar contributions to syntactic processing (S-V-O recognition). These map onto broader cerebellar functions: prediction of external events and rule extraction from sequences. Conventional ANN language models (e.g., transformer-based) deviate from biological constraints and process sentences non-sequentially, limiting neuroscientific validity. Traditional cerebellar models are largely feedforward and do not incorporate recently identified recurrent nucleocortical pathways shown to be important for prediction. Clinical observations suggest developmental cerebellar involvement is critical for later neocortical language competence. Therefore, a biologically constrained, recurrent cerebellar circuit model trained on sequential word inputs is motivated to probe how prediction and syntax might co-emerge within one architecture.
Methodology
- Simulation and tools: Python on Google Colab; TensorFlow/Keras (v2.9/2.11). Code available at https://github.com/cANN-NLP/NLP_codes.
- Architecture: Three-layer cerebellar-inspired network: input layer (granule cells), middle layer (Purkinje cells), and output layer (cerebellar nuclei). Includes feedforward connections plus a recurrent pathway from the output back to the input layer (covering both the direct nucleocortical and indirect routes), and an error-teaching climbing fiber pathway that delivers prediction errors. A minimal code sketch of this circuit appears after this list.
- Signal persistence: Based on physiology, Purkinje predictive signals persist until the actual event; the model assumes Purkinje activity persists until the next word to maintain recurrent integration and permit error computation.
- Inputs/outputs: Sparse one-hot coding of words. 3000 input cells map to the 3000 most frequent words; all remaining words map to an "unknown" 3001st unit. The correct answer (the actual next word) shares the same 3000-dimensional one-hot format, so no external syntactic or semantic information is supplied. The output layer matches the correct-answer dimensionality (3000), enabling probabilistic next-word predictions; the top-5 most active output cells are used as prediction candidates for evaluation.
- Training task: Next-word prediction from sentences (e.g., classic novels). At each word, the network predicts the next word; the correct next word is provided as teaching signal.
- Learning rule: Cross-entropy loss between softmaxed outputs and the one-hot ground truth, with gradient-descent updates of the synaptic weights onto Purkinje cells and output cells (W ← W − ε ∂E/∂W). Neuronal nonlinearity is a leaky ReLU: input→Purkinje and output→input connections use a negative-side slope of a = 0.14, while Purkinje→output uses a = 1.0 (quasi-linear). In the base model, the recurrent output→input weights are fixed as an identity relay, and distinct feedforward and recurrent input-cell (granule-cell) populations are assumed.
- Analyses: PCA to visualize Purkinje dynamics across words; linear and nonlinear SVMs to classify S, V, and O categories from Purkinje activity and to assess separability along the feedforward pathway (input, Purkinje, output, and correct-answer signals); a decoding sketch follows this list. Functional tests include blocking the recurrent pathway to assess its necessity.
- Biologically constrained variants (a sign-constraint sketch follows this list):
1) Recurrent-path convergence/divergence: 48 output cells and 192 recurrent input cells (compression and decompression), with plasticity in the recurrent pathway.
2) Inhibitory-restricted Purkinje→output connections (GABAergic constraint) with functional disinhibition via firing-rate modulation.
3) Mixed sign-restricted input→Purkinje projections (excitatory-limited mimicking direct parallel fiber inputs and inhibitory-limited mimicking indirect interneuron pathways), reproducing a synaptic weight distribution with a peak at zero (silent synapses).
- Convergent cANN (module-based): Enforces Purkinje→output convergence and population coding in the feedforward output: 16 output cells per module represent a single word as a 16-D code, and 10 parallel modules share the same inputs. Correct answers are represented as 16-D compressed word vectors, derived from 100-D GloVe embeddings reduced to 16-D by PCA. After each prediction, only the module whose output is closest to the correct 16-D target receives the error signal (MOSAIC-like training; a module-selection sketch follows this list). Evaluation uses the top-5 modules' outputs as five candidates for comparability with the base model.
- Additional implementation notes: Increasing recurrent input cells beyond 192 did not improve performance, suggesting data-limited synapse training. Recurrent pathway removal was tested pre- and post-training. One-hot input/target representations ensure no externally supplied syntax/semantics, isolating internal emergence of structure.
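The following minimal sketch illustrates the circuit and learning rule described above, using TensorFlow/Keras as in the original study. It is not the authors' published code (see their GitHub repository for that); the Purkinje-layer size, learning rate, and layer names are illustrative assumptions, and the recurrent pathway is modeled as an identity relay of the previous prediction back to the input layer, as in the base model.

```python
# Minimal sketch of a cerebellum-like recurrent next-word predictor.
# Assumptions: vocabulary of 3000 words + 1 "unknown" unit, a hypothetical
# Purkinje-layer size, and SGD as a stand-in for the gradient-descent rule.
import tensorflow as tf

VOCAB = 3001          # 3000 most frequent words + the "unknown" unit
N_PURKINJE = 1000     # hypothetical Purkinje-layer size (not given in this summary)

# Leaky-ReLU slope a = 0.14 on the negative side (input->Purkinje, output->input);
# Purkinje->output uses a = 1.0, so a plain linear Dense layer stands in for it.
leaky = tf.keras.layers.LeakyReLU(alpha=0.14)

purkinje = tf.keras.layers.Dense(N_PURKINJE, activation=leaky, name="purkinje")
nuclei   = tf.keras.layers.Dense(VOCAB, name="cerebellar_nuclei")  # output layer

# Build the weights once with a dummy pass so variables exist before training.
_ = nuclei(purkinje(tf.zeros((1, VOCAB + VOCAB))))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # epsilon in W <- W - eps * dE/dW
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def train_on_sentence(word_ids):
    """Predict each next word in a sentence, feeding the prediction back as recurrent input."""
    recurrent = tf.zeros((1, VOCAB))                   # recurrent input cells start silent
    for t in range(len(word_ids) - 1):
        x = tf.one_hot([word_ids[t]], VOCAB)           # sparse one-hot current word
        target = tf.one_hot([word_ids[t + 1]], VOCAB)  # teaching signal: actual next word
        with tf.GradientTape() as tape:
            pk = purkinje(tf.concat([x, recurrent], axis=1))  # feedforward + recurrent input
            prediction = nuclei(pk)                           # next-word prediction (logits)
            loss = loss_fn(target, prediction)                # climbing-fiber-like error
        variables = purkinje.trainable_variables + nuclei.trainable_variables
        optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
        recurrent = tf.stop_gradient(prediction)       # identity relay back to the input layer
```

Calling `train_on_sentence([12, 7, 341, 3000])`, for example, makes one pass over a four-word sentence; the recurrent-pathway removal tests mentioned above correspond to holding `recurrent` at zeros.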
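A sketch of the decoding analyses named above (PCA visualization plus SVM classification of word roles from Purkinje activity), using scikit-learn. The arrays `purkinje_acts` and `roles` are random placeholders standing in for recorded Purkinje-layer activity and S/V/O labels; the preprocessing and train/test split are assumptions, not the authors' scripts.

```python
# Decode syntactic roles (S vs others shown here) from Purkinje-layer activity.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
purkinje_acts = rng.normal(size=(500, 1000))            # placeholder Purkinje activity
roles = rng.choice(["S", "V", "O", "other"], size=500)  # placeholder word-role labels

# PCA to visualize Purkinje dynamics in a low-dimensional space.
pcs = PCA(n_components=2).fit_transform(purkinje_acts)

# One-vs-others decoding, e.g. "is this word part of the subject?"
y = (roles == "S").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(purkinje_acts, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # nonlinear SVM; use kernel="linear" for the linear case
clf.fit(X_tr, y_tr)
print("S-vs-others decoding accuracy:", clf.score(X_te, y_te))
```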
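One way the sign restrictions in variants 2 and 3 above could be imposed in Keras is through kernel constraints, as sketched below; the clipping approach and the layer sizes are assumptions, since the authors' enforcement mechanism is not detailed in this summary.

```python
# Sign-restricted connections via Keras kernel constraints (illustrative only).
import tensorflow as tf

class NonPositive(tf.keras.constraints.Constraint):
    """Clip weights to be <= 0, mimicking a purely inhibitory (GABAergic) projection."""
    def __call__(self, w):
        return tf.minimum(w, 0.0)

# Variant 2: Purkinje -> output connections restricted to inhibitory signs.
inhibitory_nuclei = tf.keras.layers.Dense(3001, kernel_constraint=NonPositive())

# Variant 3: excitatory-limited input -> Purkinje connections (direct parallel fiber inputs).
excitatory_purkinje = tf.keras.layers.Dense(1000, kernel_constraint=tf.keras.constraints.NonNeg())
```

Constraints of this kind are applied after each weight update, which would tend to pile weights up at zero, consistent with the silent-synapse distribution noted for variant 3.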
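The module-selection (MOSAIC-like) step of the convergent cANN can be sketched as follows. The GloVe vectors here are random placeholders and `select_learning_module` is a hypothetical helper name; only the PCA compression to 16-D and the assign-error-to-the-closest-module rule come from the description above.

```python
# Compress correct-answer word vectors and pick the module that learns this step.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
glove_100d = rng.normal(size=(3000, 100))   # placeholder for real 100-D GloVe vectors

# Compress the correct-answer word vectors from 100-D to 16-D with PCA.
targets_16d = PCA(n_components=16).fit_transform(glove_100d)

def select_learning_module(module_outputs, target_word_id):
    """Return the index of the module whose 16-D output is closest to the 16-D target."""
    dists = np.linalg.norm(module_outputs - targets_16d[target_word_id], axis=1)
    return int(np.argmin(dists))   # only this module receives the error signal this step

module_outputs = rng.normal(size=(10, 16))  # ten parallel modules, 16 output cells each
winner = select_learning_module(module_outputs, target_word_id=42)
```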
Key Findings
- Next-word prediction learning: Correct prediction rate rapidly increased from 0% to ~17% early in training and gradually plateaued around ~38.3% (IQR 37.8–38.6) top-5 accuracy. The network differentiated predictions for repeated words (e.g., distinct instances of “the”), indicating context integration beyond the immediately preceding word.
- Word-type dependent performance (non-convergent cANN):
• Noun after verb: 22.9% (median; IQR 21.0–23.4), lower than the overall 38.3%, likely because a verb admits many possible noun partners.
• Prepositions after verbs: 60.0% (56.7–63.3).
• Object pronouns after verbs: 56.5% (52.2–60.9).
• Nouns after prepositions: 43.0% (42.1–45.6).
• Adjectives after “be” verbs: 27.5% (25.0–30.0).
- Recurrent pathway necessity: Blocking the recurrent input collapsed Purkinje activity for repeated function words and abolished context-sensitive predictions; noun-after-verb prediction dropped to 2.1% (IQR 1.1–2.1).
- Emergence of syntax in Purkinje dynamics: Despite one-hot inputs and targets containing no syntactic information, Purkinje-layer activity robustly separated subjects, verbs, and objects. Classification accuracies (nonlinear SVM) were high: S vs others ~96.1% (95.7–96.6), V ~96.0% (95.7–96.5), O ~95.8% (95.4–96.1). The same word was distinguished by role (e.g., “the” as part of subject vs object). Syntactic information peaked at Purkinje cells and was degraded at the output layer; recurrent input was critical for syntactic separation.
- Primary representation: S-V-O separating axes aligned with major PCA dimensions in Purkinje activity, indicating syntax as a primary encoded feature.
- Robustness to biological constraints (variant models):
• Recurrent-path convergence/divergence variant: 36.5% median correct prediction; syntactic separation S 93.4%, V 95.1%, O 95.0% (n=4).
• Inhibitory-only Purkinje→output variant: 36.8% correct; S 94.8%, V 96.2%, O 95.5% (n=8).
• Mixed sign-restricted input→Purkinje variant: 37.0% correct; S 94.5%, V 96.6%, O 95.5% (n=4); excitatory synapses exhibited a peak at zero (silent synapses), matching physiology better.
- Convergent, modular cANN: With 10 modules (16-D outputs), top-5 modules achieved 26.4% (median; IQR 25.7–27.1) correct (33.6% using all 10). Noun-after-verb: 16.0% (14.9–17.0); prepositions after verbs: 65.0% (60.8–66.7). Purkinje-layer S, V, O separation remained high (S 94.9%, V 93.8%, O 92.6%), surpassing syntactic information present in the 16-D correct answer signal. Overall, both prediction and syntax functions persisted despite altered output coding and convergence constraints.
Discussion
The study shows that a single biologically constrained cerebellar-like recurrent circuit trained for next-word prediction can simultaneously develop robust syntactic representations (S-V-O) in its intermediate layer. This directly addresses how the cerebellum’s uniform circuitry might support distinct cognitive functions—prediction and rule extraction—through shared network dynamics. The recurrent pathway is essential for integrating information across multiple time steps, enabling both context-sensitive prediction and the emergence of syntactic structure from sequence statistics even when neither inputs nor teaching signals carry syntactic cues. The model’s alignment with anatomical and physiological evidence, including nucleocortical recurrence and error-driven climbing fiber teaching, strengthens its plausibility. Conceptually, within the internal model framework, the cerebellum not only predicts future events but also extracts structural features from past sequences; the latter has been underappreciated. Clinically and developmentally, the results support a role for cerebellum in scaffolding neocortical language acquisition by providing syntactic signals early on; deficits from childhood cerebellar lesions align with this view. The model predicts cross-function transfer: improving prediction may enhance syntactic comprehension, suggesting rehabilitative strategies. The computational principles likely generalize beyond language to other cerebellar-supported motor and cognitive domains that depend on prediction and sequence structure.
Conclusion
This work introduces a pioneering biologically constrained cerebellar ANN that achieves human-characteristic language functions: next-word prediction and emergent syntactic processing within the same circuit. The model integrates feedforward and recurrent pathways with error-based learning, demonstrating that recurrent network dynamics can unify prediction and rule extraction in a uniform cerebellar architecture. The findings are robust to multiple physiological constraints (connection signs, convergence/divergence, synaptic weight distributions) and coding formats (sparse versus compressed outputs). Future research should: (1) extend grammatical processing beyond S-V-O to word classes and more complex syntax; (2) test proposed pathways for exporting Purkinje-layer syntactic signals to other brain regions; (3) explore training regimens in which prediction practice improves syntactic comprehension for clinical rehabilitation; and (4) leverage the convergent cANN design as a brain-inspired AI architecture in which rich outputs are learned from sparse inputs.
Limitations
- Modeling assumptions: Persistent Purkinje predictive signals across word intervals; identity-like relay in parts of the recurrent pathway in the base model; gradient-descent style synaptic updates approximating cerebellar learning.
- Input/target representations: One-hot sparse coding removes external semantic/syntactic information, simplifying the problem but limiting ecological validity; correct-answer signals in the base model contain no syntax and in the convergent model include only limited compressed semantics.
- Task scope: Syntactic emergence evaluated primarily for S-V-O; broader grammatical constructs and hierarchical syntax remain to be tested.
- Data and capacity: Increasing recurrent input size beyond 192 did not improve performance, suggesting dataset limitations; prediction accuracies vary by word class and are below state-of-the-art AI due to strict biological constraints.
- Output usage: Syntactic information degrades at the output layer; explicit biologically plausible pathways to read out intermediate-layer syntax were proposed but not implemented or validated experimentally.
- Convergent modular model: Lower overall prediction accuracy may reflect fewer learning opportunities per module; only one module learns per step (MOSAIC-like), which constrains efficiency.