Physics
Learning stochastic dynamics and predicting emergent behavior using transformers
C. Casert, I. Tamblyn, et al.
This research by Corneel Casert, Isaac Tamblyn, and Stephen Whitelam shows how a transformer neural network, originally developed for language processing, can learn the stochastic dynamics of a complex many-body system simply by observing a single trajectory. The learned dynamics can then be propagated forward to predict emergent behavior never seen during training, opening a route to understanding complex systems without traditional modeling techniques.
~3 min • Beginner • English
Introduction
Learning the dynamics governing a simulation or experiment is a difficult task, because the number of possible dynamical transitions increases exponentially with system size. For large systems, these transitions are too numerous to enumerate explicitly, and one usually coarse-grains or projects the system’s dynamical degrees of freedom into a smaller subspace that can be learned. Here, the authors present a dynamics-learning method that does not require projection or coarse-graining, even for large systems. They show that a transformer, popular in natural-language processing and computer vision, can express general stochastic dynamics without coarse-graining over degrees of freedom or preselecting a subspace of processes to learn. Trained offline by observation only, the transformer learns the rules of a model even when rules are numerous and nonlocal, and forward-propagated trajectories can reproduce observed behavior and predict behavior under unseen conditions.

Previous work has demonstrated learning deterministic dynamics (e.g., cellular automata) or stochastic dynamics for small state spaces via maximum-likelihood estimation, learning intermolecular potentials, discovering physical laws via machine learning, and predicting fluid dynamics with physics-informed neural networks. High-dimensional systems are often approximated by lower-dimensional models such as Markov state models, sometimes with variational methods and neural networks. The present work addresses learning full stochastic dynamics directly from raw trajectories without explicit coarse-graining.
Literature Review
Prior approaches include: maximum-likelihood estimation of generators for small state spaces; learning deterministic cellular automata and stochastic dynamics; rediscovery of physical force laws (e.g., gravity) via machine learning and symbolic regression; physics-informed neural networks for fluid dynamics; Markov state models and variational methods for coarse-graining with neural networks. These methods either assume low-dimensional representations, introduce architectural biases (e.g., CNN locality and translational invariance), or require explicit state-space enumeration. The authors position transformers, with attention mechanisms, as capable of learning relevant interactions (local or long-range) and symmetries directly from trajectories, avoiding fixed coarse-graining choices.
Methodology
Problem setup: Given a single continuous-time Monte Carlo (CTMC) trajectory ω of total time T, starting from configuration C0 and visiting K subsequent configurations with residence times Δt_k, the original transition rates W_{C→C′} are unknown. The goal is to learn these rates from observation only.
Synthetic dynamics and objective: Introduce a synthetic CTMC whose allowed moves (C→C′) include those observed in ω. Parameterize rates W_{C→C′}^{(θ)} with a neural network (transformer). Rather than generating many synthetic trajectories to match statistics, train by maximizing the log-likelihood U^{(θ)} that the synthetic dynamics would have produced the observed trajectory ω. For sufficiently long trajectories, maximizing U^{(θ)} recovers the original dynamics.
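For a continuous-time Markov jump process, the path log-likelihood has a standard closed form: each observed jump contributes the log of its rate, and each residence interval contributes a survival term proportional to the total exit rate, U^{(θ)} = Σ_k [ln W^{(θ)}_{C_k→C_{k+1}} − Δt_k Σ_{C′} W^{(θ)}_{C_k→C′}]. The sketch below is a minimal rendering of this computation with hypothetical names (`configs`, `dwell_times`, `rate_fn`); in the authors' method the rates come from the transformer parameterization, and θ is adjusted by gradient ascent on U^{(θ)}.

```python
import numpy as np

def trajectory_log_likelihood(configs, dwell_times, rate_fn):
    """Path log-likelihood U of a continuous-time Markov trajectory.

    configs     : the K+1 visited configurations C_0 ... C_K (hashable, e.g. tuples)
    dwell_times : the K residence times Delta t_k spent in C_k before each jump
    rate_fn     : rate_fn(C) -> dict mapping each reachable C' to the synthetic
                  rate W(C -> C'); the transformer parameterization enters here.
    """
    U = 0.0
    for k in range(len(dwell_times)):
        rates = rate_fn(configs[k])            # all outgoing rates from C_k
        escape_rate = sum(rates.values())      # R(C_k) = sum_{C'} W(C_k -> C')
        U += np.log(rates[configs[k + 1]])     # log-rate of the observed jump
        U -= escape_rate * dwell_times[k]      # survival term for the dwell time
    return U
```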
Model system: A 2D periodic square lattice of size L×L with N volume-excluding particles (density φ=N/L²). Each particle has an orientation pointing toward one of its four neighbors. Orientations rotate by ±π/2 at rate D. A particle attempts to translate by one lattice site: the move succeeds only if the target site is vacant, and occurs at a larger rate when the particle is oriented toward that site (its preferred direction) and at a much smaller rate ν otherwise. The steady state shows only small, dispersed clusters at low φ; at higher φ the system undergoes motility-induced phase separation (MIPS).
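A Gillespie-style (continuous-time Monte Carlo) update for a model of this kind is sketched below. It is illustrative only: the forward-hop rate `v`, assigning rate D to each ±π/2 rotation, and the per-move enumeration are assumptions, while the default values of ν and D follow the training conditions quoted below; the authors' exact rate definitions are given in the paper.

```python
import numpy as np

# Orientation index -> lattice displacement (right, up, left, down).
DIRS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def kmc_step(pos, spin, occ, L, D=0.1, nu=1e-4, v=1.0, rng=np.random):
    """One continuous-time (Gillespie) update of a lattice active-matter model.

    pos  : (N, 2) integer particle positions on an L x L periodic lattice
    spin : (N,)   orientation index of each particle (0..3)
    occ  : dict {(x, y): particle index} used to enforce volume exclusion
    Rates (illustrative): each +/- pi/2 rotation at rate D, hop to the vacant
    site the particle points at with rate v, hop to another vacant neighbor
    with the much smaller rate nu.
    """
    moves, rates = [], []
    for i, (x, y) in enumerate(pos):
        for s in (-1, +1):                           # the two rotations
            moves.append(("rot", i, s))
            rates.append(D)
        for d, (dx, dy) in enumerate(DIRS):          # the four translations
            target = ((x + dx) % L, (y + dy) % L)
            if target not in occ:                    # volume exclusion
                moves.append(("hop", i, target))
                rates.append(v if d == spin[i] else nu)
    rates = np.asarray(rates)
    total = rates.sum()
    dt = rng.exponential(1.0 / total)                # residence time in C_k
    kind, i, arg = moves[rng.choice(len(moves), p=rates / total)]
    if kind == "rot":
        spin[i] = (spin[i] + arg) % 4
    else:
        del occ[tuple(pos[i])]
        pos[i] = arg
        occ[arg] = i
    return dt
```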
Assumptions provided to the learner: The transformer is told only that rates are time-independent and that moves are single-particle rotations or translations (both assumptions can be relaxed). It is not told the numerical values of the rates, nor that the dynamics is local; in principle, rates may depend on the entire configuration.
Training data and conditions: Single training trajectory with T=5×10² on a 30×30 lattice at φ=0.124, ν=10^-4, D=0.1, where MIPS is absent.
Training modes: Mode 1 (unrestricted): The transformer predicts a separate rate for each possible particle move directly. Mode 2 (class-based): The transformer classifies each possible transition into one of N_θ classes; a second network assigns the class rate. N_θ is a complexity-control hyperparameter. As N_θ increases, maximum achievable log-likelihood increases up to a plateau indicating the minimal number of distinct processes in the original dynamics.
Architecture: A transformer with multi-head attention learns which particles and interactions affect each transition rate, enabling discovery of local or long-range dependencies and of symmetries (e.g., translational invariance) without architectural bias. Inputs are sequences of particles (positions and orientations); positions (x, y) and spins are embedded into d_emb-dimensional vectors using trainable matrices and summed per particle. Empty lattice sites are not explicitly encoded, so neighborhood occupancy must be inferred via the positional embeddings and attention. Attention uses scaled dot-product queries, keys, and values; the resulting per-particle representations pass through feed-forward layers. For Mode 1, a shared fully connected head maps each per-particle output to rates for each possible update. For Mode 2, a classifier with N_θ outputs and a softmax assigns a class to each move; a straight-through estimator enables backpropagation through the discrete class choice, and a per-class regressor outputs the rate. The architecture naturally handles varying particle counts, so it can be applied at different densities without retraining system-size-specific layers.
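The sketch below gives a minimal PyTorch-style rendering of this architecture: learned position and orientation embeddings summed per particle, multi-head self-attention, and the two output heads. Layer sizes, the single attention block, and the head names are hypothetical, and Mode 2's per-class regressor network is simplified here to one trainable log-rate per class; only the straight-through class choice follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RateTransformer(nn.Module):
    """Sketch of a transformer mapping a particle configuration to move rates.

    Hypothetical sizes: d_emb embedding width, n_heads attention heads, and
    n_moves possible single-particle moves (e.g. 2 rotations + 4 translations).
    Set n_classes to an integer to use the class-based Mode 2 head.
    """
    def __init__(self, L=30, d_emb=64, n_heads=4, n_moves=6, n_classes=None):
        super().__init__()
        self.embed_x = nn.Embedding(L, d_emb)        # trainable position embeddings
        self.embed_y = nn.Embedding(L, d_emb)
        self.embed_s = nn.Embedding(4, d_emb)        # orientation ("spin") embedding
        self.attn = nn.MultiheadAttention(d_emb, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_emb, d_emb), nn.ReLU(),
                                nn.Linear(d_emb, d_emb))
        self.rate_head = nn.Linear(d_emb, n_moves)   # Mode 1: one rate per move
        self.n_classes = n_classes
        if n_classes is not None:                    # Mode 2: class per move
            self.class_head = nn.Linear(d_emb, n_moves * n_classes)
            self.class_log_rate = nn.Parameter(torch.zeros(n_classes))

    def forward(self, x, y, s):
        """x, y, s: (batch, N) integer positions and orientations."""
        h = self.embed_x(x) + self.embed_y(y) + self.embed_s(s)  # sum per particle
        h, _ = self.attn(h, h, h)                    # particles attend to each other
        h = self.ff(h)                               # per-particle representation
        if self.n_classes is None:                   # Mode 1: direct positive rates
            return F.softplus(self.rate_head(h))     # (batch, N, n_moves)
        # Mode 2: hard class assignment with a straight-through estimator.
        logits = self.class_head(h).view(*h.shape[:2], -1, self.n_classes)
        probs = logits.softmax(dim=-1)
        hard = F.one_hot(probs.argmax(dim=-1), self.n_classes).float()
        onehot = hard + probs - probs.detach()       # gradients flow through probs
        return (onehot * self.class_log_rate.exp()).sum(dim=-1)  # rate per move
```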
Optimization: Maximize U^{(θ)} via gradient descent/backpropagation. AdaBelief optimizer with learning rate 1e-4; hyperparameters include d_h=6 and η=2. Training uses mini-trajectory segments over multiple epochs to improve efficiency and capture rare events. Initialization uses weights from prior transformer layers to speed convergence.
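A simplified training loop under these choices might look as follows. The step fields (`x`, `y`, `s`, `particle`, `move`, `dt`) are hypothetical names for the data each trajectory segment must carry, plain Adam stands in for the AdaBelief optimizer, and the reported warm-start initialization is omitted.

```python
import torch

def segment_log_likelihood(model, segment):
    """Path log-likelihood U of one mini-trajectory segment.

    Each step in `segment` is assumed to carry the configuration before the
    jump (x, y, s), the particle and move that were observed, and the
    residence time dt.
    """
    U = 0.0
    for step in segment:
        rates = model(step.x, step.y, step.s)              # (1, N, n_moves)
        escape = rates.sum()                               # total exit rate R(C_k)
        U = U + rates[0, step.particle, step.move].log()   # observed jump
        U = U - step.dt * escape                           # survival during dwell
    return U

def train(model, segments, epochs=100, lr=1e-4):
    # The paper uses AdaBelief; plain Adam is used here as a stand-in.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for segment in segments:                           # mini-trajectory segments
            opt.zero_grad()
            loss = -segment_log_likelihood(model, segment) # maximize U^(theta)
            loss.backward()
            opt.step()
```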
Validation and analysis: Compare learned log-likelihood to that computed from the original dynamics (unavailable to the network during training). Analyze the number of distinct rate classes via Mode 2 by scanning N_θ. Forward-propagate the learned dynamics to generate trajectories at multiple densities, including those not seen during training, and quantify phase separation using metrics: fraction f_4 of particles with four occupied neighbors and its variance, number of clusters n_c, and average cluster size ⟨s⟩.
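The clustering observables can be computed directly from an occupancy grid; the sketch below is one way to do so, using breadth-first search for cluster labeling on the periodic lattice (the paper does not specify the clustering algorithm, so that choice and the function name are assumptions).

```python
import numpy as np
from collections import deque

def phase_separation_metrics(occ, L):
    """Clustering observables on an L x L periodic occupancy grid `occ` (bool).

    Returns f4 (fraction of particles whose four neighbors are all occupied),
    the number of clusters n_c, and the average cluster size <s>.
    """
    nbrs = lambda x, y: [((x + 1) % L, y), ((x - 1) % L, y),
                         (x, (y + 1) % L), (x, (y - 1) % L)]
    particles = list(zip(*np.nonzero(occ)))
    f4 = np.mean([all(occ[nx, ny] for nx, ny in nbrs(x, y)) for x, y in particles])
    seen, sizes = set(), []
    for start in particles:                       # BFS over occupied sites
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        size = 0
        while queue:
            x, y = queue.popleft()
            size += 1
            for nx, ny in nbrs(x, y):
                if occ[nx, ny] and (nx, ny) not in seen:
                    seen.add((nx, ny))
                    queue.append((nx, ny))
        sizes.append(size)
    return f4, len(sizes), (float(np.mean(sizes)) if sizes else 0.0)
```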
Key Findings
- The transformer trained on a single trajectory accurately learns the stochastic dynamics of the lattice active-matter model, achieving a trajectory log-likelihood U^{(θ)} that matches that of the original dynamics, indicating numerically close rates.
- Complexity discovery: In Mode 2, the learned dynamics plateaus at N_θ=4 classes of moves, correctly identifying the model’s four distinct, translationally invariant processes (e.g., two rotation types and two translation types consistent with the original rules). The inset of Fig. 2b shows learned rates converging to original values.
- Locality and symmetry: Without being told locality, the transformer infers that interactions are local and translation-invariant. It can also learn long-range interactions when present (demonstrated in SI).
- Generalization and emergent behavior: When forward-propagated at densities not seen during training (e.g., φ=0.3, 0.5), the learned dynamics reproduces motility-induced phase separation (MIPS) seen in the original model, despite being trained at φ=0.124 where no phase separation occurs. Time-ordered snapshots show clustering comparable to the original model at φ=0.5.
- Quantitative agreement: Across densities, time-averaged observables—f_4 (fraction of particles with four occupied neighbors), its variance, the number of clusters n_c, and the average cluster size ⟨s⟩—match the original dynamics closely, with Mode 2 slightly outperforming Mode 1 (Fig. 4). Evaluation averages were computed over forward-propagated trajectories of length up to 10^4.
- Data efficiency and scalability: The approach avoids explicit enumeration of rates, whose number would be astronomically large (on the order of 10^80 possible rates for a system of this size). The method learns from a single trajectory and scales to large configuration spaces.
- Handling constraints: The network learns suppression of forbidden processes (e.g., those violating volume exclusion), though learned rates may be small but nonzero, leading to rare overlaps in forward simulations; Monte Carlo-based postprocessing or constraints can enforce exact zeros (SI Table S1).
- Applicability: Supplementary results indicate successful learning on another model (a supercooled liquid) and the capability to represent nonlocal rules.
Discussion
The study addresses the challenge of learning full stochastic dynamics in many-body systems without explicit coarse-graining or enumeration. By maximizing the path log-likelihood of a single observed CTMC trajectory with a transformer parameterization, the approach identifies both the structure (number of distinct processes) and the numerical values of transition rates. The model captures symmetries (translational invariance) and locality when present, yet remains flexible enough to learn long-ranged interactions. Critically, the learned dynamics generalize beyond training conditions: forward simulations at unseen densities reproduce motility-induced phase separation and its quantitative signatures, demonstrating prediction of emergent behavior. This ability to learn from observation-only data and to extrapolate to regimes with different macroscopic behavior has significant implications for studying complex physical systems where direct modeling is infeasible. The approach provides interpretable insights (e.g., number of processes, rate values) and a generative tool for simulating dynamics under new conditions.
Conclusion
Transformers can serve as expressive surrogates for the generators of many-body stochastic dynamics, enabling learning directly from single-trajectory observations without coarse-graining or explicit rate enumeration. On a lattice active-matter model, the method recovers the original dynamics (including its four distinct processes), learns symmetries and locality, and predicts motility-induced phase separation at densities not present during training, with strong quantitative agreement to the original dynamics. The framework is flexible and extensible: constraints (e.g., locality) can reduce parameter counts when known; attention allows learning long-range interactions when present; and the architecture naturally handles varying particle counts and densities. Future research directions include extending to off-lattice systems with suitable move parameterizations, enforcing hard constraints (e.g., exact zero rates for forbidden moves) during training, improving differentiable class assignment, exploring uncertainty quantification for learned rates, and applying the method to experimental trajectories and more complex nonlocal dynamics.
Limitations
- The learned move set was restricted to single-particle rotations and translations, and rates were assumed time-independent; although these assumptions can be relaxed, this prior restricts the hypothesis space in the reported experiments.
- The transformer in this work takes lattice-based configurations as inputs; applying to off-lattice systems requires additional parameterization of microscopic moves.
- Finite training data: Learning from a single finite-length trajectory can leave small residual errors; forbidden processes may acquire small nonzero learned rates, causing occasional particle overlaps in forward simulations.
- Mode 2’s class selection uses a straight-through estimator, which introduces approximation bias in gradient estimation.
- While the model inferred locality, incorporating known finite interaction ranges could further reduce parameters; absent such priors, learning may be less data-efficient.
- The approach assumes access to complete system configurations over time; partial observability or noisy measurements may require adaptations.