Synthetic data generation for system identification: leveraging knowledge transfer from similar systems

D. Piga, M. Rufolo, et al.

Dario Piga, Matteo Rufolo, Gabriele Maroni, Manas Mejari, and Marco Forgione present a method for generating synthetic data in system identification, improving model generalization and robustness when real training data are scarce: a pre-trained meta-model, given the query system's observed input-output sequence as context, predicts its behavior and generates synthetic output sequences.

Introduction
The paper tackles the problem of overfitting and poor generalization in system identification when only small, costly datasets are available. The research question is whether synthetic data, generated by transferring knowledge from similar systems via a pre-trained meta-model, can improve model estimation and robustness under data scarcity. The context is the broader use of data augmentation and synthetic data in machine learning to expand limited datasets; however, generating reliable synthetic data for dynamical systems is challenging. The authors propose leveraging a pre-trained Transformer that captures a class of systems to produce synthetic input-output trajectories tailored to the query system by using its observed input-output sequence as context. The purpose is to augment training data and regularize model estimation, with validation used to balance contributions of real and synthetic data and to enable early stopping. This approach aims to mitigate overfitting while maintaining or improving predictive performance.
Literature Review
The authors situate their work within two strategies for limited data: data augmentation and synthetic data generation [1], [2]. Prior system identification contributions include: (i) perturbation of nonlinear FIR regressors with manifold regularization [3]; (ii) time-series augmentation via jittering and slicing for maritime maneuver modeling [4]. Physics-Informed Neural Networks (PINNs) integrate physical constraints via PDE residuals at collocation points and can be interpreted as training with synthetic data-derived regularization [5]. The proposed method builds on a class-modeling paradigm using Transformers for in-context learning of dynamical system classes [6], leveraging attention-based architectures [7] foundational to large language models [8], [9], with positional encoding adaptations for real-valued sequences [10]. Related work on efficient differentiation of linear dynamical blocks supports training [13], and early stopping aligns with regularization theory in neural networks [14].
Methodology
The method assumes the query system belongs to a broad class of dynamical systems for which a meta-model (an encoder-decoder Transformer) has been pre-trained using synthetic input-output sequences generated from simulators with randomly sampled configurations representative of the class. Pre-training minimizes an empirical supervised loss over many randomly sampled systems by splitting sequences into context and prediction segments and using mini-batch gradient descent. At inference, the available input-output training sequence from the query system is provided as context to the encoder. Given new input sequences drawn from the same distribution as the training inputs, the decoder generates synthetic output sequences, producing a potentially unbounded set of synthetic input-output trajectories for the query system. This can be viewed as knowledge transfer via zero-shot in-context learning: the meta-model infers system dynamics from the context and predicts outputs for new inputs.
Model estimation: A parametric model M(·; θ) (e.g., a Wiener-Hammerstein model with two LTI blocks and a static nonlinearity) is fitted by minimizing a composite loss that combines the empirical loss on real training data and on synthetic data, weighted by a nonnegative hyperparameter γ. The objective consists of the average loss over the real training samples plus γ times the average loss over synthetic samples. The expectation over synthetic data is approximated using q synthetic sequences per optimization iteration. New synthetic sequences are generated on the fly at each minibatch step, and parameters are optimized via stochastic gradient descent. A separate validation set is used to select γ (by grid search) and to implement early stopping, which prevents overfitting especially when γ is small or zero. The same validation data serve for both early stopping and hyperparameter selection, avoiding the need for additional data.
Implementation details in the example: The meta-model is a Transformer with 12 layers, model dimension 128, 4 attention heads, context length m = 400, and approximately 5.6M parameters, pre-trained on simulated SISO Wiener-Hammerstein systems with LTI order up to 10. Synthetic outputs are generated for inputs of length T = 200 matching the training input distribution. The parametric model mirrors the class structure (Wiener-Hammerstein with LTI order 10 and a 32-unit hidden-layer nonlinearity; 137 parameters). Optimization uses minibatch SGD (q = 1) with up to 6000 iterations, fast differentiation for the linear blocks, and early stopping.
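To make the training setup concrete, the following is a minimal PyTorch sketch of the composite-loss step under stated assumptions: the Wiener-Hammerstein model is simplified here to learnable FIR filters rather than the paper's order-10 transfer-function blocks, sample_input assumes a white-noise input distribution, and meta_model(u_ctx, y_ctx, u_syn) is a hypothetical interface to the pre-trained encoder-decoder Transformer, not the authors' actual API.

```python
import torch
import torch.nn as nn


class WienerHammerstein(nn.Module):
    """Minimal Wiener-Hammerstein sketch: LTI block -> static nonlinearity -> LTI block.

    For simplicity the two LTI blocks are learnable causal FIR filters; the paper
    instead uses order-10 transfer-function blocks with fast differentiation [13].
    """

    def __init__(self, n_fir=32, hidden=32):
        super().__init__()
        self.g1 = nn.Conv1d(1, 1, kernel_size=n_fir, padding=n_fir - 1)  # first LTI block
        self.f = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.g2 = nn.Conv1d(1, 1, kernel_size=n_fir, padding=n_fir - 1)  # second LTI block

    def forward(self, u):  # u: (batch, T, 1)
        T = u.shape[1]
        x = self.g1(u.transpose(1, 2))[..., :T].transpose(1, 2)  # causal FIR filtering
        z = self.f(x)                                            # static nonlinearity
        y = self.g2(z.transpose(1, 2))[..., :T].transpose(1, 2)
        return y


def sample_input(batch, T):
    # Placeholder: draw length-T inputs from the distribution used during training.
    # White noise is an assumption; the actual distribution is problem-specific.
    return torch.randn(batch, T, 1)


def composite_loss(wh_model, meta_model, u_real, y_real, u_ctx, y_ctx,
                   gamma=10.0, T=200, q=1):
    """Composite objective: MSE on real data + gamma * average MSE on synthetic data.

    At every minibatch step, q fresh synthetic sequences are generated on the fly
    by the frozen meta-model, which conditions on the query system's context
    (u_ctx, y_ctx) and predicts outputs for newly sampled inputs (zero-shot
    in-context learning).
    """
    loss_real = torch.mean((wh_model(u_real) - y_real) ** 2)

    loss_syn = 0.0
    for _ in range(q):
        u_syn = sample_input(u_real.shape[0], T)
        with torch.no_grad():                         # meta-model is not updated
            y_syn = meta_model(u_ctx, y_ctx, u_syn)   # hypothetical interface
        loss_syn = loss_syn + torch.mean((wh_model(u_syn) - y_syn) ** 2)

    return loss_real + gamma * loss_syn / q
```

In the paper's setup, γ is chosen by grid search on a hold-out validation set, which also drives early stopping; each optimizer step would evaluate composite_loss, backpropagate, and update only the parameters of wh_model.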
Key Findings
- Synthetic data acts as an effective regularizer: with γ = 0 (no synthetic data), validation MSE is about 5 times larger than training MSE, indicating overfitting. For γ ≥ 10, training and validation MSEs become similar, evidencing reduced overfitting.
- Using synthetic data (γ > 0) improves validation performance relative to γ = 0. However, when the synthetic loss contribution dominates the real-data loss by more than roughly an order of magnitude, validation performance degrades, likely due to epistemic uncertainty in the meta-model's outputs.
- Test performance over 100 Monte Carlo runs improves substantially with synthetic data: median R² increases from 0.889 (no synthetic data) to 0.956 (with synthetic data, γ selected by hold-out validation).
- The approach reliably enhances generalization in small-data settings for SISO Wiener-Hammerstein systems. The regularization effect is tunable via γ and benefits from early stopping.
Discussion
The study demonstrates that transferring knowledge from a pre-trained class meta-model to generate synthetic trajectories for a specific query system can mitigate overfitting when real data are scarce. By incorporating synthetic data into the training objective, the estimator balances fitting the noisy limited training set with patterns learned from analogous systems, improving generalization. The results show marked gains in validation and test accuracy (notably the increase in median R²), supporting the hypothesis that class-informed synthetic data enhances robustness. The sensitivity to the weighting γ highlights the need to calibrate the influence of synthetic data: moderate values reduce variance and prevent overfitting, while excessive reliance can introduce bias due to meta-model uncertainty. Overall, the findings are relevant for system identification scenarios where collecting long or diverse datasets is costly, offering a practical route to improved performance using class-level priors encoded by a Transformer.
Conclusion
The paper introduces a synthetic data generation and training framework for system identification that leverages a pre-trained Transformer meta-model of a system class. By conditioning on short query system datasets to generate additional trajectories, and by integrating these in a weighted training objective with early stopping, the method reduces overfitting and improves generalization. A numerical study on SISO Wiener-Hammerstein systems demonstrates significant gains, including an increase in median test R² from 0.889 to 0.956. Future research directions outlined by the authors include: (i) scaling and refining the meta-model to cover broader classes of dynamical systems; (ii) estimating uncertainty of the meta-model outputs to cast training with synthetic data as a maximum likelihood problem and to weight synthetic samples by reliability; (iii) integrating the meta-model output as a prior in Bayesian estimators (e.g., Gaussian Process Regression), especially in sparsely observed regions of the input space.
Limitations
- The effectiveness depends on the query system belonging to the class represented by the pre-trained meta-model; a mismatch can introduce bias.
- Synthetic data quality is limited by the epistemic uncertainty of the meta-model; excessive weighting of synthetic data degrades performance.
- The experimental meta-model targets SISO Wiener-Hammerstein systems with LTI order up to 10; generalization to other classes or to MIMO systems is not demonstrated.
- Evaluation uses a large, noise-free test set and relatively small training/validation sets; although intended to strengthen statistical conclusions, this setup may not reflect typical practical conditions.
- Hyperparameter tuning (γ) and early stopping rely on a validation set; performance is sensitive to this selection and to the chosen synthetic input distribution.