Introduction
System identification algorithms, like other machine learning tools, rely heavily on the quality and quantity of training data. When data are scarce, complex models tend to overfit. Two main strategies exist to counter data scarcity: data augmentation and synthetic data generation. Data augmentation creates variations of existing samples, but is constrained by the characteristics of the original data. Synthetic data generation instead produces entirely new datasets that mimic real-world data, overcoming this limitation, although generating reliable synthetic data is challenging in its own right. While both strategies have been explored in other areas of machine learning, their application in system identification remains limited; existing work focuses on perturbing regressors or modifying time sequences through jittering and slicing, and physics-informed deep learning, which integrates physical laws into training, is a related line of work.

This paper proposes a novel method that leverages a pre-trained meta-model, a Transformer network trained on a wide class of systems, to generate synthetic data for a specific system of interest (the 'query system'). Conditioned on the query system's limited training data, the meta-model implicitly captures its dynamics and generates synthetic input-output data that support improved model estimation.
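The overall mechanism can be pictured as in-context learning: the meta-model receives the query system's measured data as context and produces outputs for unseen inputs. Below is a minimal, purely illustrative sketch in PyTorch style; the encoder/decoder interface and the name generate_synthetic are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): querying a pre-trained
# encoder-decoder meta-model for synthetic data. All names are hypothetical.
import torch

@torch.no_grad()
def generate_synthetic(meta_model, u_ctx, y_ctx, u_new):
    """Generate a synthetic output sequence for a new input sequence.

    u_ctx, y_ctx: measured input/output data of the query system (the context).
    u_new:        a fresh input sequence for which no measured output exists.
    """
    # The encoder summarizes the query system's dynamics from its training data.
    context = meta_model.encoder(torch.cat([u_ctx, y_ctx], dim=-1))
    # The decoder predicts the output the query system would produce for u_new.
    y_synth = meta_model.decoder(u_new, context)
    return y_synth  # (u_new, y_synth) forms a synthetic input-output pair
```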
Literature Review
The paper reviews existing literature on data augmentation and synthetic data generation in machine learning and system identification. It mentions data augmentation techniques used in image and natural language processing and highlights the broader applicability of synthetic data. It cites studies on data augmentation in system identification, such as the work by Formentin et al. [3] on nonlinear finite impulse response systems and Wakita et al. [4] on data augmentation for dynamical models of harbor maneuvers. The paper also positions its work within the context of physics-informed deep learning [5], which uses physical laws to constrain model solutions. The authors' previous work on a meta-model using Transformers [6] is introduced as the foundation for the proposed approach. The relevant literature on Transformers in natural language processing [7] and large language models [8, 9] is also referenced.
Methodology
The proposed methodology centers on a pre-trained meta-model, a Transformer network with an encoder-decoder architecture, that describes a broad class of dynamical systems. This meta-model is trained on a large dataset of synthetic data generated by simulators with varying configurations. At identification time, the encoder processes an input-output sequence from the query system (its training data), which acts as context for the meta-model; the decoder, given a new input sequence, predicts the corresponding output sequence, thereby producing a synthetic input-output pair. The training data from the query system thus serves a dual purpose: it provides the context for the meta-model and it enters the loss function used for model estimation. Model parameters are estimated by minimizing a loss that combines a fit term on the original training data with a fit term on the generated synthetic data, with a hyperparameter γ balancing the contribution of the two data sources. γ is tuned on a validation dataset, which is also used for early stopping to prevent overfitting, especially with small training datasets. The expectation over synthetic data in the loss is approximated by generating multiple synthetic sequences at each iteration of the optimization algorithm.
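A hedged sketch of one fitting iteration under these assumptions, reusing the illustrative generate_synthetic helper from above: the exact loss form and weighting convention in the paper may differ, but the idea is a real-data fit term combined with a Monte Carlo estimate of the synthetic-data fit, weighted by γ.

```python
# Hedged sketch of the combined loss; the weighting shown here is an assumption.
import torch

def fit_step(model, optimizer, u_train, y_train, meta_model, gamma,
             n_synth=4, seq_len=256):
    """One optimization step on the combined real + synthetic loss (sketch)."""
    optimizer.zero_grad()

    # Fit term on the real (scarce) training data of the query system.
    loss_real = torch.mean((model(u_train) - y_train) ** 2)

    # Monte Carlo approximation of the expected fit on synthetic data:
    # several synthetic sequences are generated anew at every iteration.
    loss_synth = 0.0
    for _ in range(n_synth):
        u_new = torch.randn(seq_len, 1)  # fresh excitation signal (assumed white noise)
        y_synth = generate_synthetic(meta_model, u_train, y_train, u_new)
        loss_synth = loss_synth + torch.mean((model(u_new) - y_synth) ** 2)
    loss_synth = loss_synth / n_synth

    # gamma in [0, 1] balances the real and synthetic contributions.
    loss = (1.0 - gamma) * loss_real + gamma * loss_synth
    loss.backward()
    optimizer.step()
    return loss.item()
```

Generating new synthetic sequences at each iteration approximates the expectation in the loss without having to store a large synthetic dataset up front.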
Key Findings
The paper demonstrates the effectiveness of the approach on a numerical example concerning the identification of Wiener-Hammerstein systems. A data-generating system, consisting of two linear time-invariant blocks interconnected by a static nonlinearity (a neural network), was used to create training, validation, and test datasets. The pre-trained Transformer meta-model, trained on a class of Wiener-Hammerstein systems, generated synthetic data for the query system, and a Wiener-Hammerstein model with 137 parameters was used as the parametric model. The hyperparameter γ was selected by grid search, minimizing the mean squared error (MSE) on the validation dataset. The results show that incorporating synthetic data substantially lowers the validation MSE, demonstrating its regularization effect, and raises the median R² coefficient on the test dataset from 0.889 to 0.956, indicating improved generalization. Overall, the inclusion of synthetic data prevents overfitting and improves model accuracy, particularly when training data is limited.
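As an illustration only, the γ selection described above could be organized as follows; the grid values and the train_model routine are placeholders, not the settings reported in the paper.

```python
# Illustrative gamma selection by grid search on validation MSE (assumed grid).
import copy
import torch

def select_gamma(make_model, train_model, u_val, y_val,
                 gammas=(0.0, 0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the gamma whose fitted model attains the lowest validation MSE."""
    best_gamma, best_mse, best_model = None, float("inf"), None
    for gamma in gammas:
        model = train_model(make_model(), gamma)  # fit with the combined loss for this gamma
        with torch.no_grad():
            mse = torch.mean((model(u_val) - y_val) ** 2).item()
        if mse < best_mse:  # keep the gamma with the lowest validation MSE
            best_gamma, best_mse, best_model = gamma, mse, copy.deepcopy(model)
    return best_gamma, best_model
```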
Discussion
The results confirm the hypothesis that using synthetic data generated from a pre-trained meta-model improves system identification in data-scarce scenarios. The significant improvement in both validation and test performance indicates that the knowledge transfer from the meta-model effectively supplements the limited training data. The choice of the hyperparameter γ is crucial in balancing the influence of real and synthetic data. Too much emphasis on synthetic data can lead to a decrease in performance. The success of the method depends on the quality of the meta-model and its ability to accurately represent the dynamics of the query system. The study’s findings underscore the potential of synthetic data generation, leveraging knowledge transfer, for enhancing system identification accuracy and robustness.
Conclusion
This paper presented a novel method for generating synthetic data in system identification by leveraging knowledge transfer from a pre-trained meta-model. The approach effectively mitigates the overfitting problem associated with limited training data. Experimental results demonstrated a significant improvement in model performance compared to using only real training data. Future work will focus on improving the meta-model to accommodate broader classes of systems, quantifying the uncertainty associated with synthetic data outputs for more robust model estimation, and integrating the meta-model's output as a prior in Bayesian estimation algorithms.
Limitations
The performance of the method depends heavily on the quality and representativeness of the pre-trained meta-model. If the meta-model does not accurately capture the characteristics of the query system's class, the generated synthetic data may be less effective. The choice of the hyperparameter γ requires careful tuning, and a robust method for automated selection could further improve the approach. The computational cost of training the meta-model can be significant, although this is a one-time cost that can be amortized over multiple system identification tasks.