Synthetic data generation for system identification: leveraging knowledge transfer from similar systems

D. Piga, M. Rufolo, et al.

Dario Piga, Matteo Rufolo, Gabriele Maroni, Manas Mejari, and Marco Forgione present a method for generating synthetic data in system identification, improving model generalization and robustness when real training data are scarce: a pre-trained meta-model, given the query system's observed input-output sequence as context, predicts its behavior and generates synthetic output sequences.

Introduction
The paper tackles the problem of overfitting and poor generalization in system identification when only small, costly datasets are available. The research question is whether synthetic data, generated by transferring knowledge from similar systems via a pre-trained meta-model, can improve model estimation and robustness under data scarcity. The context is the broader use of data augmentation and synthetic data in machine learning to expand limited datasets; however, generating reliable synthetic data for dynamical systems is challenging. The authors propose leveraging a pre-trained Transformer that captures a class of systems to produce synthetic input-output trajectories tailored to the query system by using its observed input-output sequence as context. The purpose is to augment training data and regularize model estimation, with validation used to balance contributions of real and synthetic data and to enable early stopping. This approach aims to mitigate overfitting while maintaining or improving predictive performance.
Literature Review
The authors situate their work within two strategies for limited data: data augmentation and synthetic data generation [1], [2]. Prior system identification contributions include: (i) perturbation of nonlinear FIR regressors with manifold regularization [3]; (ii) time-series augmentation via jittering and slicing for maritime maneuver modeling [4]. Physics-Informed Neural Networks (PINNs) integrate physical constraints via PDE residuals at collocation points and can be interpreted as training with synthetic data-derived regularization [5]. The proposed method builds on a class-modeling paradigm using Transformers for in-context learning of dynamical system classes [6], leveraging attention-based architectures [7] foundational to large language models [8], [9], with positional encoding adaptations for real-valued sequences [10]. Related work on efficient differentiation of linear dynamical blocks supports training [13], and early stopping aligns with regularization theory in neural networks [14].
Methodology
The method assumes the query system belongs to a broad class of dynamical systems for which a meta-model (an encoder-decoder Transformer) has been pre-trained using synthetic input-output sequences generated from simulators with randomly sampled configurations representative of the class. Pre-training minimizes an empirical supervised loss over many randomly sampled systems by splitting sequences into context and prediction segments and using mini-batch gradient descent. At inference, the available input-output training sequence from the query system is provided as context to the encoder. Given new input sequences drawn from the same distribution as the training inputs, the decoder generates synthetic output sequences, producing a potentially unbounded set of synthetic input-output trajectories for the query system. This can be viewed as knowledge transfer via zero-shot in-context learning: the meta-model infers system dynamics from the context and predicts outputs for new inputs.
Model estimation: A parametric model M(·; θ) (e.g., a Wiener-Hammerstein model with two LTI blocks and a static nonlinearity) is fitted by minimizing a composite loss that combines the empirical loss on real training data and on synthetic data, weighted by a nonnegative hyperparameter γ. The objective consists of the average loss over the real training samples plus γ times the average loss over synthetic samples. The expectation over synthetic data is approximated using q synthetic sequences per optimization iteration. New synthetic sequences are generated on the fly at each minibatch step, and parameters are optimized via stochastic gradient descent. A separate validation set is used to select γ (by grid search) and to implement early stopping, which prevents overfitting especially when γ is small or zero. The same validation data serve for both early stopping and hyperparameter selection, avoiding the need for additional data.
Implementation details in the example: The meta-model is a Transformer with 12 layers, model dimension 128, 4 attention heads, context length m = 400, and approximately 5.6M parameters, pre-trained on simulated SISO Wiener-Hammerstein systems with LTI order up to 10. Synthetic outputs are generated for inputs of length T = 200 matching the training input distribution. The parametric model mirrors the class structure (Wiener-Hammerstein with LTI order 10 and a 32-unit hidden-layer nonlinearity; 137 parameters). Optimization uses minibatch SGD (q = 1) with up to 6000 iterations, fast differentiation for the linear blocks, and early stopping.
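To make the training setup concrete, the following is a minimal PyTorch sketch of the composite-loss step under stated assumptions: the Wiener-Hammerstein model is simplified here to learnable FIR filters rather than the paper's order-10 transfer-function blocks, sample_input assumes a white-noise input distribution, and meta_model(u_ctx, y_ctx, u_syn) is a hypothetical interface to the pre-trained encoder-decoder Transformer, not the authors' actual API.

```python
import torch
import torch.nn as nn


class WienerHammerstein(nn.Module):
    """Minimal Wiener-Hammerstein sketch: LTI block -> static nonlinearity -> LTI block.

    For simplicity the two LTI blocks are learnable causal FIR filters; the paper
    instead uses order-10 transfer-function blocks with fast differentiation [13].
    """

    def __init__(self, n_fir=32, hidden=32):
        super().__init__()
        self.g1 = nn.Conv1d(1, 1, kernel_size=n_fir, padding=n_fir - 1)  # first LTI block
        self.f = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.g2 = nn.Conv1d(1, 1, kernel_size=n_fir, padding=n_fir - 1)  # second LTI block

    def forward(self, u):  # u: (batch, T, 1)
        T = u.shape[1]
        x = self.g1(u.transpose(1, 2))[..., :T].transpose(1, 2)  # causal FIR filtering
        z = self.f(x)                                            # static nonlinearity
        y = self.g2(z.transpose(1, 2))[..., :T].transpose(1, 2)
        return y


def sample_input(batch, T):
    # Placeholder: draw length-T inputs from the distribution used during training.
    # White noise is an assumption; the actual distribution is problem-specific.
    return torch.randn(batch, T, 1)


def composite_loss(wh_model, meta_model, u_real, y_real, u_ctx, y_ctx,
                   gamma=10.0, T=200, q=1):
    """Composite objective: MSE on real data + gamma * average MSE on synthetic data.

    At every minibatch step, q fresh synthetic sequences are generated on the fly
    by the frozen meta-model, which conditions on the query system's context
    (u_ctx, y_ctx) and predicts outputs for newly sampled inputs (zero-shot
    in-context learning).
    """
    loss_real = torch.mean((wh_model(u_real) - y_real) ** 2)

    loss_syn = 0.0
    for _ in range(q):
        u_syn = sample_input(u_real.shape[0], T)
        with torch.no_grad():                         # meta-model is not updated
            y_syn = meta_model(u_ctx, y_ctx, u_syn)   # hypothetical interface
        loss_syn = loss_syn + torch.mean((wh_model(u_syn) - y_syn) ** 2)

    return loss_real + gamma * loss_syn / q
```

In the paper's setup, γ is chosen by grid search on a hold-out validation set, which also drives early stopping; each optimizer step would evaluate composite_loss, backpropagate, and update only the parameters of wh_model.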
Key Findings
- Synthetic data acts as an effective regularizer: with γ = 0 (no synthetic data), validation MSE is about 5 times larger than training MSE, indicating overfitting. For γ ≥ 10, training and validation MSEs become similar, evidencing reduced overfitting.
- Using synthetic data (γ > 0) improves validation performance relative to γ = 0. However, when the synthetic loss contribution dominates the real-data loss by more than roughly an order of magnitude, validation performance degrades, likely due to epistemic uncertainty in the meta-model's outputs.
- Test performance over 100 Monte Carlo runs improves substantially with synthetic data: median R² increases from 0.889 (no synthetic data) to 0.956 (with synthetic data, γ selected by hold-out validation).
- The approach reliably enhances generalization in small-data settings for SISO Wiener-Hammerstein systems. The regularization effect is tunable via γ and benefits from early stopping.
Discussion
The study demonstrates that transferring knowledge from a pre-trained class meta-model to generate synthetic trajectories for a specific query system can mitigate overfitting when real data are scarce. By incorporating synthetic data into the training objective, the estimator balances fitting the noisy limited training set with patterns learned from analogous systems, improving generalization. The results show marked gains in validation and test accuracy (notably the increase in median R²), supporting the hypothesis that class-informed synthetic data enhances robustness. The sensitivity to the weighting γ highlights the need to calibrate the influence of synthetic data: moderate values reduce variance and prevent overfitting, while excessive reliance can introduce bias due to meta-model uncertainty. Overall, the findings are relevant for system identification scenarios where collecting long or diverse datasets is costly, offering a practical route to improved performance using class-level priors encoded by a Transformer.
Conclusion
The paper introduces a synthetic data generation and training framework for system identification that leverages a pre-trained Transformer meta-model of a system class. By conditioning on short query system datasets to generate additional trajectories, and by integrating these in a weighted training objective with early stopping, the method reduces overfitting and improves generalization. A numerical study on SISO Wiener-Hammerstein systems demonstrates significant gains, including an increase in median test R² from 0.889 to 0.956. Future research directions outlined by the authors include: (i) scaling and refining the meta-model to cover broader classes of dynamical systems; (ii) estimating uncertainty of the meta-model outputs to cast training with synthetic data as a maximum likelihood problem and to weight synthetic samples by reliability; (iii) integrating the meta-model output as a prior in Bayesian estimators (e.g., Gaussian Process Regression), especially in sparsely observed regions of the input space.
Limitations
- The effectiveness depends on the query system belonging to the class represented by the pre-trained meta-model; a mismatch can introduce bias.
- Synthetic data quality is limited by the epistemic uncertainty of the meta-model; excessive weighting of synthetic data degrades performance.
- The experimental meta-model targets SISO Wiener-Hammerstein systems with LTI order up to 10; generalization to other classes or to MIMO systems is not demonstrated.
- Evaluation uses a large, noise-free test set and relatively small training/validation sets; although intended to strengthen statistical conclusions, this setup may not reflect typical practical conditions.
- Hyperparameter tuning (γ) and early stopping rely on a validation set; performance is sensitive to this selection and to the chosen synthetic input distribution.