Automatically discovering ordinary differential equations from data with sparse regression
K. Egan, W. Li, et al.
The paper addresses the challenge of automatically discovering governing nonlinear ordinary differential equations (ODEs) from data, a long-standing goal in science since the formulation of Newton’s laws. Although dynamical systems are widely used across disciplines (physics, chemistry, biology, neuroscience, epidemiology, ecology, environmental sciences), constructing accurate models typically requires substantial expert effort. The research question is how to infer interpretable dynamical laws from scarce and noisy observations without extensive manual tuning. The authors propose ARGOS, an automated methodology that integrates data denoising, sparse regression, and bootstrap inference to identify governing equations reliably. The study evaluates performance across ensembles of random initial conditions, varying time-series lengths, and different signal-to-noise ratios (SNRs), showing that ARGOS can consistently identify three-dimensional systems with moderately sized datasets and sufficiently high signal quality.
Data-driven discovery of governing equations has a history dating back to the 1980s with inverse problems and system identification, and has advanced through symbolic regression and sparse methods. The SINDy framework introduced sparse identification of nonlinear dynamics, selecting dominant terms from large candidate libraries and spawning numerous variants (Bayesian sparse regression, ensemble methods, neural-network integrations, and approaches for noisy data). SINDy with AIC aimed to automate model selection via sparsity thresholds and information criteria, but it relies on prior knowledge for validation, struggles to compute accurate derivatives from noisy data, and has been demonstrated only under limited conditions (specific initializations, sufficient observations, low noise). Many existing approaches employ Savitzky–Golay filtering to reduce noise and compute derivatives but require manual parameter choices, and robust variable selection remains critical. The authors position their contribution as an automated process that tunes Savitzky–Golay parameters, applies sparse regression, and uses bootstrap confidence intervals to establish governing terms, improving identification accuracy and efficiency in low-to-medium-noise conditions under sparsity and library-coverage assumptions.
Modeling with linear regression: Let x(t) in R^m be the state and f(x(t)) the dynamics. The authors approximate each component’s time-derivative by a sparse linear combination of candidate functions: ẋ_j ≈ Φ(x) β_j, where Φ(x) collects p candidate functions (monomials up to degree d and optional nonlinear functions such as trigonometric, logarithmic, or exponential). From measurements at times t_1,…,t_n, they build the state matrix X ∈ R^{n×m}, apply Savitzky–Golay (SG) filtering to each column to obtain smoothed states and numerical derivatives, and assemble the design matrix Θ(X) including constant, X, monomials X^[2], …, X^[d], and optional Φ(X). They pose a linear regression ẋ = Θ(X) B + E, estimating coefficients B.
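As a concrete illustration of assembling the design matrix Θ(X) from monomials up to degree d, here is a minimal sketch (function and variable names are illustrative, not from the paper; optional trigonometric or exponential terms would be appended as extra columns):

```python
import numpy as np
from itertools import combinations_with_replacement

def candidate_library(X, degree=3):
    """Build a design matrix Theta(X) of monomials up to total degree `degree`.

    X : (n, m) array of states sampled at n time points.
    Returns an (n, p) array with columns 1, x_1, ..., x_m, x_1^2, x_1 x_2, ...
    """
    n, m = X.shape
    cols = [np.ones(n)]  # constant term
    for d in range(1, degree + 1):
        # each multiset of variable indices defines one monomial of degree d
        for idx in combinations_with_replacement(range(m), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Example: 2D state, degree 2 -> columns 1, x, y, x^2, xy, y^2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Theta = candidate_library(X, degree=2)
```

Each row of Θ evaluates every candidate function at one time sample, so the regression ẋ_j ≈ Θ(X) β_j reduces equation discovery to selecting a sparse set of columns.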
ARGOS pipeline:
1) Smoothing and differentiation. Use Savitzky–Golay filtering with polynomial order o = 4 and a grid of window lengths l. For each state component x_j, select l minimizing the mean squared error between the noisy x_j and its smoothed counterpart, then compute numerical derivatives from the smoothed signal.
2) Initial sparse regression. For each equation j, solve ẋ_j = Θ(X) β_j + ε_j using either the lasso or the adaptive lasso.
3) Trim the design matrix. Identify the highest monomial order with a nonzero coefficient in the initial sparse estimate and drop all higher-order terms from Θ(X).
4) Thresholded model subsets and OLS refit. Re-apply the sparse method on the trimmed Θ(X), then use a grid of hard thresholds γ to form subsets containing coefficients whose absolute values exceed each threshold. Refit each subset by ordinary least squares (OLS) on the selected variables to remove shrinkage bias, and select the model minimizing BIC.
5) Bootstrap inference. Bootstrap the sparse-regression process on the trimmed Θ(X) to obtain 2000 sample estimates, construct 95% bootstrap confidence intervals, and select variables whose intervals exclude zero and whose point estimates lie within those intervals.
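The smoothing-and-differentiation step can be sketched with SciPy's `savgol_filter`; the window grid, noise level, and test signal below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_smooth_and_derivative(x, dt, windows, polyorder=4):
    """Select the SG window minimizing MSE between the noisy signal and its
    smoothed version (the summarized criterion), then return the smoothed
    signal and its numerical derivative."""
    best_w, best_mse = None, np.inf
    for w in windows:
        if w % 2 == 0 or w <= polyorder:  # SG windows must be odd and > polyorder
            continue
        mse = np.mean((x - savgol_filter(x, w, polyorder)) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    x_smooth = savgol_filter(x, best_w, polyorder)
    # deriv=1 with delta=dt returns d/dt of the local polynomial fit
    dx = savgol_filter(x, best_w, polyorder, deriv=1, delta=dt)
    return x_smooth, dx

# Illustration: recover d/dt sin(t) = cos(t) from slightly noisy samples.
t = np.linspace(0, 2 * np.pi, 500)
rng = np.random.default_rng(0)
x = np.sin(t) + 1e-3 * rng.standard_normal(t.size)
x_s, dx = sg_smooth_and_derivative(x, t[1] - t[0], windows=range(7, 52, 2))
```

Because differentiation amplifies noise, the quality of this step largely determines the response vector ẋ_j fed into the sparse regression that follows.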
Regularization and adaptive lasso details (Methods): Sparse regression is implemented by minimizing squared error plus a weighted ℓ_q penalty with tuning parameter λ and weights w_k: argmin_β { Σ_i (x_i − β_0 − Σ_k X_{ik} β_k)^2 + λ Σ_k w_k |β_k|^q }. The lasso uses q = 1 and w_k = 1; ridge uses q = 2. The adaptive lasso uses weights w_k = 1/|β̂_k| based on pilot estimates to approximate nonconvex penalties and enhance variable selection. To mitigate multicollinearity, ridge regression is first used to obtain stable pilot β̂ for the adaptive lasso. Tuning λ is done via glmnet with 10-fold cross-validation on a default grid, then refined to 100 points around the optimal range; the best λ is re-selected by cross-validation. After sparse selection, OLS refits the chosen variables for unbiased coefficients. Hyperparameters for SG (window length) are chosen by an automated grid search minimizing MSE; sparse regression thresholds γ are selected by minimizing BIC. The overall process is fully automated and repeated per equation with bootstrap sampling for uncertainty quantification.
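The ridge-pilot adaptive lasso with an OLS refit can be sketched via the standard column-rescaling trick; scikit-learn's `LassoCV` stands in for glmnet's two-stage λ grid here, and the synthetic data, seed, and tolerance constants are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

def adaptive_lasso_ols(Theta, y, cv=10):
    """Adaptive lasso with a ridge pilot, then an unbiased OLS refit.

    Rescaling column k by w_k = |beta_ridge_k| and running a plain lasso is
    equivalent to penalizing beta_k with weight 1/|beta_ridge_k|.
    """
    pilot = Ridge(alpha=1.0).fit(Theta, y).coef_      # stable pilot under multicollinearity
    w = np.abs(pilot) + 1e-8                          # avoid division-by-zero weights
    lasso = LassoCV(cv=cv, max_iter=50_000).fit(Theta * w, y)
    beta = lasso.coef_ * w                            # undo the rescaling
    support = np.flatnonzero(np.abs(beta) > 1e-8)
    beta_ols = np.zeros_like(beta)
    if support.size:                                  # OLS refit removes shrinkage bias
        beta_ols[support], *_ = np.linalg.lstsq(Theta[:, support], y, rcond=None)
    return beta_ols, support

# Synthetic check: recover a 2-term model out of 10 candidates.
rng = np.random.default_rng(1)
Theta = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[2], beta_true[5] = 2.0, -3.0
y = Theta @ beta_true + 0.01 * rng.standard_normal(200)
beta_hat, support = adaptive_lasso_ols(Theta, y)
```

Null variables receive near-zero pilot weights, so their rescaled columns are effectively excluded, which is how the adaptive weights sharpen variable selection relative to the plain lasso.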
- ARGOS achieved high success rates in identifying the correct governing terms across multiple benchmark ODEs, often exceeding 80% under moderate-to-high SNR and with moderately sized datasets. The evaluation used 100 random initial conditions per system, varying time-series lengths (n = 10^2, 10^3, 10^4, and n = 5000 for SNR sweeps) and SNR values (e.g., 1 dB, 25 dB, 49 dB, 61 dB).
- Linear systems were accurately represented with fewer than 800 data points and medium SNR, showing strong performance for straightforward dynamics.
- In two-dimensional nonlinear systems, ARGOS with lasso identified 3 of 5 systems with moderate data sizes or medium SNRs.
- In three-dimensional nonlinear systems (e.g., Rössler, Lorenz), the adaptive lasso variant of ARGOS achieved higher accuracy than competing algorithms, highlighting its suitability for higher-dimensional, nonlinear ODEs.
- Compared to SINDy with AIC, ARGOS more consistently discovered correct models, particularly in 3D systems, where SINDy’s OLS-based thresholding was sensitive to multicollinearity and prone to eliminating true terms early.
- Diagnostics showed that in noiseless settings (SNR = ∞) for the Lorenz system, residuals violated homoscedasticity, leading ARGOS to select ancillary terms; slight noise restored homoscedasticity and improved correct identification.
- Computational efficiency: With increasing n, ARGOS scaled more favorably than SINDy with AIC for the Lorenz system. While SINDy with AIC was faster on a 2D linear system at small n, its exhaustive model enumeration grew more expensive with dimension, whereas ARGOS maintained a similar growth rate across systems.
- Against Ensemble-SINDy (ESINDy), ARGOS avoided extensive hyperparameter tuning. ESINDy’s performance varied with n and SNR and required careful adjustment of multiple thresholds and inclusion probabilities, whereas ARGOS automatically tuned λ, threshold γ (via BIC), and smoothing parameters.
- The ARGOS pipeline uses 2000 bootstrap samples to form 95% confidence intervals, enabling inference-based variable selection that improved robustness to noise and reduced spurious inclusions.
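The bootstrap selection rule described above can be sketched as follows; a fixed lasso penalty stands in for the fully tuned pipeline, and 200 resamples (instead of the paper's 2000) keep the illustration cheap:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_select(Theta, y, n_boot=2000, alpha_lasso=0.01, level=0.95, seed=0):
    """Keep variables whose bootstrap CI excludes zero and whose full-data
    point estimate lies inside that CI (the summarized selection rule)."""
    rng = np.random.default_rng(seed)
    n, p = Theta.shape
    point = Lasso(alpha=alpha_lasso, max_iter=50_000).fit(Theta, y).coef_
    betas = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                    # resample rows with replacement
        betas[b] = Lasso(alpha=alpha_lasso,
                         max_iter=50_000).fit(Theta[idx], y[idx]).coef_
    tail = (1 - level) / 2
    lo = np.quantile(betas, tail, axis=0)
    hi = np.quantile(betas, 1 - tail, axis=0)
    keep = ((lo > 0) | (hi < 0)) & (point >= lo) & (point <= hi)
    return np.flatnonzero(keep), point

# Synthetic check: two active terms out of eight candidates.
rng = np.random.default_rng(2)
Theta = rng.standard_normal((150, 8))
y = 1.5 * Theta[:, 1] - 2.0 * Theta[:, 6] + 0.05 * rng.standard_normal(150)
selected, point = bootstrap_select(Theta, y, n_boot=200)
```

Because inactive terms are set exactly to zero in most resamples, their intervals collapse onto zero and they are discarded, which is the mechanism behind the reduced spurious inclusions noted above.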
The findings show that statistical inference integrated with automated smoothing and sparse regression enables reliable discovery of governing equations from observational data. ARGOS addresses derivative estimation (via automated Savitzky–Golay filtering), variable selection (via lasso/adaptive lasso with cross-validated λ and BIC-based thresholding), and uncertainty quantification (via bootstrap confidence intervals). The method outperforms SINDy with AIC in a broad set of systems and conditions, particularly in three dimensions, likely because ARGOS mitigates multicollinearity and avoids brittle hard-thresholding on unstable OLS estimates. The analysis highlights the importance of data quality and quantity: identification accuracy generally increases with n and SNR. In deterministic (noiseless) scenarios, violations of regression assumptions (homoscedasticity) can lead to ancillary term selection; however, small amounts of noise restore the assumptions and lead to correct discovery. Compared with ESINDy, ARGOS’s automatic hyperparameter tuning and inference-oriented selection reduce the need for manual, system-specific tuning while maintaining strong performance across varying conditions.
ARGOS provides an automated, inference-based approach to discover interpretable ODE models from noisy, limited data by combining Savitzky–Golay denoising/differentiation, sparse regression (lasso/adaptive lasso), BIC-based threshold selection, OLS refitting, and bootstrap confidence intervals. The method consistently identifies correct models across a range of canonical systems and conditions, outperforming SINDy with AIC—especially in three-dimensional settings—and offering more stable scaling with data size. The approach underscores the value of rigorous statistical inference in data-driven equation discovery. Future directions include extending candidate libraries to ensure coverage of governing terms, improving handling of deterministic regimes where regression assumptions fail, reducing computational cost of bootstrap inference for high-dimensional and long time series, and exploring integration with complementary machine learning tools for broader applicability.
- Library coverage: Correct recovery is only possible if the true active terms are present in the design matrix (a fundamental limitation of regression-based identification).
- Data requirements: Performance improves with sufficient observations and moderate-to-high SNR; low n and low SNR degrade identification.
- Noiseless data issue: In deterministic settings, the homoscedasticity assumption is violated, leading to selection of ancillary terms; slight noise can mitigate this.
- Computational cost: Bootstrap sampling (e.g., 2000 resamples) becomes computationally demanding as n and dimensionality increase, potentially limiting real-time applicability.
- Multicollinearity: While mitigated via ridge pilots and adaptive lasso, multicollinearity can still challenge variable selection, particularly in high-dimensional libraries.