Automatically discovering ordinary differential equations from data with sparse regression
K. Egan, W. Li, et al.
The paper addresses the challenge of automatically discovering governing nonlinear ordinary differential equations (ODEs) from data, a long-standing goal in science since the formulation of Newton’s laws. Although dynamical systems are widely used across disciplines (physics, chemistry, biology, neuroscience, epidemiology, ecology, environmental sciences), constructing accurate models typically requires substantial expert effort. The research question is how to infer interpretable dynamical laws from scarce and noisy observations without extensive manual tuning. The authors propose ARGOS, an automated methodology that integrates data denoising, sparse regression, and bootstrap inference to identify governing equations reliably. The study evaluates performance across ensembles of random initial conditions, varying time-series lengths, and different signal-to-noise ratios (SNRs), showing that ARGOS can consistently identify three-dimensional systems with moderately sized datasets and sufficiently high signal quality.
Data-driven discovery of governing equations has a history dating back to the 1980s with inverse problems and system identification, and has advanced through symbolic regression and sparse methods. The SINDy framework introduced sparse identification of nonlinear dynamics, selecting dominant terms from large candidate libraries and spawning numerous variants (Bayesian sparse regression, ensemble methods, neural-network integrations, and approaches for noisy data). SINDy with AIC aimed to automate model selection via sparsity thresholds and information criteria, but it relies on prior knowledge for validation, struggles to compute accurate derivatives from noisy data, and has been demonstrated only under limited conditions (specific initializations, sufficient observations, low noise). Many existing approaches employ Savitzky–Golay filtering to reduce noise and compute derivatives but require manual parameter choices, and robust variable selection remains critical. The authors position their contribution as an automated process that tunes Savitzky–Golay parameters, applies sparse regression, and uses bootstrap confidence intervals to establish governing terms, improving identification accuracy and efficiency in low-to-medium-noise conditions under sparsity and library-coverage assumptions.
Modeling with linear regression: Let x(t) in R^m be the state and f(x(t)) the dynamics. The authors approximate each component’s time-derivative by a sparse linear combination of candidate functions: ẋ_j ≈ Φ(x) β_j, where Φ(x) collects p candidate functions (monomials up to degree d and optional nonlinear functions such as trigonometric, logarithmic, or exponential). From measurements at times t_1,…,t_n, they build the state matrix X ∈ R^{n×m}, apply Savitzky–Golay (SG) filtering to each column to obtain smoothed states and numerical derivatives, and assemble the design matrix Θ(X) including constant, X, monomials X^[2], …, X^[d], and optional Φ(X). They pose a linear regression ẋ = Θ(X) B + E, estimating coefficients B.
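As a concrete illustration of assembling the design matrix Θ(X) from monomials up to degree d, here is a minimal sketch (function and variable names are illustrative, not from the paper; optional trigonometric or exponential terms would be appended as extra columns):

```python
import numpy as np
from itertools import combinations_with_replacement

def candidate_library(X, degree=3):
    """Build a design matrix Theta(X) of monomials up to total degree `degree`.

    X : (n, m) array of states sampled at n time points.
    Returns an (n, p) array with columns 1, x_1, ..., x_m, x_1^2, x_1 x_2, ...
    """
    n, m = X.shape
    cols = [np.ones(n)]  # constant term
    for d in range(1, degree + 1):
        # each multiset of variable indices defines one monomial of degree d
        for idx in combinations_with_replacement(range(m), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Example: 2D state, degree 2 -> columns 1, x, y, x^2, xy, y^2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Theta = candidate_library(X, degree=2)
```

Each row of Θ evaluates every candidate function at one time sample, so the regression ẋ_j ≈ Θ(X) β_j reduces equation discovery to selecting a sparse set of columns.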
ARGOS pipeline:
1) Smoothing and differentiation. Use Savitzky–Golay filtering with polynomial order o = 4 and a grid of window lengths l. For each state component x_j, select l minimizing the mean squared error between the noisy x_j and its smoothed counterpart, then compute numerical derivatives from the smoothed signal.
2) Initial sparse regression. For each equation j, solve ẋ_j = Θ(X) β_j + ε_j using either the lasso or the adaptive lasso.
3) Trim the design matrix. Identify the highest monomial order with a nonzero coefficient in the initial sparse estimate and drop all higher-order terms from Θ(X).
4) Thresholded model subsets and OLS refit. Re-apply the sparse method on the trimmed Θ(X), then use a grid of hard thresholds γ to form subsets containing coefficients whose absolute values exceed each threshold. Refit each subset by ordinary least squares (OLS) on the selected variables to remove shrinkage bias, and select the model minimizing BIC.
5) Bootstrap inference. Bootstrap the sparse-regression process on the trimmed Θ(X) to obtain 2000 sample estimates, construct 95% bootstrap confidence intervals, and select variables whose intervals exclude zero and whose point estimates lie within those intervals.
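The smoothing-and-differentiation step can be sketched with SciPy's `savgol_filter`; the window grid, noise level, and test signal below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_smooth_and_derivative(x, dt, windows, polyorder=4):
    """Select the SG window minimizing MSE between the noisy signal and its
    smoothed version (the summarized criterion), then return the smoothed
    signal and its numerical derivative."""
    best_w, best_mse = None, np.inf
    for w in windows:
        if w % 2 == 0 or w <= polyorder:  # SG windows must be odd and > polyorder
            continue
        mse = np.mean((x - savgol_filter(x, w, polyorder)) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    x_smooth = savgol_filter(x, best_w, polyorder)
    # deriv=1 with delta=dt returns d/dt of the local polynomial fit
    dx = savgol_filter(x, best_w, polyorder, deriv=1, delta=dt)
    return x_smooth, dx

# Illustration: recover d/dt sin(t) = cos(t) from slightly noisy samples.
t = np.linspace(0, 2 * np.pi, 500)
rng = np.random.default_rng(0)
x = np.sin(t) + 1e-3 * rng.standard_normal(t.size)
x_s, dx = sg_smooth_and_derivative(x, t[1] - t[0], windows=range(7, 52, 2))
```

Because differentiation amplifies noise, the quality of this step largely determines the response vector ẋ_j fed into the sparse regression that follows.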
Regularization and adaptive lasso details (Methods): Sparse regression is implemented by minimizing squared error plus a weighted ℓ_q penalty with tuning parameter λ and weights w_k: argmin_β { Σ_i (x_i − β_0 − Σ_k X_{ik} β_k)^2 + λ Σ_k w_k |β_k|^q }. The lasso uses q = 1 and w_k = 1; ridge uses q = 2. The adaptive lasso uses weights w_k = 1/|β̂_k| based on pilot estimates to approximate nonconvex penalties and enhance variable selection. To mitigate multicollinearity, ridge regression is first used to obtain stable pilot β̂ for the adaptive lasso. Tuning λ is done via glmnet with 10-fold cross-validation on a default grid, then refined to 100 points around the optimal range; the best λ is re-selected by cross-validation. After sparse selection, OLS refits the chosen variables for unbiased coefficients. Hyperparameters for SG (window length) are chosen by an automated grid search minimizing MSE; sparse regression thresholds γ are selected by minimizing BIC. The overall process is fully automated and repeated per equation with bootstrap sampling for uncertainty quantification.
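The ridge-pilot adaptive lasso with an OLS refit can be sketched via the standard column-rescaling trick; scikit-learn's `LassoCV` stands in for glmnet's two-stage λ grid here, and the synthetic data, seed, and tolerance constants are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

def adaptive_lasso_ols(Theta, y, cv=10):
    """Adaptive lasso with a ridge pilot, then an unbiased OLS refit.

    Rescaling column k by w_k = |beta_ridge_k| and running a plain lasso is
    equivalent to penalizing beta_k with weight 1/|beta_ridge_k|.
    """
    pilot = Ridge(alpha=1.0).fit(Theta, y).coef_      # stable pilot under multicollinearity
    w = np.abs(pilot) + 1e-8                          # avoid division-by-zero weights
    lasso = LassoCV(cv=cv, max_iter=50_000).fit(Theta * w, y)
    beta = lasso.coef_ * w                            # undo the rescaling
    support = np.flatnonzero(np.abs(beta) > 1e-8)
    beta_ols = np.zeros_like(beta)
    if support.size:                                  # OLS refit removes shrinkage bias
        beta_ols[support], *_ = np.linalg.lstsq(Theta[:, support], y, rcond=None)
    return beta_ols, support

# Synthetic check: recover a 2-term model out of 10 candidates.
rng = np.random.default_rng(1)
Theta = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[2], beta_true[5] = 2.0, -3.0
y = Theta @ beta_true + 0.01 * rng.standard_normal(200)
beta_hat, support = adaptive_lasso_ols(Theta, y)
```

Null variables receive near-zero pilot weights, so their rescaled columns are effectively excluded, which is how the adaptive weights sharpen variable selection relative to the plain lasso.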
- ARGOS achieved high success rates in identifying the correct governing terms across multiple benchmark ODEs, often exceeding 80% under moderate-to-high SNR and with moderately sized datasets. The evaluation used 100 random initial conditions per system, varying time-series lengths (n = 10^2, 10^3, 10^4, and n = 5000 for SNR sweeps) and SNR values (e.g., 1 dB, 25 dB, 49 dB, 61 dB).
- Linear systems were accurately represented with fewer than 800 data points and medium SNR, showing strong performance for straightforward dynamics.
- In two-dimensional nonlinear systems, ARGOS with lasso identified 3 of 5 systems with moderate data sizes or medium SNRs.
- In three-dimensional nonlinear systems (e.g., Rössler, Lorenz), the adaptive lasso variant of ARGOS achieved higher accuracy than competing algorithms, highlighting its suitability for higher-dimensional, nonlinear ODEs.
- Compared to SINDy with AIC, ARGOS more consistently discovered correct models, particularly in 3D systems, where SINDy’s OLS-based thresholding was sensitive to multicollinearity and prone to eliminating true terms early.
- Diagnostics showed that in noiseless settings (SNR = ∞) for the Lorenz system, residuals violated homoscedasticity, leading ARGOS to select ancillary terms; slight noise restored homoscedasticity and improved correct identification.
- Computational efficiency: With increasing n, ARGOS scaled more favorably than SINDy with AIC for the Lorenz system. While SINDy with AIC was faster on a 2D linear system at small n, its exhaustive model enumeration grew more expensive with dimension, whereas ARGOS maintained a similar growth rate across systems.
- Against Ensemble-SINDy (ESINDy), ARGOS avoided extensive hyperparameter tuning. ESINDy’s performance varied with n and SNR and required careful adjustment of multiple thresholds and inclusion probabilities, whereas ARGOS automatically tuned λ, threshold γ (via BIC), and smoothing parameters.
- The ARGOS pipeline uses 2000 bootstrap samples to form 95% confidence intervals, enabling inference-based variable selection that improved robustness to noise and reduced spurious inclusions.
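The bootstrap selection rule described above can be sketched as follows; a fixed lasso penalty stands in for the fully tuned pipeline, and 200 resamples (instead of the paper's 2000) keep the illustration cheap:

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_select(Theta, y, n_boot=2000, alpha_lasso=0.01, level=0.95, seed=0):
    """Keep variables whose bootstrap CI excludes zero and whose full-data
    point estimate lies inside that CI (the summarized selection rule)."""
    rng = np.random.default_rng(seed)
    n, p = Theta.shape
    point = Lasso(alpha=alpha_lasso, max_iter=50_000).fit(Theta, y).coef_
    betas = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                    # resample rows with replacement
        betas[b] = Lasso(alpha=alpha_lasso,
                         max_iter=50_000).fit(Theta[idx], y[idx]).coef_
    tail = (1 - level) / 2
    lo = np.quantile(betas, tail, axis=0)
    hi = np.quantile(betas, 1 - tail, axis=0)
    keep = ((lo > 0) | (hi < 0)) & (point >= lo) & (point <= hi)
    return np.flatnonzero(keep), point

# Synthetic check: two active terms out of eight candidates.
rng = np.random.default_rng(2)
Theta = rng.standard_normal((150, 8))
y = 1.5 * Theta[:, 1] - 2.0 * Theta[:, 6] + 0.05 * rng.standard_normal(150)
selected, point = bootstrap_select(Theta, y, n_boot=200)
```

Because inactive terms are set exactly to zero in most resamples, their intervals collapse onto zero and they are discarded, which is the mechanism behind the reduced spurious inclusions noted above.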
The findings show that statistical inference integrated with automated smoothing and sparse regression enables reliable discovery of governing equations from observational data. ARGOS addresses derivative estimation (via automated Savitzky–Golay filtering), variable selection (via lasso/adaptive lasso with cross-validated λ and BIC-based thresholding), and uncertainty quantification (via bootstrap confidence intervals). The method outperforms SINDy with AIC in a broad set of systems and conditions, particularly in three dimensions, likely because ARGOS mitigates multicollinearity and avoids brittle hard-thresholding on unstable OLS estimates. The analysis highlights the importance of data quality and quantity: identification accuracy generally increases with n and SNR. In deterministic (noiseless) scenarios, violations of regression assumptions (homoscedasticity) can lead to ancillary term selection; however, small amounts of noise restore the assumptions and lead to correct discovery. Compared with ESINDy, ARGOS’s automatic hyperparameter tuning and inference-oriented selection reduce the need for manual, system-specific tuning while maintaining strong performance across varying conditions.
ARGOS provides an automated, inference-based approach to discover interpretable ODE models from noisy, limited data by combining Savitzky–Golay denoising/differentiation, sparse regression (lasso/adaptive lasso), BIC-based threshold selection, OLS refitting, and bootstrap confidence intervals. The method consistently identifies correct models across a range of canonical systems and conditions, outperforming SINDy with AIC—especially in three-dimensional settings—and offering more stable scaling with data size. The approach underscores the value of rigorous statistical inference in data-driven equation discovery. Future directions include extending candidate libraries to ensure coverage of governing terms, improving handling of deterministic regimes where regression assumptions fail, reducing computational cost of bootstrap inference for high-dimensional and long time series, and exploring integration with complementary machine learning tools for broader applicability.
- Library coverage: Correct recovery is only possible if the true active terms are present in the design matrix (a fundamental limitation of regression-based identification).
- Data requirements: Performance improves with sufficient observations and moderate-to-high SNR; low n and low SNR degrade identification.
- Noiseless data issue: In deterministic settings, the homoscedasticity assumption is violated, leading to selection of ancillary terms; slight noise can mitigate this.
- Computational cost: Bootstrap sampling (e.g., 2000 resamples) becomes computationally demanding as n and dimensionality increase, potentially limiting real-time applicability.
- Multicollinearity: While mitigated via ridge pilots and adaptive lasso, multicollinearity can still challenge variable selection, particularly in high-dimensional libraries.