Combining data and theory for derivable scientific discovery with AI-Descartes

Mathematics

C. Cornelio, S. Dash, et al.

Dive into groundbreaking research that combines logical reasoning with symbolic regression to uncover the underlying laws of nature! This innovative method effectively reveals governing equations from limited experimental data, as demonstrated through iconic principles such as Kepler's third law and Einstein's time dilation. Conducted by a team of experts including Cristina Cornelio and Sanjeeb Dash, this work promises an exciting leap in scientific modeling.

~3 min • Beginner • English
Introduction
The work addresses how to discover symbolic models that are not only empirically accurate but derivable from background scientific theory expressed as logical axioms. Standard neural networks provide black-box predictors, and fixed-form regression prespecifies functional forms. Symbolic regression (SR) allows discovering interpretable functional forms from operators but often yields many plausible expressions that fit data yet may contradict known theory. Existing approaches typically constrain functional form or use invariances; however, incorporating general axiomatic background knowledge and proving derivability remains open. The authors propose integrating SR with a formal reasoning system to generate candidate laws from data, then prove (or refute) their derivability from axioms, aiming for models that are correct, predictive, and insightful with low complexity and bounded error.
Literature Review
Prior SR work identifies meaningful functions balancing accuracy and complexity (e.g., Schmidt and Lipson), but many expressions may satisfy such criteria without aligning with background theory. Previous integrations of ML with reasoning constrained outputs or functional forms (e.g., Marra et al. on GANs, Scott et al. and Ashok et al. combining ML tools with reasoning, Kubalík et al. using prior knowledge to augment data), yet did not incorporate general background-theory axioms and unmeasured variables. Automated theorem provers (ATPs) generally prove conjectures given a theory, with computational challenges (undecidability for some logics and difficulty with arithmetic/calculus). Machine learning has been used to guide ATP search (e.g., reinforcement learning), but practical tools to generate derivable models consistent with data from axioms have been lacking.
Methodology
The system formalizes discovery as finding a symbolic model y = f(x) that fits m data points, is derivable from a background theory B, has low complexity, and has bounded prediction error. Inputs form a 4-tuple (B, C, D, M): B is a set of domain-specific axioms (first-order logic with equality/inequality and arithmetic), assumed complete and consistent, implying a unique derivable function f_B for the target; C is a hypothesis class (grammar, invariance constraints, and functional constraints such as monotonicity and dimensional consistency); D is the dataset; M encodes modeler preferences (e.g., error bounds and complexity limits). A generalized notion of distance quantifies the compatibility of a learned model with the data and the background theory: for any candidate f, δ(f) measures distance to the data D, while β(f) measures distance to the derivable function implied by B (β(f) = 0 if f is derivable).
The SR module proposes multiple candidate expressions using a new mixed-integer nonlinear programming (MINLP)-based SR solver that selects from the operators {+, -, *, /, log, exp}, enforces an upper bound on expression complexity and on the number of free constants, imposes linear/nonlinear constraints (including dimensional consistency), and approximately solves multiple MINLP instances (subject to time limits) to minimize least-squares error. The reasoning module (KeYmaera X) checks derivability, computes reasoning-based distances, and can produce counterexamples for non-derivable models; Mathematica is used to analyze symbolic properties and verify functional constraints.
Reasoning-based quality measures include: (1) pointwise reasoning error and (2) generalization reasoning error, both defined via norms (e.g., l1, l∞) comparing f(X^i) to the values of a derivable function f_g implied by B at the dataset points; and (3) variable-dependence analysis, which extends variable domains by an order of magnitude to test the sensitivity of β over S′, identifying missing dependencies.
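The pointwise and generalization reasoning errors can be illustrated with a minimal sketch (an assumed simplification, not the paper's implementation): compare a candidate f to the theory-implied function f_g at the dataset points, and again over a domain extended by an order of magnitude. The function names, toy data, and candidate are illustrative.

```python
import math

def rel_err(f, f_g, points, norm="l_inf"):
    """Relative error between candidate f and theory-implied f_g over points."""
    errs = [abs(f(x) - f_g(x)) / abs(f_g(x)) for x in points]
    if norm == "l1":
        return sum(errs) / len(errs)
    return max(errs)  # l_inf

# Toy example: theory implies f_g(d) = sqrt(d**3) (Kepler in units where
# G*(m1+m2) = 4*pi**2); SR proposes a slightly off candidate.
def f_g(d):
    return math.sqrt(d ** 3)

def f(d):
    return 1.001 * d ** 1.5

data_points = [0.39, 0.72, 1.0, 1.52, 5.2]   # observed inputs
wide_points = [x * 10 for x in data_points]  # domain extended 10x

beta_point = rel_err(f, f_g, data_points)  # pointwise reasoning error
beta_gen = rel_err(f, f_g, wide_points)    # generalization reasoning error
print(beta_point, beta_gen)  # both ~0.001 for this candidate
```

Because the candidate differs from f_g by a constant factor, its error stays small on the extended domain; a candidate with a missing variable dependence would instead see β grow when the domain is stretched.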
Workflows: The discovery cycle generates hypotheses from data via SR, filters them by complexity, accuracy, and constraints, and then assesses derivability with the reasoning system against B, returning proofs or inconsistencies along with reasoning-based distances. If no candidate is derivable, the system suggests collecting more data or adding constraints/axioms.
Case-specific formulations:
- Kepler's third law: Target p = sqrt(4π^2 d^3 / (G(m1+m2))). Axioms K1–K7 encode the center of mass (m1 d1 = m2 d2), distance d = d1 + d2, gravitational force F_g = G m1 m2 / d^2, centrifugal force F_c = m2 d2 ω^2, force balance F_g = F_c, period p = 2π/ω, and positivity constraints. SR uses operators {+, -, ×, ÷}. Reasoning computes β measures and variable dependence.
- Relativistic time dilation: Target f/f0 = sqrt(1 - v^2/c^2). Data come from high-precision atomic clocks (Chou et al.); axioms from Behroozi and Smith are used. The system also tests an alternative "Newtonian" axiom replacing the constant c with sqrt(v^2 + c^2) for perpendicular emission, to assess theory discrimination.
- Langmuir adsorption: Target q = q_max K_a p / (1 + K_a p) and a two-site extension. Axioms L1–L5 encode site balance, adsorption/desorption rates, equilibrium, and mass balance on q, with positivity/non-negativity constraints. Because SR yields numeric constants, the proof uses existential quantification: replace numeric constants with logic variables c_i and prove ∃c_i: (C ∧ A) → (f ∧ C′), allowing the constants to be functions of K_ads, K_des, S0, etc., but not of the variables p or q. Additional desired properties κ (e.g., monotonicity, limiting behavior) are verified in Mathematica to prune candidates.
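The Kepler axioms above pin down the period as a function of d, m1, and m2. A quick numeric sanity check (a sketch, not the formal KeYmaera X proof; the Earth–Sun values are illustrative) shows that chaining K1–K6 reproduces the target formula:

```python
import math

G = 6.674e-11                              # gravitational constant (SI)
m1, m2, d = 1.989e30, 5.972e24, 1.496e11   # Sun mass, Earth mass, 1 AU

# K1, K2: center of mass m1*d1 = m2*d2 with d = d1 + d2  =>  d2
d2 = m1 * d / (m1 + m2)

# K3-K5: force balance G*m1*m2/d**2 = m2*d2*omega**2  =>  omega
omega = math.sqrt(G * m1 / (d ** 2 * d2))

# K6: period from angular velocity
p_axioms = 2 * math.pi / omega

# Target: Kepler's third law in two-body form
p_target = math.sqrt(4 * math.pi ** 2 * d ** 3 / (G * (m1 + m2)))

assert abs(p_axioms - p_target) / p_target < 1e-12
print(p_target / 86400)  # close to 365 days for the Earth-Sun pair
```

The algebra behind the check: substituting d2 = m1 d/(m1+m2) into the force balance gives ω² = G(m1+m2)/d³, and p = 2π/ω then yields the target law exactly.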
Key Findings
- Kepler's law: SR produced accurate candidate formulas, and the reasoning module distinguished them by derivability and generalization. For the solar-system dataset, p = √(0.1319 d^3) achieved low numerical error (ε ≈ 0.0129) and low pointwise reasoning error (β ≈ 0.0064), yet dependency analysis showed missing mass dependence (m1, m2). For exoplanets, p = 0.1319 d^3 / m1 had improved generalization reasoning error (≈0.0231) but still incomplete dependencies. For binary stars, the best formula, p = d^3 / (0.9967 m1 + m2), had small numerical error (ε ≈ 0.0058), very low generalization reasoning error (β_l∞ ≈ 0.0008), and correct dependencies on m1, m2, and d, indicating strong generalization and near-derivability consistent with Kepler's third law under Newtonian mechanics.
- Time dilation: Although the exact relativistic form was not rediscovered from the data, the reasoning module identified models that generalize extremely well. For instance, -0.00537 v^2/(1+v) achieved a small relative generalization reasoning error (≈0.0021) over a wide domain (37 ≤ v ≤ 10^9 m/s), and -0.00545 v^2/(1+√v) achieved ≈0.0010 over 37 ≤ v ≤ 10^7 m/s. Replacing the constant-c axiom with the "Newtonian" alternative yielded large reasoning errors (absolute generalization error > 1 and elevated pointwise errors) for all candidates, enabling the data-driven rejection of the alternative axiom system.
- Langmuir adsorption: On Langmuir's methane-on-mica data (90 K), a two-constant candidate f2: q = p/(0.00927 p + 0.0759) was logically derived (KeYmaera: Yes) and satisfied 5/5 desired properties, while a better-fitting two-constant candidate f1 was not derivable. On Sun et al.'s isobutane-on-silicalite data (277 K), g2: q = p/(0.709 p + 0.157) was also derivable (equivalent to the single-site Langmuir form; 5/5 properties). Four-constant candidates g5 and g7 were equivalent to the two-site Langmuir model, achieving low numerical errors and satisfying multiple constraints.
These results show the system can both identify derivable models and distinguish non-derivable but accurate fits.
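The Langmuir existential-quantification step has a simple algebraic core: rewriting q = q_max K_a p/(1 + K_a p) as q = p/(c1 p + c2) gives the witnesses c1 = 1/q_max and c2 = 1/(q_max K_a). A minimal sketch, recovering the physical parameters from the fitted constants of candidate f2 quoted above (the variable names are illustrative):

```python
# From f2: q = p / (0.00927*p + 0.0759), match against the single-site
# Langmuir form q = q_max*K_a*p / (1 + K_a*p), which rewrites to
# q = p / (c1*p + c2) with c1 = 1/q_max and c2 = 1/(q_max*K_a).
c1, c2 = 0.00927, 0.0759

q_max = 1.0 / c1   # saturation loading implied by f2
K_a = c1 / c2      # adsorption equilibrium constant implied by f2

def q_langmuir(p):
    """Single-site Langmuir isotherm with the recovered parameters."""
    return q_max * K_a * p / (1.0 + K_a * p)

def q_f2(p):
    """SR candidate f2 in its fitted form."""
    return p / (c1 * p + c2)

# The two forms agree at every pressure (up to floating-point rounding),
# so the existential proof obligation has concrete witnesses for c1, c2.
for p in (0.1, 1.0, 10.0, 100.0):
    assert abs(q_langmuir(p) - q_f2(p)) < 1e-9
print(q_max, K_a)
```

This is why only derivable candidates pass: the proof requires such witnesses to exist as functions of the theory's constants (K_ads, K_des, S0, ...), independent of p and q.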
Discussion
Integrating symbolic regression with formal logical reasoning yields models that are not only accurate but consistent with background scientific theory and that generalize beyond observed data. The reasoning module’s β-based measures help discriminate among similarly accurate SR candidates, identify missing variable dependencies, and provide counterexamples for non-derivable expressions. Across three domains—celestial mechanics, relativity, and surface adsorption—the approach either recovered known laws, found highly generalizable approximations, or formally proved derivability of SR-extracted formulas. The framework also enables using data to discriminate between competing axiom systems (e.g., rejecting a Newtonian alternative in place of relativistic axioms for time dilation). Enhanced integration with abductive reasoning and experimental design could further accelerate scientific discovery by suggesting missing axioms and informative experiments.
Conclusion
The paper presents AI-Descartes, an end-to-end system that combines a MINLP-based symbolic regression engine with an automated reasoning module to discover symbolic models that are provably derivable from background axioms while fitting experimental data with low complexity and error. It introduces reasoning-based distances to theory and data, and variable-dependence diagnostics to assess model quality. Demonstrations include rediscovering or closely approximating Kepler’s third law, identifying generalizable time-dilation models and rejecting incompatible axiom systems, and formally proving SR-derived Langmuir adsorption expressions via existential quantification of constants. Future work includes deeper integration of reasoning and regression (e.g., abduction), improved theorem-prover and SR solver scalability and expressivity, and incorporation of optimal experimental design to target data collection that most effectively discriminates hypotheses.
Limitations
- Assumes correctness and completeness of the background theory; missing axioms can hinder derivability proofs. A potential remedy is abductive reasoning to propose explanatory axioms consistent with the theory and sufficient to derive the observations.
- Limited availability of machine-readable axioms for the natural sciences; automated knowledge extraction from text (NL/LaTeX/HTML) remains immature for this purpose.
- Automated theorem proving faces computational complexity and undecidability in the relevant logics, with high runtime variance; difficulty increases with arithmetic and calculus operators. Current tools (e.g., KeYmaera X) have expressivity limits (ODEs only in time; no higher-order logic).
- Symbolic regression (MINLP- or GP-based) can be computationally expensive, with worst-case exponential scaling; solvers such as BARON cannot handle differential equations.
- Components such as experimental design and abduction are not fully integrated; deeper coupling could improve data collection and axiom completion.