Electronic structure prediction of multi-million atom systems through uncertainty quantification enabled transfer learning

S. Pathrudkar, P. Thiagarajan, et al.

This research, conducted by Shashank Pathrudkar, Ponkrshnan Thiagarajan, Shivang Agarwal, Amartya S. Banerjee, and Susanta Ghosh, addresses the cubic-scaling bottleneck of Kohn-Sham Density Functional Theory simulations by combining transfer learning with Bayesian neural networks. The approach enables confident predictions of material properties at multi-million-atom scales with limited computational resources.
Introduction

Kohn-Sham DFT is the standard tool for predicting electronic structure, but its cost scales cubically with system size, limiting simulations of large and complex materials. Accurately predicting the ground-state electron density is valuable because it determines many ground-state properties and underpins excited-state analyses. Existing linear- or subquadratic-scaling methods often have limited applicability, face convergence issues, or remain restricted to a few thousand atoms, and approaches that reduce the prefactor of cubic-scaling diagonalization still cannot support routine large-scale studies on modest resources.

Machine learning offers a surrogate pathway: prior work has targeted energies and forces (interatomic potentials) or the electron density itself. Predicting the density directly is attractive because it contains richer information than energies and forces alone and can enable electronic-structure-aware models. Two main ML strategies exist for density prediction: compact atom-centered basis expansions (which require species-specific basis optimization) and grid-point predictions (which are computationally heavier at inference). The latter has shown strong results for bulk materials and metals and is adopted here.

The research question is how to build ML models that (i) predict electron densities accurately for bulk metallic and semiconducting systems across scales and configurations, (ii) provide trustworthy uncertainty quantification, and (iii) drastically reduce the data-generation cost incurred by KS-DFT. The proposed solution combines thermalization-driven sampling, simple invariant descriptors, Bayesian neural networks for uncertainty quantification, and transfer learning to leverage abundant small-system data alongside limited large-system data.

Literature Review

Prior work has explored ML surrogates for electronic structure. ML interatomic potentials learned from DFT energies and forces enable molecular dynamics at ab initio accuracy. Direct ML prediction of the electron density has used two output representations: atom-centered basis expansions (transferable, but requiring basis optimization) and grid-point-wise predictions (more general, but heavier at inference). Equivariant neural networks have been used to encode symmetry, while alternative approaches use symmetry-invariant descriptors for scalar properties such as the density. Several linear- or subquadratic-scaling DFT methods exist and have demonstrated large problem sizes, but with constraints on system types or high computational cost. Recent ML works demonstrated density prediction for large periodic systems with substantial speedups, but typically relied on large-system data and lacked systematic uncertainty quantification. Efforts to learn exchange-correlation functionals/potentials are related but orthogonal to the present goal of density prediction.

Methodology

Data generation: Ab initio molecular dynamics (AIMD) was performed with the SPARC finite-difference DFT code using GGA-PBE and ONCV pseudopotentials. The mesh spacing was 0.25 Bohr for Al and 0.4 Bohr for SiGe, with an SCF tolerance of 1e-5 and Periodic-Pulay acceleration. An NVT Nosé–Hoover thermostat was used with Fermi–Dirac smearing at an electronic temperature of 631.554 K and a 1 fs time step. Snapshots were sampled with temporal spacing to reduce correlation. Temperatures spanned 315 K to near twice the melting point (Al up to ~1866 K; SiGe up to ~2600 K). Training used multi-scale data: Al (32-atom, then 108-atom cells) and SiGe (64-atom, then 216-atom cells). Additional DFT datasets were generated for defects (mono-/di-vacancies, edge/screw dislocations, grain boundaries) and volumetric strains (Al, up to ±5%). Random bulk disordered SiGe alloys were created consistent with composition.

Descriptors: For each grid point, the local atomic neighborhood is encoded using scalar-product-based invariant features. Set I comprises the distances from the grid point to the M nearest atoms. Set II comprises the cosines of the angles subtended at the grid point by atom pairs drawn from a subset M_s of the nearest atoms and their k nearest neighbors, yielding N_setII = M_s × k features. Features are sorted, ensuring invariance to rotation, translation, and permutation. Higher-order scalar products were not used, as they did not improve accuracy.

Descriptor selection: A convergence analysis determines the optimal counts N_setI (by increasing M) and N_setII (by varying M_s and k) to balance accuracy and computational cost, guided by nearsightedness and screening, which limit the range of influential atoms. The total descriptor count is minimized subject to convergence of the test RMSE.

Model: A Bayesian neural network (BNN) maps descriptors to the electron density at each grid point. Variational inference with a tractable variational posterior q(w|θ) is used; training minimizes the standard KL-regularized evidence lower bound, balancing prior complexity against data fit. A heteroscedastic Gaussian likelihood is adopted, so the BNN outputs both the mean density and a learned, spatially varying observation noise σ(x) that captures aleatoric uncertainty.

Uncertainty quantification: Total predictive uncertainty is decomposed into an aleatoric part (the learned σ²(x)) and an epistemic part (parameter uncertainty, estimated by sampling from q(w|θ) and computing the variance across predictions). Reported intervals use ±3σ_total.

Transfer learning: The network is first trained on abundant small-system data. Early layers are then frozen and later layers fine-tuned with limited large-system data to correct the boundary-condition and configurational limitations inherent to small cells. Probability distributions of the descriptors guide the TL design and the selection of the maximal necessary training-system size.

Inference and scaling: For a query structure, descriptors are computed for all grid points, passed through the BNN to obtain point-wise densities and uncertainties, and aggregated to form the full field. Because only a fixed number of nearby atoms influence each grid point and the grid count grows linearly with system size at fixed mesh spacing, inference scales as O(N) with atom count.

Postprocessing: Predicted densities are rescaled to enforce the total electron count and used to set up a KS Hamiltonian in SPARC. A single diagonalization and the Harris–Foulkes functional yield total energies, enabling comparisons of energies and their derivatives (lattice parameter, bulk modulus) against DFT.
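To make the descriptor construction concrete, here is a minimal NumPy sketch of the Set I / Set II features for a single grid point. The names (grid_point, atom_positions) and default values of M, M_s, and k are illustrative, and periodic images are ignored for brevity; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def descriptors(grid_point, atom_positions, M=60, M_s=6, k=5):
    """Invariant descriptors for one grid point (illustrative M, M_s, k;
    the paper fixes them by a convergence study).

    Set I : sorted distances to the M nearest atoms.
    Set II: cosines of the angles subtended at the grid point by each of
            the M_s nearest atoms paired with that atom's k nearest
            neighbors, sorted within each group (M_s * k features).
    Sorting gives permutation invariance; distances and angles are
    already rotation- and translation-invariant.
    """
    vecs = atom_positions - grid_point            # grid point -> atom vectors
    dists = np.linalg.norm(vecs, axis=1)
    order = np.argsort(dists)

    set_I = dists[order[:M]]                      # sorted ascending

    unit = vecs / np.clip(dists, 1e-12, None)[:, None]
    set_II = []
    for i in order[:M_s]:
        # k nearest neighbors of atom i (index 0 is atom i itself, skipped)
        nn = np.argsort(np.linalg.norm(atom_positions - atom_positions[i],
                                       axis=1))[1:k + 1]
        cosines = unit[nn] @ unit[i]              # cos of angle at the grid point
        set_II.append(np.sort(cosines))
    return np.concatenate([set_I] + set_II)
```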
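The uncertainty decomposition can likewise be illustrated with a short Monte Carlo sketch. It assumes a bnn whose forward pass draws one weight sample from q(w|θ) and returns a predicted mean and log-variance per grid point; this interface is hypothetical, not the authors' API.

```python
import torch

@torch.no_grad()
def predict_with_uncertainty(bnn, x, n_samples=50):
    """Monte Carlo estimate of the predictive mean and uncertainty.

    Each call bnn(x) is assumed to draw fresh weights w ~ q(w|theta)
    and return (mu, log_sigma2). Total variance decomposes as
        E_w[sigma^2(x)]   (aleatoric, learned observation noise)
      + Var_w[mu(x; w)]   (epistemic, spread across weight samples).
    """
    mus, sig2s = [], []
    for _ in range(n_samples):
        mu, log_sig2 = bnn(x)
        mus.append(mu)
        sig2s.append(log_sig2.exp())
    mus = torch.stack(mus)
    aleatoric = torch.stack(sig2s).mean(dim=0)
    epistemic = mus.var(dim=0)
    total_sigma = (aleatoric + epistemic).sqrt()
    # The paper reports +/- 3 * sigma_total intervals around the mean.
    return mus.mean(dim=0), total_sigma
```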
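The freeze-and-fine-tune step might look like the following PyTorch sketch, assuming the pretrained network exposes an ordered model.layers container (an assumption for illustration). The KL regularizer of the ELBO is omitted for brevity; only the heteroscedastic Gaussian negative log-likelihood is minimized.

```python
import torch

def fine_tune(model, loader_large, n_frozen=2, epochs=20, lr=1e-4):
    """Transfer-learning sketch: freeze early layers, refit the rest.

    The first n_frozen layers, pretrained on abundant small-cell data,
    are frozen; the remaining layers are fine-tuned on the limited
    large-cell data in loader_large.
    """
    for layer in model.layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad = False

    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    nll = torch.nn.GaussianNLLLoss()          # heteroscedastic Gaussian likelihood
    for _ in range(epochs):
        for x, y in loader_large:
            mu, log_sig2 = model(x)
            loss = nll(mu, y, log_sig2.exp())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```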
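Finally, the electron-count constraint in postprocessing amounts to a simple normalization of the predicted field; a sketch for a uniform real-space grid, where voxel_volume is the grid-cell volume (names illustrative):

```python
import numpy as np

def rescale_density(rho, n_electrons, voxel_volume):
    """Scale a predicted density so it integrates to the exact electron
    count before the KS Hamiltonian is assembled."""
    total = rho.sum() * voxel_volume   # quadrature on a uniform grid
    return rho * (n_electrons / total)
```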

Key Findings
  • Accuracy on large systems beyond training size: For Al (trained on 32 and 108 atoms), accurate predictions were achieved for a 1372-atom cell at 631 K. For SiGe (trained on 64 and 216 atoms), accurate predictions were achieved for a 512-atom Si0.6Ge0.4 cell at 2300 K. Reported RMSE values on these large tests confirm high fidelity beyond training scale.
  • Defects and strains: The model generalizes to mono-/di-vacancies, grain boundaries, and edge/screw dislocations in Al, and to vacancies in SiGe, despite no defect data in training. Electron-density error magnitudes (L1 norm per electron) remained low, and normalized RMSEs were small. Including a single defect-containing snapshot in training markedly reduced error and uncertainty localized at the defect.
  • Composition generalization: Models trained only on equiatomic Si0.5Ge0.5 generalized well to nearby compositions (e.g., x = 0.4–0.6), with degradation as compositions moved further from training, consistent with expectations.
  • Energies and derived properties: Postprocessed total energies from ML-predicted densities are well within chemical accuracy; errors are generally below about 1e-4 Ha per atom and within the 1.6 mHa per atom chemical-accuracy threshold across tested systems, including defects and composition variations. Lattice parameter predictions are accurate to ~0.01 Bohr or better. The bulk modulus for an Al 3×3×3 supercell is 76.39 GPa (DFT) vs 75.80 GPa (ML), within ~1%.
  • Transfer learning efficacy: TL reduced the test RMSE on a 256-atom Al system by about 50% compared to a model trained only on small-system data. It also reduced training data generation time by roughly 55% relative to a non-TL approach requiring more large-system data. Similar savings were observed for SiGe.
  • Computational scaling and speedups: Inference scales linearly with system size versus cubic scaling for DFT. For ~500 atoms, wall times are over two orders of magnitude lower than KS-DFT on the same CPU resources. Descriptor calculation and ML inference maintain O(N) behavior.
  • Million-atom predictions with quantified uncertainty: The model predicts electron densities (with uncertainty) for ~4.1 million-atom Al and ~1.4 million-atom SiGe systems. The magnitude of total uncertainty for these systems is comparable to smaller, validated systems, enabling confident use where DFT is infeasible.

Discussion

The results demonstrate that combining thermalization-based sampling, invariant descriptors, and Bayesian neural networks yields accurate, transferable electron-density surrogates for metals and semiconductors across scales and configurations. Decomposition of uncertainty into aleatoric and epistemic parts clarifies the sources of error: higher aleatoric uncertainty appears near nuclei and defects due to data variability, while epistemic uncertainty highlights data paucity (e.g., at rare configurations like vacancies). Adding even a single defect-containing snapshot reduces localized errors and epistemic uncertainty, illustrating how UQ guides targeted data acquisition. Transfer learning effectively bridges small- and large-cell physics by retaining generalizable features learned from abundant small-scale data while correcting large-scale effects with minimal expensive data, reducing overall DFT data-generation cost by more than half. Accurate postprocessed energies, lattice parameters, and bulk moduli verify that learned densities preserve key physics. Linear scaling and large speedups over DFT enable predictions for multi-million-atom systems, with UQ providing confidence bounds where direct validation is not possible. Together, these findings validate a practical pathway for routine electronic-structure-aware modeling of large, complex bulk materials on modest computational resources.

Conclusion

This work presents an uncertainty-quantified, transfer-learning-enabled ML framework that predicts ground-state electron densities from simple, symmetry-invariant descriptors with linear scaling. It achieves high accuracy across sizes, defects, and alloy compositions, reproduces energies and derived properties within chemical accuracy, and delivers more than 100× speedups at a few hundred atoms. Transfer learning cuts training data generation costs by over 50% while improving accuracy, and Bayesian UQ offers reliable confidence intervals and diagnostic insight into data biases and coverage. The approach enables confident predictions for systems with millions of atoms, far beyond routine KS-DFT. Future directions include active learning driven by UQ to further minimize data requirements, extension to more complex and compositionally rich materials, and exploration of applicability to molecules across chemical space.

Limitations
  • The approach relies on KS-DFT data quality, including functional and pseudopotential choices; aleatoric uncertainty captures but does not remove such noise/approximations.
  • Transfer learning still requires some large-system DFT data; its effectiveness is bounded by the largest practically simulated system.
  • Generalization degrades as test conditions move far from the training distribution (e.g., alloy compositions far from 50-50); targeted data addition may be needed.
  • Data near rare configurations (e.g., close to nuclei, defect cores) is sparse, leading to higher epistemic uncertainty unless specifically augmented.
  • Grid-point inference can be computationally heavy for extremely fine meshes, though it scales linearly and is embarrassingly parallel.