CELL: a Python package for cluster expansion with a focus on complex alloys

S. Rigamonti, M. Troppenz, et al.

CELL is a modular Python package for cluster expansion tailored to complex alloys, developed by Santiago Rigamonti, Maria Troppenz, Martin Kuban, Axel Hübner, and Claudia Draxl. It handles alloy systems with arbitrary numbers of species and sublattices, integrates machine-learning techniques for model construction, and provides visualization tools.

Introduction
The paper addresses how to efficiently and accurately model configuration-dependent properties of complex materials systems where direct first-principles enumeration is infeasible due to combinatorial explosion. It motivates the cluster expansion (CE) method as a way to achieve near ab initio accuracy at reasonable cost by mapping properties onto generalized Ising-like models. The work highlights challenges posed by technologically relevant systems (multi-component, multi-sublattice, surfaces/interfaces, large parent lattices) and by data-driven model construction (training set design, model selection, finite-temperature analysis). The authors present CELL, a modular Python package integrating CE with modern machine-learning strategies and thermodynamic sampling, capable of handling arbitrary numbers of species and sublattices in 1D/2D/3D and interfacing broadly with ASE and ab initio codes. The purpose is to provide a comprehensive, extensible toolkit and to demonstrate its capabilities through surface alloy, bulk alloy, and complex intermetallic examples.
Literature Review
The work builds on the CE formalism for multicomponent, multisublattice systems and its basis-function construction (e.g., discrete Chebyshev and trigonometric bases). It draws on structure-selection strategies such as special quasirandom structures and on compressed-sensing approaches (ℓ0/ℓ1 penalties; LASSO) for sparse, predictive models, which CELL makes available through its interface to scikit-learn estimators. ASE is used for structure handling and calculator interfaces. Prior CE software (e.g., ATAT and more recent packages) is noted, situating CELL among modern CE toolkits with emphasis on handling complex materials, modularity, ML integration, and Wang–Landau sampling.
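To make the basis-function construction mentioned above concrete, here is a minimal sketch of one commonly used trigonometric site basis for a site that can host M species. The function names `trig_site_basis` and `gram` are hypothetical, and the exact sign and normalization conventions used inside CELL are an assumption here, not something this summary confirms.

```python
import numpy as np

def trig_site_basis(M):
    """Return B with B[j, s] = phi_j(s) for species index s = 0..M-1.

    Assumed convention (illustrative only):
      phi_0 = 1
      phi_j = -cos(2*pi*ceil(j/2)*s/M)  for odd j
      phi_j = -sin(2*pi*ceil(j/2)*s/M)  for even j > 0
    """
    s = np.arange(M)
    B = np.empty((M, M))
    B[0] = 1.0
    for j in range(1, M):
        k = (j + 1) // 2  # integer form of ceil(j / 2)
        angle = 2.0 * np.pi * k * s / M
        B[j] = -np.cos(angle) if j % 2 == 1 else -np.sin(angle)
    return B

def gram(B):
    """Gram matrix under the species-averaged inner product <f, g> = mean_s f(s) g(s)."""
    return B @ B.T / B.shape[1]

for M in (2, 3, 4):
    G = gram(trig_site_basis(M))
    off_diagonal = G - np.diag(np.diag(G))
    print(M, "max off-diagonal:", float(np.abs(off_diagonal).max()))
```

With this convention the functions are mutually orthogonal under the species average but not all of unit norm for M > 2, so a package that needs an orthonormal set would rescale them. For a binary site (M = 2) the single non-constant function reduces to the familiar Ising-like values −1 and +1.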
Methodology
Methods overview:
- CE formalism: A configuration σ on a parent lattice is expanded in orthonormal cluster functions Γα(σ), giving P(σ) = Σα Jα Γα(σ), or in symmetrized form P(σ) = Σα mα Jα Xα(σ), where Xα are cluster correlation functions averaged over symmetry-equivalent clusters (orbits) and mα is the multiplicity of orbit α. In practice the infinite basis is truncated and the effective cluster interactions (ECIs) Jα are learned from data.
- ML framing: Given structures S, properties P, and a chosen cluster set C, build the matrix X of correlations and solve for J by regularized optimization, J = argmin ||XJ − P||p + Φ(J). Strategies include ridge (ℓ2), LASSO (ℓ1), and combinatorial subset selection approximating ℓ0 sparsity. Overfitting and underfitting are monitored via cross-validation (CV).

CELL implementation: core classes and workflow
- Structure handling: ParentLattice (multi-component, multisublattice; inherits from ASE Atoms), SuperCell (expands ParentLattice), Structure (SuperCell with decoration), StructuresSet (container with database/serialization). Visualization via juview and ASE tools.
- Property calculation: StructuresSet.set_calculator attaches any ASE calculator (e.g., EMT); alternatively, a user-defined prop_func can be supplied. Example property: adsorption energy E_ads for O on Pt/Cu(111), including BFGS relaxations with constraints on layers and degrees of freedom.
- Cluster basis: ClustersPool defines candidate clusters by number of points and radius (e.g., npoints 1–4; a negative radius selects all unique clusters within the supercell under periodic boundary conditions). Serialization and info utilities are provided.
- Correlations: CorrelationsCalculator computes X in the chosen basis (e.g., trigonometric) for a given ParentLattice and ClustersPool.
- Model building: EstimatorFactory wraps scikit-learn and native estimators (e.g., Ridge). Model encapsulates the CE and links it to the correlations and the estimator. ModelBuilder automates cluster selection and estimator fitting via two strategies: subsets_cv (monotone sets by size/radius) and lasso_cv (sparsity-controlled ℓ1 selection). It reports fit and CV metrics.

Thermodynamics modules
- Metropolis Monte Carlo: Canonical ensemble with specified sublattice concentrations; acceptance probability P = min[1, exp(−ΔE/kBT)], using CE-predicted total-energy changes. Thermodynamic averages (e.g., Cp) are extracted from trajectories.
- Wang–Landau (WL) sampling: Estimates the configurational density of states g(E) by flat-histogram sampling with a modification factor f and flatness criteria; returns a ConfigurationalDensityOfStates object from which thermodynamic properties can be computed across temperatures without repeated MC runs. Parallel WL partitions the energy range into windows and stitches log g(E) using microcanonical-temperature overlaps.

Demonstration systems
- Pt/Cu(111) with O adsorption: Parent lattice with a Cu slab, a top-layer Pt/Cu sublattice, and fcc hollow sites (vacancy/O). Generate 50 random structures (16 surface sites per example supercell), compute E_ads via EMT and a custom workflow, build cluster pools (e.g., up to 4-point clusters) and correlations, and fit CE models via ridge and ModelBuilder (subsets_cv and lasso_cv). Evaluate fit and CV errors and visualize predictions vs. concentration. Perform MC and WL thermodynamics (various supercell sizes) to identify the order–disorder transition and the ground-state ordering.
- Si–Ge: 16-atom supercell; generate 43 random configurations across Ge counts 0–16. DFT (FHI-aims, PBEsol, tight basis, 10×10×10 k-grid) for relaxed structures. Build CE models for the energy of mixing per atom, E_mix, and the lattice parameter a0 using cluster pools of up to 3-point clusters and exhaustive subset selection under constraints. Validate via CV; enumerate all derivative structures up to 16 atoms for predictions; evaluate correlations analytically in the random-alloy limit; run parallel WL for large cells (up to 2744 atoms) to study demixing thermodynamics.
- Ba8AlxSi46−x clathrate (54-atom parent cell): Iterative CE workflow combining initial random data, CE training, Metropolis sampling to identify lowest non-degenerate (LND) structures, ab initio evaluation, and model refinement. Iterate until the CV target (~1 meV/atom) is reached; use the improved models to predict ground-state structures across compositions and to enable finite-temperature analyses.
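The fitting step described under "ML framing" can be illustrated with a small, self-contained sketch. It does not use CELL's classes: the toy system (a binary ring of 12 sites), the helper names `correlations`, `ridge_fit`, and `loo_cv_rmse`, and the synthetic "true" ECIs are all assumptions made for illustration. The sketch builds a correlation matrix X, solves the ridge problem in closed form, and reports fit and leave-one-out CV errors.

```python
import numpy as np

def correlations(sigma, max_dist=3):
    """Correlation vector of a ring configuration sigma (entries +1/-1).

    Columns: empty cluster, point cluster, pair clusters at distance 1..max_dist.
    """
    x = [1.0, sigma.mean()]
    for d in range(1, max_dist + 1):
        x.append(np.mean(sigma * np.roll(sigma, d)))
    return np.array(x)

def ridge_fit(X, P, alpha):
    """Closed-form ridge solution J = (X^T X + alpha I)^-1 X^T P."""
    n_clusters = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_clusters), X.T @ P)

def loo_cv_rmse(X, P, alpha):
    """Leave-one-out cross-validation RMSE of the ridge model."""
    errors = []
    for i in range(len(P)):
        keep = np.arange(len(P)) != i
        J = ridge_fit(X[keep], P[keep], alpha)
        errors.append(X[i] @ J - P[i])
    return float(np.sqrt(np.mean(np.square(errors))))

rng = np.random.default_rng(0)

# Hypothetical "true" ECIs: only empty, point and nearest-neighbour pair matter.
J_true = np.array([0.10, -0.05, 0.03, 0.0, 0.0])

# 40 random decorations of the ring stand in for a training set.
structures = rng.choice([-1, 1], size=(40, 12))
X = np.array([correlations(s) for s in structures])

# Synthetic property values with a small amount of noise.
P = X @ J_true + rng.normal(scale=0.002, size=len(X))

J_fit = ridge_fit(X, P, alpha=1e-8)
fit_rmse = float(np.sqrt(np.mean((X @ J_fit - P) ** 2)))
cv_rmse = loo_cv_rmse(X, P, alpha=1e-8)

print("fitted ECIs:", np.round(J_fit, 4))
print("fit RMSE:", fit_rmse, "CV RMSE:", cv_rmse)
```

Here the cluster pool (5 clusters) is much smaller than the training set (40 structures), so fit and CV errors stay close. Enlarging the pool toward the number of structures would drive the fit error to zero while the CV error grows, which is the overfitting signature that cluster selection is meant to control.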
Key Findings
- Pt/Cu(111)+O CE modeling:
  - Training set: 50 optimized random structures; initial clusters pool: 122 clusters. Ridge regression with α≈1e−8 fits the training data exactly (Fit RMSE=0.00000) but has CV RMSE=0.01412, indicating overfitting.
  - Subset-based selection (ModelBuilder, subsets_cv): 12 clusters selected; Fit RMSE=0.00469, CV RMSE=0.00677.
  - LASSO-CV selection: 22 clusters; Fit RMSE=0.00095, CV RMSE=0.00183.
  - Thermodynamics: Metropolis MC on 16-site and 64-site surface cells shows specific-heat peaks near ~450 K (smaller cell) and ~350 K (larger cell), indicating an order–disorder transition. WL sampling yields Cp(T), U, F, and S at lower computational cost and identifies a low-T ordered p(2×2) Pt arrangement and high-T disordered states.
- Si–Ge alloy:
  - DFT training: 43 structures (0–16 Ge); FHI-aims, PBEsol, relaxed to forces <5×10−3 eV/Å; 10×10×10 k-grid.
  - CE models: Cluster pool of up to 3-point clusters (empty + one 1-point + four 2-point + six 3-point). The optimal E_mix model uses 8 clusters; the optimal a0 model uses 11 clusters. Learning curves show CV minima, which guard against overfitting.
  - Predictions: Excellent agreement with DFT on the training set. Full enumeration up to 16 atoms yields E_mix ≥ 0 for all structures, i.e., a demixing tendency without ordered ground states. The lattice constant shows negative bowing (deviation from Vegard’s law); the random-alloy CE prediction matches the random training structures; the experimental bowing is smaller in magnitude.
  - Demixing thermodynamics via parallel WL (32 CPUs; up to 2744 atoms): Flat histograms per energy window; log g(E) stitched via microcanonical-temperature overlap. The canonical probability P(E,T) shows a single maximum at every T (non-first-order behavior for the sizes studied). Cp(T) peaks near T≈196 K (largest system), suggesting a transition at ~200 K, consistent with the literature range (∼170–360 K). Microcanonical samples visualize the progression from phase-separated to mixed/disordered states with increasing T. The estimated Δa(T) decreases with T, approaching the random-alloy value above the transition; near 196 K, Δa is close to experiment.
- Ba8AlxSi46−x clathrate (complex parent lattice):
  - The iterative CE converges in 4 iterations. Iteration-wise CV RMSE: ~4.9 → 4.4 → 0.9 → 0.8 meV/atom; outliers diminish, and box plots show improved error distributions.
  - Ground-state structures across x=6–16 are identified with only about 40 ab initio calculations; quasi-degenerate alternatives at some compositions are confirmed.
  - The final models are accurate enough for finite-T studies (e.g., the prior finding of a temperature-driven semiconductor-to-metal transition accompanied by partial order–disorder).
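The reason a density of states gives Cp(T), U, F, and S at all temperatures from a single run is that canonical averages are simple reweighted sums over g(E). The sketch below shows that reweighting step only. It is not CELL's ConfigurationalDensityOfStates API: the function name `canonical_from_dos` is hypothetical, k_B is set to 1, and the input ln g(E) is an analytic toy (N independent two-level sites) rather than Wang–Landau output.

```python
import numpy as np
from math import lgamma

def canonical_from_dos(E, log_g, kT):
    """Internal energy U and heat capacity C (units of k_B) from ln g(E)."""
    w = log_g - E / kT
    w = w - w.max()          # shift before exponentiating to avoid overflow
    p = np.exp(w)
    p /= p.sum()             # canonical probability P(E, T)
    U = float(np.sum(p * E))
    C = float((np.sum(p * E**2) - U**2) / kT**2)
    return U, C

# Toy density of states: N two-level sites with level spacing 1,
# so E = k and g(E) = binomial(N, k). Working with ln g is essential:
# g itself overflows double precision for N this large.
N = 2000
k = np.arange(N + 1)
E = k.astype(float)
log_g = np.array([lgamma(N + 1) - lgamma(i + 1) - lgamma(N - i + 1) for i in k])

for kT in (0.2, 0.5, 1.0, 2.0):
    U, C = canonical_from_dos(E, log_g, kT)
    print(f"kT={kT:4.1f}  U/N={U / N:.4f}  C/N={C / N:.4f}")
```

For this toy the exact results are known, U = N/(exp(1/kT) + 1) and C = N (1/kT)² exp(1/kT)/(exp(1/kT) + 1)², which makes it a convenient check of the reweighting before applying the same sums to a sampled ln g(E).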
Discussion
The findings demonstrate that CELL provides a robust, modular framework for building CE models that balance accuracy and generalization through principled model selection (subset enumeration and LASSO) and rigorous cross-validation. For complex, coupled sublattice surface systems, CELL can capture adsorption energetics and finite-temperature ordering phenomena, identifying order–disorder transitions and ground-state ordering patterns. In bulk alloys like Si–Ge, CE models derived from relatively small, high-quality DFT datasets accurately reproduce energies and structural parameters, enable full enumeration within supercell limits, and, combined with WL sampling, provide comprehensive thermodynamic insights (density of states, canonical probabilities, specific heat, thermal expansion) including access to phase-separated states in the canonical ensemble. For materials with large parent cells (e.g., clathrates), CELL’s iterative workflow efficiently discovers ground-state structures while steadily improving model accuracy, making problems intractable by brute-force enumeration feasible. Altogether, results substantiate CELL’s suitability for tackling complexity (multi-component/sublattice, surfaces, large parent lattices), leveraging ML, and performing scalable thermodynamics analyses.
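The finite-temperature ordering behavior discussed above comes from sampling configurations at fixed composition with a CE energy model. A minimal sketch of that idea follows, assuming a toy nearest-neighbour energy on a 16-site binary ring in place of a fitted cluster expansion; the function names, the coupling J, and the temperatures are illustrative choices, not values from the paper.

```python
import numpy as np

def ring_energy(s, J=1.0):
    """Toy energy: J times the sum of nearest-neighbour products on a ring.

    With J > 0, unlike neighbours are favoured (ordering tendency).
    """
    return J * float(np.sum(s * np.roll(s, 1)))

def metropolis_swaps(s, kT, n_steps, rng, J=1.0):
    """Canonical Metropolis sampling: swap two sites, so composition is conserved."""
    s = s.copy()
    E = ring_energy(s, J)
    energies = np.empty(n_steps)
    for step in range(n_steps):
        i, j = rng.integers(0, len(s), size=2)
        if s[i] != s[j]:
            trial = s.copy()
            trial[i], trial[j] = trial[j], trial[i]
            dE = ring_energy(trial, J) - E
            # Accept with probability min[1, exp(-dE / kT)].
            if dE <= 0 or rng.random() < np.exp(-dE / kT):
                s, E = trial, E + dE
        energies[step] = E
    return s, energies

rng = np.random.default_rng(1)
start = np.array([1] * 8 + [-1] * 8)   # phase-separated starting decoration

s_low, e_low = metropolis_swaps(start, kT=0.2, n_steps=20000, rng=rng)
s_high, e_high = metropolis_swaps(start, kT=50.0, n_steps=20000, rng=rng)

# Average over the second half of each trajectory as a crude equilibration cut.
mean_low = e_low[10000:].mean()
mean_high = e_high[10000:].mean()
print("mean energy at low T:", mean_low)
print("mean energy at high T:", mean_high)
```

Recomputing the full energy for every trial move keeps the sketch short; a production sampler would evaluate only the local energy change, and would use a trained CE model for the energies.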
Conclusion
The paper introduces CELL, a comprehensive Python package for cluster expansion and statistical thermodynamics. It unifies structure generation, property calculation, cluster-basis construction, correlation computation, model selection with scikit-learn integration, and finite-temperature sampling (Metropolis and Wang–Landau). Through three representative applications (the O–Pt/Cu(111) surface alloy, the Si–Ge bulk alloy, and the complex clathrate Ba8AlxSi46−x), CELL is shown to build accurate, generalizable CE models, reveal ordering and demixing phenomena, and efficiently identify ground states even for large parent lattices. Future work can expand AI-driven model construction, enhance parallel sampling and uncertainty quantification, and broaden interfaces to electronic-structure and interatomic-potential frameworks, enabling deeper insights into emerging complex materials.
Limitations
- The surface example used EMT potentials: while EMT is reasonable for Pt and Cu, oxygen is not well described, so the adsorption-energy results are illustrative rather than fully quantitative.
- Structural relaxations for the surface system were constrained to avoid complex reconstructions; such effects could be incorporated by extending the parent lattice but were omitted for simplicity.
- Overfitting can occur when the cluster pool exceeds the training data; careful model selection (subset or LASSO) is required.
- The Si–Ge thermal-expansion analysis neglects anharmonic effects, likely overestimating the magnitude of Δa at higher temperatures.
- WL/MC sampling and CE accuracy depend on the chosen cluster truncations, training-set quality and size, and convergence criteria; finite-size effects influence transition-temperature estimates.