Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning

Chemistry

A. S. Anker, E. T. S. Kjær, et al.

Dive into the world of materials science with our cutting-edge research! This paper unveils the Machine Learning based Motif Extractor (ML-MotEx), a tool that reveals which structural features drive model fit quality in X-ray and neutron scattering studies. Conducted by a team from the University of Copenhagen and collaborating institutions, this work sheds light on disordered nanomaterials and clusters using explainable machine learning techniques.

Introduction
The study addresses the challenge of determining local structural motifs in nanostructured, disordered, or amorphous materials from pair distribution function (PDF) data. Traditional crystallographic techniques rely on long-range order and are not generally applicable to nanomaterials. PDF analysis can probe local structure but structure solution from PDF is rarely possible, necessitating refinement against plausible starting models. Selecting suitable starting models is a major bottleneck, especially given that brute-force screening of many candidate motifs is computationally prohibitive and does not readily yield interpretable insights about which structural features matter. The authors propose ML-MotEx, an explainable machine learning workflow that learns from fits of a subset of candidate motifs to predict fit quality and quantify the importance of individual atoms or features for achieving good fits. This aims to accelerate screening, reduce computational costs, and provide interpretable, feature-level insights into structural motifs present in the data.
Literature Review
Prior approaches include Reverse Monte Carlo and the LIGA algorithm for limited cases of structure solution from PDFs, but broadly applicable ab initio solutions remain elusive. Automated strategies such as structure mining and cluster mining have been developed to generate and fit large numbers of candidate models to identify best fits; however, these are computationally intensive and provide limited interpretability regarding feature importance. The authors' previous work generated catalogs of MoOx and metal nanocluster structures for brute-force fitting to identify likely motifs, but scalability and interpretability were problematic. Interpretable ML has shown promise in related materials characterization tasks. ML-MotEx builds on these insights by leveraging gradient-boosted decision trees and SHAP values to provide motif-level importance directly from PDF fits, addressing both the scalability and interpretability gaps highlighted in the literature.
Methodology
ML-MotEx comprises four steps:
(1) Catalogue generation: From a chosen starting structure (often a unit-cell-derived discrete model without symmetry), candidate motifs are generated by randomly removing subsets of selected atom types (typically the heavier scatterers). The permutation number N is the number of atoms considered for inclusion/exclusion, giving a potential space of 2^N structures, of which only a small random subset is sampled (~10^4 motifs per starting model; ~140–3000 per N in the examples). Non-permuted atoms (e.g., oxygen) that fall outside a user-defined bonding distance to any retained atom are removed.
(2) PDF fitting: Each candidate motif's PDF is computed with the Debye equation and fitted to the target PDF using DiffPy-CMI. Typical refinements include a scale factor, isotropic expansion/contraction, and isotropic ADPs; atomic positions can be refined but were generally fixed for computational efficiency. Fit quality is measured by Rwp.
(3) ML prediction: An XGBoost gradient-boosted decision tree regressor is trained to predict Rwp from binary features indicating the presence/absence of each permuted atom, plus the motif size (number of atoms). Data are split 80% training / 20% test. Hyperparameters (e.g., learning rate, max depth) are tuned via Bayesian optimization with cross-validation (50 iterations), and the model aggregates predictions across 100 trees.
(4) Explainability and feature importance: SHAP values are computed for each feature (atom and cluster size) for every fitted motif. For each atom, SHAP values are separated into 'kept' and 'removed' cases to compute average SHAP values. The atom contribution value is defined as SHAP_average-kept − SHAP_average-removed (negative means keeping the atom lowers Rwp). Its uncertainty is the RMS difference between the kept and removed SHAP distributions, and a confidence factor is defined as contribution/uncertainty. Outputs include structure files (VESTA/CrystalMaker) with atoms colored by contribution value.
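Step 1 can be sketched in pure Python as below; this is a minimal illustration, not the authors' code, and the function name, seed handling, and sample counts are placeholders:

```python
import random

def generate_catalogue(n_permuted, n_samples, seed=0):
    """Sample candidate motifs from the 2^N space of atom subsets.

    Each motif is a binary mask over the N permuted atoms
    (1 = atom kept, 0 = atom removed). The feature vector fed to
    the regressor in step 3 is this mask plus the motif size.
    """
    rng = random.Random(seed)
    catalogue = []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n_permuted)]
        size = sum(mask)  # number of atoms kept = motif size feature
        catalogue.append(mask + [size])
    return catalogue

# e.g. N = 48 permuted atoms, ~10^4 sampled motifs out of 2^48 possibilities
motifs = generate_catalogue(n_permuted=48, n_samples=10_000)
```

In the full workflow each sampled motif would then be completed by pruning unbonded non-permuted atoms (e.g., oxygens) and fitted to the target PDF with DiffPy-CMI to obtain its Rwp label.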
Efficiency and scalability: By learning from a sub-sample of candidate motifs, ML-MotEx avoids exhaustive brute-force fitting (which scales as 2^N). Reported runtimes are minutes to hours versus days to astronomical times for brute-force on larger N. Hardware examples include ~100 s for 10^4 fits for a 48-atom model on a 64-core Threadripper CPU.
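The scaling gap can be checked with back-of-the-envelope arithmetic. The per-fit rate below is inferred from the reported ~24-day brute-force figure for N = 24 and is only an order-of-magnitude assumption:

```python
SECONDS_PER_YEAR = 365.25 * 86400

# Assumed rate: 2^24 brute-force fits in ~24 days (from the N = 24 figure)
fits_per_second = 2**24 / (24 * 86400)

for n in (24, 48, 72):
    candidates = 2**n                      # full combinatorial space
    seconds = candidates / fits_per_second
    years = seconds / SECONDS_PER_YEAR
    print(f"N={n}: {candidates:.2e} motifs, ~{years:.1e} years brute-force")
```

This reproduces the reported orders of magnitude (~10^6 years at N = 48, ~10^13 years at N = 72), while ML-MotEx needs only ~10^4 sampled fits regardless of N.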
Key Findings
- ML-MotEx accurately predicts fit quality and identifies key structural motifs from PDF data while using only a small subset of candidate motifs for training, enabling rapid screening compared to brute-force.
- Example 1 (C60 buckyball, simulated PDF): Trained on 384,260 candidate motifs (out of an intractable 2^132 possibilities), the GBDT model predicted Rwp with MAE ≈ 2.0% and MSE ≈ 11.7% on the test set (76,852 motifs). SHAP analysis showed cluster size as the dominant feature, with small clusters (0–34 atoms) yielding large positive SHAP values (poor fits). Atom-level contributions highlighted the central 60 atoms forming the C60 motif as strongly favorable (negative contribution), effectively recovering the buckyball despite it not being in the catalog. The average confidence factor for mislabelled atoms was low (0.37) versus overall (1.26).
- Example 2 (Disordered MoOx on γ-Al2O3, experimental PDF): Using a Mo36O128 POM-derived starting model, the best-fitting candidate motifs typically had 5–7 Mo atoms (lowest Rwp ≈ 45% for a Mo5O24 motif). SHAP-derived atom contributions highlighted edge-sharing [MoO6] octahedral 'triads' and their connectivity as key motifs, consistent with prior brute-force findings and with heptamolybdate-like [Mo7O24]6− connectivity. ML-MotEx identified local motifs; medium-range order remained beyond scope.
- Example 3 (α-Keggin clusters in solution, experimental PDF): Four different starting models containing the α-Keggin motif (N = 24, 48, 48, 72 permuted atoms) each yielded, via atom contribution mapping, the α-Keggin W12O40 motif as the kept atoms, with only a few mislabelled atoms in the larger-N cases. For 10^4 sampled motifs per starting model, ML-MotEx completed in ~100 s for N = 48 on a 64-core CPU. Brute-force would take ~3×10^6 years for N = 48 and ~6×10^13 years for N = 72.
- Additional demonstration (Supplementary): A larger [Bi38O45] ionic cluster was identified using a 'cookie-cutter' catalog generation strategy from β-Bi2O3, showing extensibility to other systems.
- Scalability: Analyses that would require ~24 days (N = 24), ~3×10^6 years (N = 48), or ~6×10^13 years (N = 72) by brute-force are reduced to minutes–hours with ML-MotEx.
- Interpretability: SHAP-based atom contribution and confidence factors quantify which atoms/features decrease or increase Rwp, enabling unbiased motif extraction and facilitating chemical interpretation.
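The atom contribution and confidence factor used throughout these results can be sketched as below on synthetic SHAP values. The numbers are made up, and combining the two spreads in quadrature is one plausible reading of the paper's "RMS difference between kept and removed SHAP distributions", not the authors' exact implementation:

```python
import math
import statistics

def atom_contribution(shap_kept, shap_removed):
    """Contribution = mean SHAP when the atom is kept minus mean SHAP
    when it is removed; negative means keeping the atom lowers Rwp."""
    contribution = statistics.fmean(shap_kept) - statistics.fmean(shap_removed)

    # Assumed uncertainty: RMS spreads of the two SHAP distributions,
    # combined in quadrature (one interpretation of the paper's wording).
    uncertainty = math.hypot(statistics.pstdev(shap_kept),
                             statistics.pstdev(shap_removed))

    # Sign convention assumed: confidence reported as a positive ratio.
    confidence = abs(contribution) / uncertainty if uncertainty else float("inf")
    return contribution, uncertainty, confidence

# Synthetic example: keeping this atom tends to lower the predicted Rwp
kept = [-2.1, -1.8, -2.4, -1.9]    # SHAP values from motifs containing the atom
removed = [1.5, 1.9, 1.2, 1.7]     # SHAP values from motifs lacking it
contrib, unc, conf = atom_contribution(kept, removed)
# contrib < 0 → atom is part of a favourable motif
```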
Discussion
ML-MotEx addresses the core challenge of model selection in PDF-based structural analysis by combining efficient learning of Rwp with per-feature importance via SHAP. This allows not only rapid identification of promising candidates but also principled extraction of the structural motifs that most strongly improve fit quality. Across simulated and experimental datasets (C60, disordered MoOx, and α-Keggin clusters), ML-MotEx reliably recovered the correct local motifs and quantified their importance, even when trained on a small subset of the combinatorial search space. The method operates in quasi-experimental time and scales to larger systems than feasible by brute-force. Relative to LIGA, which builds clusters from interatomic distance lists, ML-MotEx works directly on measured PDFs, avoiding the nontrivial and non-unique distance-list extraction. However, unlike LIGA’s ab initio nature, ML-MotEx requires a starting model that contains the target motif. The approach can be integrated with structure-mining workflows (e.g., structureMining@PDFitc) to identify plausible starting structures, followed by ML-MotEx for motif extraction and interpretability. The method’s speed and interpretability make it promising for time-resolved studies where motif populations evolve over time, observable via changing SHAP values. While demonstrated with PDF fitting in step 2, steps 1, 3, and 4 are modality-agnostic and could be adapted to other characterization techniques.
Conclusion
The study introduces ML-MotEx, an explainable ML framework that efficiently screens candidate structural motifs against PDF data and quantifies atom- and feature-level contributions to fit quality. It dramatically reduces computational costs versus brute-force searches and provides interpretable outputs that highlight key motifs, as demonstrated on simulated C60, disordered MoOx, and α-Keggin clusters, with extensibility shown for larger clusters. The approach bridges traditional refinement and ab initio motif discovery by requiring only a plausible starting model while delivering feature-level insights. Future directions include integrating physics-informed goodness-of-fit metrics (e.g., DFT-informed penalties) to discourage unphysical motifs, deploying ML-MotEx on PDFitc.org, coupling with structure-mining to automate starting model selection, extending step 2 beyond PDF to other modalities, and applying to time-resolved datasets to track motif dynamics.
Limitations
- Requires a starting structure containing the correct motif; performance and correctness depend on the quality/relevance of the starting model.
- Motif selection is based on fit quality (Rwp) alone and may yield chemically unphysical motifs unless constrained or augmented (e.g., with DFT or additional priors).
- Sensitivity to catalog size and permutation number N: larger N may lead to occasional mislabelled atoms unless more samples/fits are used.
- The current implementation uses PDF fitting for step 2; adaptation to other techniques requires method-specific fitting, but steps 1, 3, and 4 are transferable.
- Provides local-motif insight; medium-range order may not be captured by the extracted motifs alone.