Extracting structural motifs from pair distribution function data of nanostructures using explainable machine learning

Chemistry

A. S. Anker, E. T. S. Kjær, et al.

Dive into the world of materials science with our cutting-edge research! This paper unveils the Machine Learning based Motif Extractor (ML-MotEx), a tool that reveals which structural features drive model fit quality in X-ray and neutron scattering studies. Conducted by a team from the University of Copenhagen and collaborating institutions, this work sheds light on disordered nanomaterials and clusters using explainable machine learning techniques.

Introduction
The study addresses the challenge of determining local structural motifs in nanostructured, disordered, or amorphous materials from pair distribution function (PDF) data. Traditional crystallographic techniques rely on long-range order and are not generally applicable to nanomaterials. PDF analysis can probe local structure but structure solution from PDF is rarely possible, necessitating refinement against plausible starting models. Selecting suitable starting models is a major bottleneck, especially given that brute-force screening of many candidate motifs is computationally prohibitive and does not readily yield interpretable insights about which structural features matter. The authors propose ML-MotEx, an explainable machine learning workflow that learns from fits of a subset of candidate motifs to predict fit quality and quantify the importance of individual atoms or features for achieving good fits. This aims to accelerate screening, reduce computational costs, and provide interpretable, feature-level insights into structural motifs present in the data.
Literature Review
Prior approaches include Reverse Monte Carlo and the LIGA algorithm for limited cases of structure solution from PDFs, but broadly applicable ab initio solutions remain elusive. Automated strategies such as structure mining and cluster mining have been developed to generate and fit large numbers of candidate models to identify best fits; however, these are computationally intensive and provide limited interpretability regarding feature importance. The authors' previous work generated catalogs of MoOx and metal nanocluster structures for brute-force fitting to identify likely motifs, but scalability and interpretability were problematic. Interpretable ML has shown promise in related materials characterization tasks. ML-MotEx builds on these insights by leveraging gradient-boosted decision trees and SHAP values to provide motif-level importance directly from PDF fits, addressing both the scalability and interpretability gaps highlighted in the literature.
Methodology
ML-MotEx comprises four steps:
(1) Catalogue generation: From a chosen starting structure (often a unit-cell-derived discrete model without symmetry), candidate motifs are generated by randomly removing subsets of selected atom types (typically the heavier scatterers). The permutation number N is the number of atoms considered for inclusion/exclusion, giving a potential space of 2^N structures, of which only a small random subset is sampled (~10^4 motifs per starting model; ~140–3000 per N in the examples). Non-permuted atoms (e.g., oxygen) that fall outside a user-defined bonding distance to any retained atom are removed.
(2) PDF fitting: Each candidate motif's PDF is computed with the Debye equation and fitted to the target PDF using DiffPy-CMI. Typical refinements include a scale factor, isotropic expansion/contraction, and isotropic ADPs; atomic positions can be refined but were generally fixed for computational efficiency. Fit quality is measured by Rwp.
(3) ML prediction: An XGBoost gradient-boosted decision tree regressor is trained to predict Rwp from binary features indicating the presence/absence of each permuted atom, plus the motif size (number of atoms). Data are split 80% training / 20% test. Hyperparameters (e.g., learning rate, max depth) are tuned via Bayesian optimization with cross-validation (50 iterations), and the model aggregates predictions across 100 trees.
(4) Explainability and feature importance: SHAP values are computed for each feature (atom and cluster size) for every fitted motif. For each atom, SHAP values are separated into 'kept' and 'removed' cases to compute average SHAP values. The atom contribution value is defined as SHAP_average-kept − SHAP_average-removed (negative means keeping the atom lowers Rwp). Its uncertainty is the RMS difference between the kept and removed SHAP distributions, and a confidence factor is defined as contribution/uncertainty. Outputs include structure files (VESTA/CrystalMaker) with atoms colored by contribution value.
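Step 1 can be sketched in pure Python as below; this is a minimal illustration, not the authors' code, and the function name, seed handling, and sample counts are placeholders:

```python
import random

def generate_catalogue(n_permuted, n_samples, seed=0):
    """Sample candidate motifs from the 2^N space of atom subsets.

    Each motif is a binary mask over the N permuted atoms
    (1 = atom kept, 0 = atom removed). The feature vector fed to
    the regressor in step 3 is this mask plus the motif size.
    """
    rng = random.Random(seed)
    catalogue = []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n_permuted)]
        size = sum(mask)  # number of atoms kept = motif size feature
        catalogue.append(mask + [size])
    return catalogue

# e.g. N = 48 permuted atoms, ~10^4 sampled motifs out of 2^48 possibilities
motifs = generate_catalogue(n_permuted=48, n_samples=10_000)
```

In the full workflow each sampled motif would then be completed by pruning unbonded non-permuted atoms (e.g., oxygens) and fitted to the target PDF with DiffPy-CMI to obtain its Rwp label.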
Efficiency and scalability: By learning from a sub-sample of candidate motifs, ML-MotEx avoids exhaustive brute-force fitting (which scales as 2^N). Reported runtimes are minutes to hours versus days to astronomical times for brute-force on larger N. Hardware examples include ~100 s for 10^4 fits for a 48-atom model on a 64-core Threadripper CPU.
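The scaling gap can be checked with back-of-the-envelope arithmetic. The per-fit rate below is inferred from the reported ~24-day brute-force figure for N = 24 and is only an order-of-magnitude assumption:

```python
SECONDS_PER_YEAR = 365.25 * 86400

# Assumed rate: 2^24 brute-force fits in ~24 days (from the N = 24 figure)
fits_per_second = 2**24 / (24 * 86400)

for n in (24, 48, 72):
    candidates = 2**n                      # full combinatorial space
    seconds = candidates / fits_per_second
    years = seconds / SECONDS_PER_YEAR
    print(f"N={n}: {candidates:.2e} motifs, ~{years:.1e} years brute-force")
```

This reproduces the reported orders of magnitude (~10^6 years at N = 48, ~10^13 years at N = 72), while ML-MotEx needs only ~10^4 sampled fits regardless of N.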
Key Findings
- ML-MotEx accurately predicts fit quality and identifies key structural motifs from PDF data while using only a small subset of candidate motifs for training, enabling rapid screening compared to brute-force.
- Example 1 (C60 buckyball, simulated PDF): Trained on 384,260 candidate motifs (out of an intractable 2^132 possibilities), the GBDT model predicted Rwp with MAE ≈ 2.0% and MSE ≈ 11.7% on the test set (76,852 motifs). SHAP analysis showed cluster size as the dominant feature, with small clusters (0–34 atoms) yielding large positive SHAP values (poor fits). Atom-level contributions highlighted the central 60 atoms forming the C60 motif as strongly favorable (negative contribution), effectively recovering the buckyball despite it not being in the catalog. The average confidence factor for mislabelled atoms was low (0.37) versus overall (1.26).
- Example 2 (Disordered MoOx on γ-Al2O3, experimental PDF): Using a Mo36O128 POM-derived starting model, the best-fitting candidate motifs typically had 5–7 Mo atoms (lowest Rwp ≈ 45% for a Mo5O24 motif). SHAP-derived atom contributions highlighted edge-sharing [MoO6] octahedral 'triads' and their connectivity as key motifs, consistent with prior brute-force findings and with heptamolybdate-like [Mo7O24]6− connectivity. ML-MotEx identified local motifs; medium-range order remained beyond scope.
- Example 3 (α-Keggin clusters in solution, experimental PDF): Four different starting models containing the α-Keggin motif (N = 24, 48, 48, 72 permuted atoms) each yielded, via atom contribution mapping, the α-Keggin W12O40 motif as the kept atoms, with only a few mislabelled atoms in the larger-N cases. For 10^4 sampled motifs per starting model, ML-MotEx completed in ~100 s for N = 48 on a 64-core CPU. Brute-force would take ~3×10^6 years for N = 48 and ~6×10^13 years for N = 72.
- Additional demonstration (Supplementary): A larger [Bi38O45] ionic cluster was identified using a 'cookie-cutter' catalog generation strategy from β-Bi2O3, showing extensibility to other systems.
- Scalability: Analyses that would require ~24 days (N = 24), ~3×10^6 years (N = 48), or ~6×10^13 years (N = 72) by brute-force are reduced to minutes–hours with ML-MotEx.
- Interpretability: SHAP-based atom contribution and confidence factors quantify which atoms/features decrease or increase Rwp, enabling unbiased motif extraction and facilitating chemical interpretation.
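The atom contribution and confidence factor used throughout these results can be sketched as below on synthetic SHAP values. The numbers are made up, and combining the two spreads in quadrature is one plausible reading of the paper's "RMS difference between kept and removed SHAP distributions", not the authors' exact implementation:

```python
import math
import statistics

def atom_contribution(shap_kept, shap_removed):
    """Contribution = mean SHAP when the atom is kept minus mean SHAP
    when it is removed; negative means keeping the atom lowers Rwp."""
    contribution = statistics.fmean(shap_kept) - statistics.fmean(shap_removed)

    # Assumed uncertainty: RMS spreads of the two SHAP distributions,
    # combined in quadrature (one interpretation of the paper's wording).
    uncertainty = math.hypot(statistics.pstdev(shap_kept),
                             statistics.pstdev(shap_removed))

    # Sign convention assumed: confidence reported as a positive ratio.
    confidence = abs(contribution) / uncertainty if uncertainty else float("inf")
    return contribution, uncertainty, confidence

# Synthetic example: keeping this atom tends to lower the predicted Rwp
kept = [-2.1, -1.8, -2.4, -1.9]    # SHAP values from motifs containing the atom
removed = [1.5, 1.9, 1.2, 1.7]     # SHAP values from motifs lacking it
contrib, unc, conf = atom_contribution(kept, removed)
# contrib < 0 → atom is part of a favourable motif
```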
Discussion
ML-MotEx addresses the core challenge of model selection in PDF-based structural analysis by combining efficient learning of Rwp with per-feature importance via SHAP. This allows not only rapid identification of promising candidates but also principled extraction of the structural motifs that most strongly improve fit quality. Across simulated and experimental datasets (C60, disordered MoOx, and α-Keggin clusters), ML-MotEx reliably recovered the correct local motifs and quantified their importance, even when trained on a small subset of the combinatorial search space. The method operates in quasi-experimental time and scales to larger systems than feasible by brute-force. Relative to LIGA, which builds clusters from interatomic distance lists, ML-MotEx works directly on measured PDFs, avoiding the nontrivial and non-unique distance-list extraction. However, unlike LIGA’s ab initio nature, ML-MotEx requires a starting model that contains the target motif. The approach can be integrated with structure-mining workflows (e.g., structureMining@PDFitc) to identify plausible starting structures, followed by ML-MotEx for motif extraction and interpretability. The method’s speed and interpretability make it promising for time-resolved studies where motif populations evolve over time, observable via changing SHAP values. While demonstrated with PDF fitting in step 2, steps 1, 3, and 4 are modality-agnostic and could be adapted to other characterization techniques.
Conclusion
The study introduces ML-MotEx, an explainable ML framework that efficiently screens candidate structural motifs against PDF data and quantifies atom- and feature-level contributions to fit quality. It dramatically reduces computational costs versus brute-force searches and provides interpretable outputs that highlight key motifs, as demonstrated on simulated C60, disordered MoOx, and α-Keggin clusters, with extensibility shown for larger clusters. The approach bridges traditional refinement and ab initio motif discovery by requiring only a plausible starting model while delivering feature-level insights. Future directions include integrating physics-informed goodness-of-fit metrics (e.g., DFT-informed penalties) to discourage unphysical motifs, deploying ML-MotEx on PDFitc.org, coupling with structure-mining to automate starting model selection, extending step 2 beyond PDF to other modalities, and applying to time-resolved datasets to track motif dynamics.
Limitations
- Requires a starting structure containing the correct motif; performance and correctness depend on the quality/relevance of the starting model.
- Motif selection is based on fit quality (Rwp) alone and may yield chemically unphysical motifs unless constrained or augmented (e.g., with DFT or additional priors).
- Sensitivity to catalog size and permutation number N: larger N may lead to occasional mislabelled atoms unless more samples/fits are used.
- The current implementation uses PDF fitting for step 2; adaptation to other techniques requires method-specific fitting, but steps 1, 3, and 4 are transferable.
- Provides local-motif insight; medium-range order may not be captured by the extracted motifs alone.