
Automated classification of big X-ray diffraction data using deep learning models
J. E. Salgado, S. Lerman, et al.
This digest summarizes a deep learning approach from Jerardo E. Salgado and colleagues for automated analysis of X-ray diffraction data. Their models address the limitations of traditional, expert-driven analysis, classifying crystal systems and space groups with high accuracy while adapting to diverse experimental conditions.
Introduction
The response of materials at extreme pressures depends strongly on atomic arrangement and crystal structure, which govern mechanical, electrochemical, and thermodynamic properties. Powder X-ray diffraction (XRD) is the standard technique for characterizing crystal symmetry, lattice parameters, atomic packing, and phases. However, current indexing and refinement require expert intervention and manual tuning (e.g., cell indexing, parameter initialization), which is especially challenging with multiphase samples, overlapping peaks, or when contextual knowledge is unavailable. Meanwhile, modern synchrotron methods generate massive XRD datasets that exceed manual analysis capacity, creating a need for adaptable, automated analysis. Deep learning (DL) offers the potential to learn rich representations from large datasets and to generalize beyond known references. The goal of this work is to develop generalized DL models that robustly classify both crystal systems (7 classes) and space groups (230 classes) directly from XRD patterns, with strong generalization to experimental and previously unseen materials.
Literature Review
Prior works have explored machine learning and deep learning for diffraction analysis, including pattern decomposition, clustering, symmetry classification, structure–property relationships, and phase mapping. Park et al. trained CNNs on synthetic XRD patterns for crystal system and space group classification, achieving strong synthetic performance but limited generalization to experimental patterns (failing on one of the two experimental patterns tested). Vecsei et al. trained DNN and CNN models with good synthetic accuracy (e.g., 86% on crystal systems) that dropped to 56% on the experimental RRUFF dataset, highlighting a generalization gap. Other studies addressed narrower subsets of symmetry classes or specific material groups. Overall, despite progress, there remains a need for robust models that generalize to dynamic and unseen real XRD data across diverse materials and experimental conditions.
Methodology
Data generation and training datasets: From 204,654 ICSD crystallographic information files, 171,076 distinct structures remained after removing incomplete/duplicate entries. Seven synthetic XRD datasets were generated to reflect varying experimental conditions via different peak-shape (Caglioti) parameters and noise implementations. Datasets used for training: (i) Baseline: synthetic dataset 1 (171k patterns), (ii) Mixed: random sample without replacement from datasets 1–4 (171k patterns), (iii) Large: union of all seven synthetic datasets (~1.2 million patterns). Synthetic XRD generation used Bragg’s law to compute peak positions and Lorentz, polarization, and structure factors for intensities. Peak shapes were implemented via Gaussian-like broadening with FWHM H from the Caglioti formula H = (U·tan²θ + V·tanθ + W)^(1/2), with dataset-specific U,V,W and intensity noise (0.2–2%) prior to normalization. Patterns were simulated at Cu Kα (λ=1.54 Å), 2θ range 5°–90° (0.01° step), and normalized so the maximum peak intensity equals 1000.
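The generation recipe above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the d-spacings and intensities below are placeholders (the real pipeline computes them from ICSD structures with Lorentz, polarization, and structure factors), and the Caglioti U, V, W values are assumptions rather than the dataset-specific parameters.

```python
import numpy as np

WAVELENGTH = 1.54  # Cu K-alpha, in angstroms

def caglioti_fwhm(theta_rad, U=0.01, V=-0.005, W=0.01):
    """FWHM H = sqrt(U*tan^2(theta) + V*tan(theta) + W), in degrees.
    U, V, W here are illustrative, not the paper's dataset-specific values."""
    t = np.tan(theta_rad)
    return np.sqrt(U * t**2 + V * t + W)

def simulate_pattern(d_spacings, intensities, noise_frac=0.0, seed=0):
    """Gaussian-broadened powder pattern on the 5-90 deg 2-theta grid (0.01 deg step)."""
    two_theta = np.arange(5.0, 90.0 + 1e-9, 0.01)  # 8501 points
    pattern = np.zeros_like(two_theta)
    for d, I in zip(d_spacings, intensities):
        s = WAVELENGTH / (2.0 * d)          # Bragg's law: sin(theta) = lambda / (2d)
        if s >= 1.0:
            continue                        # reflection not reachable at this wavelength
        theta = np.arcsin(s)
        center = np.degrees(2.0 * theta)    # peak position in 2-theta
        H = caglioti_fwhm(theta)
        sigma = H / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> Gaussian sigma
        pattern += I * np.exp(-0.5 * ((two_theta - center) / sigma) ** 2)
    rng = np.random.default_rng(seed)
    pattern += noise_frac * pattern.max() * rng.random(pattern.size)  # intensity noise
    return two_theta, pattern * (1000.0 / pattern.max())  # normalize max to 1000

two_theta, pattern = simulate_pattern([3.14, 1.92, 1.64], [100.0, 60.0, 30.0],
                                      noise_frac=0.01)
```

Varying U, V, W and the noise fraction per dataset, as the paper does, is what produces the seven experimental-condition variants.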
Model architectures: Three supervised classifiers were designed for the 7-way (crystal systems) and 230-way (space groups) tasks. (1) Standard CNN (SCNN): 3 convolutional layers (channels [80, 80, 80], kernel sizes [100, 50, 25], strides [5, 2, 1]), ReLU activations, 30% dropout after each conv, average pooling after each conv (kernel sizes [3, 3], strides [2, 1]), followed by a 3-layer MLP head [2300, 1150, num_classes] with ReLU and 50% dropout after each activation during training. (2) No-Pooling CNN (NPCNN): the same conv stack and dropout as SCNN but without pooling layers; flattened features feed the same 3-layer MLP head. (3) MLP: 5 ReLU-activated layers [4000, 3000, 1000, 800, num_classes] with dropouts [60%, 50%, 40%, 30%] after each ReLU during training. Training used softmax cross-entropy loss, the Adam optimizer, and batch size 256. Implemented in PyTorch (with the UnifiedML library); hyperparameters were selected for strong performance on experimental evaluation.
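The quoted kernel sizes and strides can be sanity-checked with a back-of-the-envelope calculation. Assuming unpadded ("valid") 1-D convolutions and an input length of 8501 points (2θ from 5° to 90° at 0.01° steps, per the data-generation description), the NPCNN's flattened feature size works out as follows; both assumptions are mine, not stated hyperparameters.

```python
# Feature sizes of the NPCNN conv stack, assuming no padding.
def conv_out_len(n, kernel, stride):
    """Output length of a 1-D convolution without padding."""
    return (n - kernel) // stride + 1

n = 8501                                   # assumed input pattern length
for k, s in [(100, 5), (50, 2), (25, 1)]:  # NPCNN kernel sizes / strides
    n = conv_out_len(n, k, s)              # 1681 -> 816 -> 792

flattened = n * 80                         # 80 channels in the final conv layer
```

This shows why the MLP head needs a wide first layer: without pooling, the flattened feature vector is tens of thousands of values long.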
Domain adaptation (sourcing): To adapt to experimental artifacts (e.g., instrument geometry, temperature/pressure, crystallite size, impurities), 50% of the RRUFF experimental patterns were included in training on top of the Large synthetic dataset, forming the Sourcing Large dataset (≈1.2M synthetic + ~452 experimental patterns). The remaining 50% of RRUFF was held out for testing.
Evaluation datasets: (1) RRUFF: 908 experimental XRD entries of minerals with substantial experimental variability relative to synthetic simulations; used to assess real-world generalization and domain-adapted models. (2) MP dataset: 1250 inorganic crystals from the Materials Project, selected for potential electromagnetic properties (distinct class distribution vs. RRUFF and training); synthetic patterns generated via the same pipeline; used to test generalization to unseen, distribution-shifted materials. (3) Lattice Augmentation dataset: 296 cubic crystals from the 171k set were uniformly scaled (lattice constants compressed/expanded) from 80% to 120% of original size, generating 490 patterns; tests robustness to peak shifts (e.g., alloying, elastic compression) without symmetry change.
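The lattice augmentation idea above follows directly from Bragg's law: uniformly scaling a cubic lattice scales every d-spacing by the same factor, so peaks shift in 2θ while the ratios between peak positions in Q-space are unchanged. A small sketch, with an illustrative cubic lattice constant rather than one of the paper's 296 crystals:

```python
import numpy as np

WAVELENGTH = 1.54  # Cu K-alpha, angstroms

def two_theta_deg(d):
    """Bragg peak position in degrees for d-spacing(s) d."""
    return np.degrees(2.0 * np.arcsin(WAVELENGTH / (2.0 * d)))

a = 4.05                                   # hypothetical cubic lattice constant
hkl = np.array([[1, 1, 1], [2, 0, 0], [2, 2, 0]])
d = a / np.sqrt((hkl ** 2).sum(axis=1))    # cubic d-spacing formula

# Compressing the lattice to 80% shifts every peak to higher angle...
shift = two_theta_deg(0.8 * d) - two_theta_deg(d)

# ...but relative peak spacing in Q = 2*pi/d is scale-invariant:
ratios = {}
for scale in (0.8, 1.0, 1.2):              # 80% to 120% lattice size
    q = 2.0 * np.pi / (scale * d)
    ratios[scale] = q / q[0]
```

A model keyed to these invariant ratios, rather than absolute peak positions, should tolerate such shifts, which is what the augmentation tests.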
Training regime: Models were trained on the Baseline, Mixed, Large, and Sourcing Large datasets. Synthetic-only models achieved near-perfect convergence on synthetic data; mixed/large training introduced variability for robustness. Pooling ablations (NPCNN) tested the effect of preserving local peak relationships without downsampling. Performance was assessed via accuracy and F1 score; confusion matrices analyzed class bias.
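The two reported metrics behave differently under the class imbalance discussed later: accuracy can be inflated by majority-class predictions, while a macro-averaged F1 weights every class equally. A toy illustration with made-up labels (the paper's evaluation uses RRUFF/MP predictions):

```python
# Accuracy vs. macro-averaged F1 on an imbalanced toy problem.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)  # every class counts equally

y_true = [0, 0, 0, 0, 1, 1]   # imbalanced toy labels
y_pred = [0, 0, 0, 0, 0, 1]   # majority-class-biased predictions
```

Here accuracy is 5/6 but macro F1 is only 7/9, because the rare class is half-missed; this is why the paper reports F1 alongside accuracy.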
Key Findings
- Synthetic-only performance: SCNN and MLP reached ~96% (7-way) and ~94% (230-way) on synthetic data, indicating full learning of the synthetic distributions but not necessarily generalization.
- Mixed vs. Large synthetic datasets: Adding variability improved generalization; mixed datasets yielded sizable gains over baseline, while the larger dataset provided marginal additional improvements, suggesting mixed variations captured most relevant factors.
- Pooling ablation and experimental performance (RRUFF, trained on Large): NPCNN achieved 74% accuracy on 7-way classification, outperforming SCNN (~46%) and MLP (~67.7%). For 230-way, NPCNN reached 66% vs. SCNN at 28%, indicating substantial benefits from removing pooling and preserving local peak relations.
- Domain adaptation via sourcing (Sourcing Large): Including ~50% of RRUFF in training significantly improved experimental performance, indicating that the model learned to handle experimental artifacts (e.g., instrument effects, defects, impurities) even with a small fraction of experimental data (~452 samples) amid ~1.2M synthetic patterns.
- MP dataset (trained on Sourcing Large): 7-way accuracies were SCNN 75%, NPCNN 67%, MLP 54%; 230-way accuracies were SCNN 45%, NPCNN 36%, MLP 25%. Distribution shift affected rankings; the larger SCNN architecture excelled here despite NPCNN's advantages on RRUFF.
- Lattice Augmentation: Both MLP and NPCNN maintained high accuracy across lattice scalings; at 80% size, MLP 88% and NPCNN 94%. Highest accuracies occurred near small deviations (e.g., 98–102%), consistent with classification driven by relative peak spacing/intensity per Bragg’s law rather than absolute positions.
- F1 scores and bias (RRUFF, Sourcing Large): F1 scores were NPCNN 0.859, MLP 0.753, SCNN 0.568, aligning with accuracy trends. Confusion matrices revealed bias toward overrepresented space groups (e.g., 2, 12, 14, 15, also 62, 139), evidencing class-imbalance effects. Scatter analyses showed variability and that some underrepresented classes were still accurately classified.
Overall: removing pooling (NPCNN), adding synthetic-data variability, and applying domain adaptation together delivered state-of-the-art experimental performance, especially on 230-way space group classification.
Discussion
Generalization to real-world XRD requires both realistic training data and architectures aligned with diffraction physics. Augmenting synthetic data (peak-shape variations, noise) and incorporating a small amount of experimental data (domain sourcing) improved robustness across evaluation sets. Architectural analysis showed that preserving local relationships between peaks and maintaining translation equivariance while retaining positional order (NPCNN without pooling) aligns with Bragg’s-law-based reasoning and yields better experimental performance. Conversely, pooling layers induce permutation invariance that can erase crucial positional information, degrading performance on real patterns. MLPs rely more on absolute positions and lack translation equivariance, limiting robustness to peak shifts.
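The pooling argument can be made concrete with a toy example: average pooling with a window of 2 maps two signals whose peak sits at different positions inside the window to the same pooled output, discarding exactly the positional detail that Bragg's-law reasoning depends on. The 8-point "patterns" below are synthetic stand-ins, not real diffraction data.

```python
import numpy as np

def avg_pool(x, k=2):
    """Non-overlapping 1-D average pooling with window size k."""
    return x.reshape(-1, k).mean(axis=1)

a = np.array([0., 1., 0., 0., 0., 0., 0., 0.])  # peak at index 1
b = np.array([1., 0., 0., 0., 0., 0., 0., 0.])  # peak at index 0

pooled_a, pooled_b = avg_pool(a), avg_pool(b)   # identical after pooling
```

Two distinguishable peak positions become indistinguishable downstream of the pooling layer, which is consistent with the SCNN's weaker experimental performance reported above.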
Bias analyses confirmed that class imbalance induces predictions toward frequent space groups, potentially inflating accuracy without sound reasoning. While accuracy and F1 scores on RRUFF and MP were strong, mitigating imbalance (e.g., reweighting, duplicating rare classes) should further improve out-of-distribution classification. The lattice augmentation results supported that successful models rely on relative peak spacing/intensity rather than exact absolute positions, crucial for dynamic in situ experiments (e.g., compression or alloying) where peaks shift without symmetry change.
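One common form of the reweighting mentioned above is inverse-frequency class weights, which can be passed, for example, as the `weight` argument of `torch.nn.CrossEntropyLoss`. A minimal sketch; the space-group labels below are illustrative, and the real imbalance spans all 230 classes.

```python
from collections import Counter

def inverse_freq_weights(labels):
    """Weight_c = n / (k * count_c): frequent classes get smaller weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical training labels: space group 14 dominates.
weights = inverse_freq_weights([14, 14, 14, 14, 2, 2, 62, 1])
```

With these weights, misclassifying a rare space group costs the loss more than misclassifying a frequent one, counteracting the majority-class bias seen in the confusion matrices.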
Conclusion
This work presents generalized deep learning models for automated classification of crystal systems (7-way) and space groups (230-way) directly from XRD patterns. By generating large, physics-grounded synthetic datasets, optimizing architectures to preserve local peak relationships (NPCNN), and adapting to experimental data via sourcing, the models achieve state-of-the-art performance on experimental (RRUFF) and unseen materials (MP) and exhibit robustness to systematic peak shifts (lattice augmentation). The study elucidates how architectural properties (receptive field, permutation invariance, translation equivariance) govern reasoning on diffraction data and demonstrates that removing pooling and focusing on relative peak relations enhances generalization.
Future directions include: (i) building beam-source-agnostic models by using momentum transfer Q instead of 2θ; (ii) mitigating class imbalance via data balancing or loss reweighting; (iii) extending methods to other spectroscopy modalities (e.g., Raman, NMR) that share intensity-vs.-coordinate structure; and (iv) designing architectures that induce relative peak analysis while preserving crucial ordering information.
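Direction (i) rests on a standard identity: momentum transfer Q = 4π·sin(θ)/λ, which by Bragg's law equals 2π/d regardless of wavelength, so the same reflection lands at the same Q under any beam source. A small sketch (the 0.5 Å "synchrotron" wavelength and the d-spacing are illustrative):

```python
import math

def two_theta_to_q(two_theta_deg, wavelength):
    """Convert a 2-theta position (degrees) to momentum transfer Q = 4*pi*sin(theta)/lambda."""
    theta = math.radians(two_theta_deg) / 2.0
    return 4.0 * math.pi * math.sin(theta) / wavelength

# The same reflection (same d-spacing) seen at Cu K-alpha and at a
# hypothetical 0.5-angstrom beam maps to the same Q = 2*pi/d.
d = 2.0                                    # illustrative d-spacing, angstroms
qs = []
for lam in (1.54, 0.5):
    two_theta = 2.0 * math.degrees(math.asin(lam / (2.0 * d)))  # Bragg position
    qs.append(two_theta_to_q(two_theta, lam))
```

Training on a Q-axis rather than a 2θ-axis would therefore make patterns from different beamlines directly comparable.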
Limitations
- Class imbalance in the 171k ICSD-derived dataset leads to bias toward overrepresented space groups, affecting out-of-distribution performance and interpretability.
- Domain adaptation used only the RRUFF dataset (approximately half for training), and experimental coverage remains limited relative to synthetic data volume.
- MP evaluation patterns were synthetically generated; results may differ on fully experimental MP-like data under varied instruments and sample conditions.
- Models were developed primarily for Cu Kα (λ=1.54 Å) and 2θ representations; generalization across beam sources and geometries may require Q-space training.
- Synthetic noise and peak-shape models may not capture all real experimental artifacts (e.g., severe preferred orientation, texture, complex multiphase overlap).
- Architectural and hyperparameter choices optimized for reported datasets; performance may vary with different distributions or extreme conditions.