Crystal Twins: Self-Supervised Learning for Crystalline Material Property Prediction

Engineering and Technology


R. Magar, Y. Wang, A. Barati Farimani

Discover the groundbreaking Crystal Twins (CT) method developed by Rishikesh Magar, Yuyang Wang, and Amir Barati Farimani, which harnesses self-supervised learning to effectively predict material properties using large unlabeled datasets. This innovative approach, employing twin Graph Neural Networks, has shown remarkable improvements in GNN performance across 14 benchmarks.

Introduction
Machine learning-based predictive models have shown strong efficiency and accuracy in computational chemistry and materials science, enabling high-throughput screening. A key bottleneck is the scarcity and cost of large, clean labeled datasets, which typically require DFT simulations or experiments. Meanwhile, large unlabeled databases of crystal structures are readily available but cannot be used directly for supervised learning. This motivates two questions: can structural chemistry be learned from unlabeled data, and can such learned representations improve supervised downstream tasks?

To address this, the authors leverage self-supervised learning (SSL), which derives supervision from the data itself to learn robust, generalizable representations. While SSL has excelled in vision and NLP and has been explored for molecules, periodic crystalline materials differ fundamentally from molecules (e.g., infinite periodicity, non-covalent interactions) and require specialized crystal-focused architectures. Graph neural networks (GNNs) are prominent for crystalline property prediction, representing atoms as nodes and interatomic interactions as edges, with notable models including CGCNN, OGCNN, SchNet, and MEGNet. Despite SSL successes in molecular ML, SSL methods tailored to periodic crystals have been lacking.

The paper introduces Crystal Twins (CT), an SSL framework for crystalline materials built on a twin GNN encoder (CGCNN). No labels are used during pre-training. Two variants are proposed: CT_Barlow (Barlow Twins loss) and CT_SimSiam (SimSiam loss). The model generates two augmented views of the same crystal via random perturbations, atom masking, and edge masking, then encourages their embeddings to match, either by driving the cross-correlation matrix of the two views toward the identity (CT_Barlow) or by maximizing cosine similarity under a stop-gradient asymmetry (CT_SimSiam). The pre-trained encoder weights are then transferred to downstream supervised tasks, yielding improved performance across 14 datasets.
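A minimal sketch of this pretrain-then-transfer workflow, assuming a PyTorch-style setup: a plain MLP stands in for the CGCNN graph encoder, and `augment` and `ssl_loss` are toy placeholders for the augmentations and twin objectives detailed in the Methodology section, not the authors' code.

```python
# Minimal sketch (not the authors' code) of the Crystal Twins pretrain-then-transfer workflow.
import torch
import torch.nn as nn

def augment(x: torch.Tensor) -> torch.Tensor:
    """Toy stand-in augmentation: small random perturbation plus masking ~10% of features."""
    noisy = x + 0.05 * torch.rand_like(x)
    mask = (torch.rand_like(x) > 0.10).float()
    return noisy * mask

def ssl_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Placeholder twin objective: pull the two views' embeddings together."""
    return -nn.functional.cosine_similarity(z_a, z_b, dim=-1).mean()

encoder = nn.Sequential(nn.Linear(92, 128), nn.ReLU(), nn.Linear(128, 64))    # stand-in for CGCNN
projector = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))  # pretraining-only head
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()), lr=1e-5)

# --- Pre-training on unlabeled crystals: no property labels are used. ---
x = torch.randn(64, 92)                       # dummy batch (92 is an arbitrary placeholder dimension)
z_a = projector(encoder(augment(x)))          # first augmented view
z_b = projector(encoder(augment(x)))          # second augmented view of the same crystals
loss = ssl_loss(z_a, z_b)
optimizer.zero_grad(); loss.backward(); optimizer.step()

# --- Fine-tuning: reuse the pre-trained encoder, add a fresh 2-layer MLP head. ---
head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
model = nn.Sequential(encoder, head)          # trained end to end on the labeled downstream task
```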
Literature Review
The study builds on SSL advances such as SimCLR, Barlow Twins, BYOL, SwAV, MoCo, and SimSiam, which achieved state-of-the-art results in vision and NLP and inspired molecular ML SSL methods such as MolCLR, dual-view molecule pre-training, and 3D Infomax. For crystalline materials, prior work predominantly uses GNNs tailored to periodic structures, including CGCNN, OGCNN, SchNet, and MEGNet, along with recent progress in materials generation and discovery. However, SSL methods had largely been developed for finite molecules rather than periodic crystals, leaving a gap for crystal-specific SSL approaches that this work addresses by adapting twin-network SSL to crystal graphs.
Methodology
Crystal Twins (CT) is a self-supervised pretraining framework for crystal-property GNNs. The encoder is CGCNN, chosen for its competitiveness and efficiency. Two SSL objectives are implemented (a minimal sketch of both follows below):

- CT_Barlow (Barlow Twins): Two augmented views of the same crystal are embedded by a shared CGCNN encoder and an MLP projector. The empirical cross-correlation matrix between the two sets of embeddings is driven toward the identity matrix, encouraging invariance on the diagonal and suppressing redundancy off the diagonal. The loss is the sum of squared deviations of the diagonal from 1 plus a weighted penalty on the off-diagonal terms (lambda = 0.0051, as in the original Barlow Twins work).
- CT_SimSiam (SimSiam): Two augmented views are passed through the same CGCNN; one branch carries a predictor MLP head and the other applies stop-gradient. The objective maximizes the cosine similarity between the predictor output and the stop-gradient embedding, with the architectural asymmetry preventing collapse.

Augmentations: Following ideas from AugLiChem, three augmentations form two distinct views per crystal at each epoch (see the sketch after this section): (1) random perturbation of atomic positions by a distance sampled uniformly between 0 and 0.05 Å; (2) atom masking, randomly masking 10% of atoms; (3) edge masking, randomly masking 10% of edge features.

Encoder and heads: CGCNN produces crystal-level embeddings via mean pooling (latent dimension 64 for downstream embeddings; pretraining used 128 for CT_Barlow and 256 for CT_SimSiam). A 2-layer MLP projector is attached during pretraining for both variants; CT_SimSiam additionally adds a predictor head on one branch.

Training details: Pretraining uses Adam with a learning rate of 1e-5 and batch size 64 for 15 epochs. The pretraining data aggregate 428,275 unlabeled structures from Matminer databases and a hypothetical MOF dataset, split 95%/5% into train/validation. For supervised fine-tuning, a randomly initialized 2-layer MLP head is added and the encoder is initialized with the pre-trained weights. Fine-tuning hyperparameters follow MatBench protocols (nested 5-fold CV) for the MatBench datasets; additional details are in the supplementary information.

Downstream tasks: The framework is evaluated on 14 benchmarks (13 regression, 1 classification), covering properties such as exfoliation energy, phonon frequency peak, band gap, formation energy, refractive index, bulk and shear moduli, Fermi energy, and metallicity. Comparisons include CGCNN, AMMExpress, ALIGNN, OGCNN, and GIN baselines.
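A minimal sketch of the two objectives as described above, assuming PyTorch tensors of shape (batch, dim); variable names and normalization details are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the CT_Barlow and CT_SimSiam objectives as described above.
import torch
import torch.nn.functional as F

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 0.0051) -> torch.Tensor:
    """CT_Barlow: drive the cross-correlation matrix of the two views toward the identity."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-9)      # normalize each embedding dimension
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-9)
    c = (z_a.T @ z_b) / n                                 # d x d empirical cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # push diagonal toward 1 (invariance)
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # suppress off-diagonal (redundancy)
    return on_diag + lambd * off_diag

def simsiam_loss(p_a: torch.Tensor, z_b: torch.Tensor,
                 p_b: torch.Tensor, z_a: torch.Tensor) -> torch.Tensor:
    """CT_SimSiam: negative cosine similarity, with stop-gradient on the target branch."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()  # .detach() = stop-gradient
    return 0.5 * neg_cos(p_a, z_b) + 0.5 * neg_cos(p_b, z_a)
```

Either loss could replace the placeholder `ssl_loss` in the workflow sketch above; for the SimSiam variant the projector outputs would additionally pass through the predictor head to produce p_a and p_b.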
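A minimal sketch of the three augmentations on plain NumPy arrays, assuming `coords` of shape (n_atoms, 3) plus per-atom and per-edge feature matrices; the zero-vector masking convention and function names are assumptions, not the AugLiChem/Crystal Twins implementation.

```python
# Hedged sketch of the three crystal-graph augmentations described above.
import numpy as np

rng = np.random.default_rng(0)

def perturb_positions(coords: np.ndarray, max_dist: float = 0.05) -> np.ndarray:
    """Displace each atom along a random direction by a distance sampled in [0, max_dist] Å."""
    direction = rng.normal(size=coords.shape)
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    dist = rng.uniform(0.0, max_dist, size=(coords.shape[0], 1))
    return coords + direction * dist

def mask_atoms(node_feats: np.ndarray, frac: float = 0.10) -> np.ndarray:
    """Zero out the feature vectors of a random 10% of atoms."""
    out = node_feats.copy()
    idx = rng.choice(node_feats.shape[0], size=max(1, int(frac * node_feats.shape[0])), replace=False)
    out[idx] = 0.0
    return out

def mask_edges(edge_feats: np.ndarray, frac: float = 0.10) -> np.ndarray:
    """Zero out a random 10% of edge feature vectors."""
    out = edge_feats.copy()
    idx = rng.choice(edge_feats.shape[0], size=max(1, int(frac * edge_feats.shape[0])), replace=False)
    out[idx] = 0.0
    return out

def make_view(coords, node_feats, edge_feats):
    """One augmented view of a crystal; calling this twice yields the two CT views."""
    return perturb_positions(coords), mask_atoms(node_feats), mask_edges(edge_feats)
```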
Key Findings
- Across 9 MatBench datasets with nested 5-fold CV, CT_Barlow and CT_SimSiam outperform the supervised CGCNN baseline on 7/9 tasks. Improvements are seen on JDFT (exfoliation energy), Phonons (phonon frequency peak), Dielectric (refractive index), GVRH (shear modulus), KVRH (bulk modulus), Perovskites (formation energy), and MP-BG (band gap). On MP-FE (formation energy), CT models are slightly worse than CGCNN (CGCNN: 0.033 ± 0.001; CT_Barlow: 0.037 ± 0.001; CT_SimSiam: 0.037 ± 0.000). On Is Metal, CT performance is within CGCNN's standard deviation (CGCNN: 0.952 ± 0.007; CT_Barlow: 0.945 ± 0.004; CT_SimSiam: 0.947 ± 0.003).
- Compared to AMMExpress on MatBench, CT models outperform on 6/9 datasets.
- Compared to ALIGNN, CT models generally underperform on regression tasks (ALIGNN benefits from modeling three-body interactions) but outperform ALIGNN on the classification task Is Metal.
- On 5 additional regression benchmarks (from OGCNN/Materials Project/other sources), CT achieves state-of-the-art results among the compared baselines. Notable MAE results (mean ± std over 3 runs):
  • HOIP: CT_SimSiam 0.140 ± 0.004 (best); CT_Barlow 0.153 ± 0.003; OGCNN 0.164 ± 0.013; CGCNN 0.170 ± 0.013.
  • Lanthanides (band gap): CT_SimSiam 0.054 ± 0.001 (best); CT_Barlow 0.058 ± 0.001; OGCNN 0.072 ± 0.002; CGCNN 0.080 ± 0.003.
  • Fermi Energy: CT_SimSiam 0.384 ± 0.004 (best); CT_Barlow 0.399 ± 0.004; CGCNN 0.400 ± 0.003; OGCNN 0.446 ± 0.018.
  • Formation Energy: CT_SimSiam 0.024 ± 0.001 (best); CT_Barlow 0.025 ± 0.001; OGCNN 0.035 ± 0.001; CGCNN 0.040 ± 0.001.
  • Band Gap: CT_SimSiam 0.302 ± 0.001 (best); CT_Barlow 0.328 ± 0.002; OGCNN 0.353 ± 0.008; CGCNN 0.369 ± 0.003.
- Representation quality: t-SNE visualization on Perovskites shows that the CT fine-tuned embeddings organize crystals more coherently with respect to formation energy than supervised CGCNN embeddings.
- Data efficiency: CT achieves these gains with pretraining on only ~428k unlabeled structures, far fewer than typical SSL scales in CV/NLP, suggesting further room for improvement with more unlabeled data.
Discussion
The findings demonstrate that self-supervised pretraining on unlabeled crystal structures yields transferable representations that improve accuracy and robustness on diverse crystalline property prediction tasks. By enforcing invariance across augmented views of the same crystal, CT learns structure-aware embeddings that generalize across datasets and properties. The consistent gains over supervised CGCNN on most MatBench tasks, the strong results on five additional benchmarks, and competitive classification performance against ALIGNN indicate that SSL can mitigate labeled data scarcity in materials informatics. The analysis underscores the role of architectural inductive biases: ALIGNN’s explicit modeling of three-body interactions confers advantages on many regression tasks, whereas CT (built on CGCNN’s two-body interactions) still achieves notable improvements via SSL but does not surpass ALIGNN except on classification. The t-SNE embedding structure further suggests that CT captures chemically meaningful organization, supporting its utility for exploring materials chemical space and aiding downstream discovery workflows.
Conclusion
This work introduces Crystal Twins, a twin-network self-supervised framework for crystalline materials using CGCNN encoders with Barlow Twins and SimSiam objectives. Without labeled data during pretraining, CT learns robust representations that substantially improve downstream property prediction on 14 benchmarks, outperforming strong supervised baselines and achieving state-of-the-art results on multiple datasets. The approach is simple, label-efficient, and extensible to diverse crystalline properties. Future directions include scaling pretraining to much larger unlabeled crystal corpora, integrating architectures that capture higher-order interactions (e.g., three-body terms), exploring additional crystal-specific augmentations, and extending CT to generative design and active learning pipelines.
Limitations
- Architectural constraint: Using CGCNN as the encoder models primarily two-body interactions; methods like ALIGNN that incorporate three-body interactions outperform CT on many regression tasks.
- Pretraining data scale: SSL generally benefits from very large unlabeled datasets; the study uses ~428k structures, smaller than typical SSL regimes in CV/NLP, suggesting room for further gains with more data.
- Mixed results on MatBench MP-FE and Is Metal: CT underperforms CGCNN on MP-FE and is only within its standard deviation on Is Metal, indicating that gains are not universal across all tasks.
- Augmentation choices: The study uses three augmentations (perturbation, atom masking, edge masking); broader or task-specific augmentation strategies might further improve representation learning.