Introduction
Machine learning (ML) models are increasingly used in computational chemistry because their efficiency and accuracy enable much faster high-throughput screening than traditional physics-based methods. However, the accuracy of ML models depends heavily on the availability of large, clean, labeled datasets, and obtaining labels is costly, typically requiring Density Functional Theory (DFT) simulations or experiments. Conversely, vast unlabeled databases of material structures and compositions exist but cannot be used directly for supervised learning. This research addresses two key questions: (1) can efficient ML models be developed that learn from unlabeled data, and (2) can such models enhance the accuracy of supervised learning?

This work applies Self-Supervised Learning (SSL) to address these questions. Unlike supervised learning, SSL trains on unlabeled data, learning generalizable representations applicable to diverse downstream tasks. Successful SSL frameworks in computer vision and natural language processing, such as SimCLR, Barlow Twins, and SimSiam, have inspired similar approaches in molecular ML. However, crystalline materials differ significantly from molecules because of their periodic nature, the infinite repetition of unit cells, and the possibility of non-covalent bonding, so deep learning architectures designed specifically for crystals are necessary. Graph Neural Networks (GNNs) are well suited to this task, representing crystals as graphs in which atoms are nodes and interatomic interactions are edges; promising GNNs include CGCNN, OGCNN, SchNet, and MEGNet. While SSL has been applied to molecular ML, its application to periodic crystalline systems for property prediction remains relatively unexplored. This research introduces Crystal Twins (CT), an SSL framework for crystalline material property prediction using GNNs. CT employs CGCNN as the encoder and uses the Barlow Twins and SimSiam loss functions for self-supervised pre-training, learning representations without labeled data.
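To make the graph representation concrete, the following is a minimal sketch of how a crystal can be encoded as a graph with atoms as nodes and neighbor interactions as edges. It assumes the pymatgen library and an illustrative 5 Å cutoff; CGCNN's actual featurization (Gaussian-expanded distances, learned atom embeddings) differs in detail.

```python
# Minimal sketch: a crystal as a graph (atoms = nodes, neighbor interactions = edges).
# The cutoff radius and feature choices are illustrative, not the paper's settings.
import numpy as np
from pymatgen.core import Structure, Lattice

def crystal_to_graph(structure: Structure, cutoff: float = 5.0):
    """Return node features (atomic numbers) and edges within a cutoff,
    including bonds to periodic images of the unit cell."""
    node_z = np.array([site.specie.Z for site in structure])   # node features
    edges, edge_dist = [], []
    # get_all_neighbors handles periodic boundary conditions for us
    for i, nbrs in enumerate(structure.get_all_neighbors(cutoff)):
        for nbr in nbrs:
            edges.append((i, nbr.index))                        # directed edge i -> neighbor
            edge_dist.append(nbr.nn_distance)                   # edge feature: distance
    return node_z, np.array(edges), np.array(edge_dist)

# Example: rock-salt NaCl conventional cell
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
nodes, edges, dists = crystal_to_graph(nacl)
print(nodes.shape, edges.shape, dists.shape)
```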
Literature Review
The introduction provides a comprehensive overview of the existing literature on machine learning in materials science, self-supervised learning techniques, and graph neural networks for material property prediction. It highlights the limitations of supervised learning methods caused by the scarcity of labeled data and introduces the potential of self-supervised learning to address this challenge. The review covers prominent self-supervised learning frameworks in computer vision and natural language processing and discusses their adaptations to molecular machine learning. It also notes the lack of similar research for crystalline materials and the need for specialized deep learning architectures to handle their unique properties, pointing to graph neural networks (GNNs) as suitable candidates and citing examples of successful GNN architectures for material property prediction.
Methodology
The Crystal Twins (CT) framework comprises two main stages: pre-training and fine-tuning. In the pre-training stage, CT uses self-supervised learning (SSL) to learn representations from unlabeled crystalline material data, without any labels. Two SSL pre-training methods are adapted, Barlow Twins and SimSiam, and both use a twin Graph Neural Network (GNN) architecture. CGCNN is chosen as the encoder because of its computational efficiency and strong performance as a baseline model. For CTBarlow, the pre-training objective is to drive the cross-correlation matrix of the embeddings produced by the GNN for two augmented instances of the same crystal toward the identity matrix. For CTSimSiam, the objective is to maximize the cosine similarity between the embeddings of the two augmented instances; to avoid model collapse, a stop-gradient operation is applied to one branch. Three augmentation techniques generate the augmented instances: random perturbation of atom positions, atom masking (randomly masking 10% of atoms), and edge masking (randomly masking 10% of edge features).

In the fine-tuning stage, the pre-trained weights of the CGCNN encoder initialize the model for downstream tasks, and a randomly initialized multi-layer perceptron (MLP) head is added to the GNN to predict the material property. The Adam optimizer is used in both stages. Pre-training uses a combined dataset of 428,275 samples drawn from the Matminer database and a hypothetical Metal-Organic Framework dataset. The authors also implemented the GIN (Graph Isomorphism Network) architecture to evaluate the general applicability of the SSL methods. The GNNs use a mean-pooling readout operation to generate the final latent vector; the embedding dimension is 128 for Barlow Twins and 256 for SimSiam. The model is trained for 15 epochs with a batch size of 64 and a learning rate of 0.00001. Specific hyperparameters for both the pre-training and fine-tuning stages are provided in the supplementary information.
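As a concrete illustration of the two pre-training objectives, below is a minimal PyTorch sketch of the Barlow Twins and SimSiam losses as they would be applied to the embeddings of two augmented views of the same crystal graph. The encoder, projector, and predictor names and the off-diagonal weight lambd are illustrative placeholders, not the authors' code; the exact architectures and hyperparameters are given in the paper's supplementary information.

```python
# Sketch of the two Crystal Twins pre-training objectives in PyTorch.
# gnn_encoder / projector / predictor and lambd are illustrative placeholders.
import torch
import torch.nn.functional as F

def barlow_twins_loss(z1, z2, lambd=5e-3):
    """CTBarlow objective: push the cross-correlation matrix of the two
    batches of embeddings toward the identity matrix."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)      # standardize along the batch
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                               # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()    # pull diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push off-diagonal toward 0
    return on_diag + lambd * off_diag

def simsiam_loss(p1, p2, z1, z2):
    """CTSimSiam objective: maximize cosine similarity between the predictor
    output of one branch and the stop-gradient embedding of the other."""
    return -0.5 * (
        F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
        + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
    )

# Pre-training step (sketch): two augmented views of the same crystal graph
# are embedded by the shared GNN encoder, then the SSL loss is applied.
# z1 = projector(gnn_encoder(aug_view_1)); z2 = projector(gnn_encoder(aug_view_2))
# loss = barlow_twins_loss(z1, z2)                           # CTBarlow
# loss = simsiam_loss(predictor(z1), predictor(z2), z1, z2)  # CTSimSiam
```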
Key Findings
The Crystal Twins (CT) framework was evaluated on 14 datasets: 9 from the MatBench suite and 5 from previous work on CGCNN and OGCNN. These datasets cover material properties such as exfoliation energy, phonon frequency, band gap, formation energy, refractive index, bulk modulus, shear modulus, Fermi energy, and metallicity. Both CTBarlow and CTSimSiam outperform the supervised CGCNN baseline on 7 of the 9 MatBench datasets; on the remaining two, CT performance is within the standard deviation of the supervised model. Comparison with other competitive methods shows that CT outperforms AMMExpress on 6 of the 9 MatBench datasets. ALIGNN performs better on some tasks, which is attributed to its ability to model three-body interactions, something CGCNN (the encoder in CT) does not explicitly consider. On the 5 datasets from previous work, CTBarlow and CTSimSiam achieve lower mean absolute error (MAE) than the CGCNN, OGCNN, and GIN baselines; the MAE values and standard deviations for each dataset are reported in the manuscript (Table 2 and Table 3). A t-SNE visualization of the embedding space shows that the CT models separate data points by formation energy better than the embeddings generated by the CGCNN model, demonstrating the efficacy of the self-supervised learning approach. These findings suggest that the CT framework learns generalizable, robust representations that transfer to a variety of downstream material property prediction tasks.
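For readers who want to reproduce the kind of embedding-space visualization described above, the following is a minimal sketch that projects encoder embeddings to two dimensions with t-SNE and colors points by formation energy. The input file names and t-SNE settings are hypothetical placeholders, not the authors' configuration.

```python
# Sketch: t-SNE projection of pre-trained embeddings, colored by formation energy.
# File names and settings are hypothetical.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.load("ct_embeddings.npy")           # (n_samples, emb_dim), hypothetical file
formation_energy = np.load("formation_energy.npy")  # (n_samples,), hypothetical file

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=formation_energy, s=4, cmap="viridis")
plt.colorbar(label="Formation energy")
plt.title("t-SNE of pre-trained Crystal Twins embeddings")
plt.show()
```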
Discussion
The Crystal Twins framework demonstrates the use of self-supervised learning for crystalline material property prediction, achieving better performance than several supervised learning baselines. The improved performance highlights the potential of SSL to mitigate the limitations posed by the scarcity of labeled data in materials science, and the framework's effectiveness across diverse material properties underscores its robustness and wide applicability. The amount of data used for pre-training is small relative to other SSL domains, and performance is expected to improve further with larger unlabeled datasets, which would enable even more generalizable representations. The findings open avenues for future research in understanding the chemical space of materials and designing new materials with desired properties.
Conclusion
This paper introduces Crystal Twins (CT), a novel self-supervised learning framework for predicting material properties of crystalline materials. CT effectively leverages unlabeled data using a twin GNN architecture and achieves significant improvements in accuracy over several supervised learning baselines. The framework's success in handling diverse properties demonstrates its broad applicability and robustness. Further improvements are expected with larger datasets. Future work could explore the incorporation of three-body interactions and the application of CT to other materials discovery tasks.
Limitations
While the Crystal Twins framework shows significant improvement over supervised methods, the current implementation uses CGCNN as the encoder, which models only two-body interactions; this limits its ability to capture three-body interactions and hinders its performance relative to models such as ALIGNN. The pre-training dataset, although larger than in many previous self-supervised studies in materials science, is small compared to the scales used in other SSL application areas, and performance might improve with a substantial increase in pre-training data. Additionally, although hyperparameter optimization was carried out, a more extensive exploration might yield further performance gains.