Inhibitor_Mol_VAE: a variational autoencoder approach for generating corrosion inhibitor molecules

H. Gong, Z. Fu, et al.

This study presents Inhibitor_Mol_VAE, a variational autoencoder model designed to generate corrosion inhibitor molecules with specified inhibition efficiencies. The model identifies new molecules that demonstrate high inhibition efficiency at low concentrations, reconstructing and generating candidates conditioned on key physicochemical properties. The work was conducted by Haiyan Gong, Zhongheng Fu, Lingwei Ma, and Dawei Zhang.
Introduction

Material corrosion causes significant damage in industrial facilities, prompting protective measures such as coatings, electroplating, and corrosion inhibitors. Organic inhibitors often contain N, O, S, and P atoms and form adherent films on metal surfaces; their inhibition efficiencies (IEs) depend on the metal and the environment. Traditional discovery relies on weight-loss and electrochemical tests, while computational methods (DFT, MD, Monte Carlo) elucidate mechanisms. Machine learning QSAR models (e.g., SVM, ANN, MLR, RF) have predicted IE from molecular features, and recent graph neural networks using SMILES and molecular graphs can predict IE across concentrations. However, these approaches are discriminative and require evaluating vast chemical spaces one molecule at a time. Generative models can accelerate discovery by proposing candidates with desired properties, and variational autoencoders (VAEs) in particular provide a smooth latent space that enables interpolation and conditional generation. This work develops Inhibitor_Mol_VAE, a VAE-based framework that reconstructs and generates corrosion inhibitor molecules conditioned on molecular descriptors and targeted IEs, aiming to discover novel inhibitors with high IE at low concentration.

Literature Review

Prior QSAR studies have related IE to physicochemical descriptors and quantum chemical properties (e.g., dipole moment, polarizability, HOMO/LUMO, ionization potential, electronegativity, hardness/softness, electrophilicity, LogP, van der Waals surface metrics) and have used linear, nonlinear, and ML models (SVM, ANN, MLR, RF). Recent datasets encoded inhibitors as SMILES and used graph neural networks for cross-category IE prediction over 100–1000 g/mol on carbon steel in 1 M HCl at varying concentrations. In molecular generation, RNNs, attention-based models, graph VAEs, adversarial learning, reinforcement learning, and VAEs have been applied to de novo design and multiproperty control, particularly in drug and protein design. VAEs, via KL-regularized latent spaces, support smooth interpolation and conditional control (CVAE), making them suited to exploring chemical spaces for corrosion inhibitors with desired properties.

Methodology

Dataset: The study uses a corrosion inhibitor dataset of 1368 entries (Inhibitor1368) for carbon steels in HCl solutions (from Ma et al.), including SMILES, inhibitor concentration in mmol L−1 (Concentration_mM), and IE. RDKit computed seven molecular descriptors: QED, LogP, MolWt, HBA, HBD, TPSA, and NumRotBonds. Four datasets were prepared: Inhibitor1368-0 (SMILES only), Inhibitor1368-2 (SMILES + Concentration_mM + IE), Inhibitor1368-9 (SMILES + seven descriptors + Concentration_mM + IE), and Inhibitor1368-13 (Inhibitor1368-9 plus counts of N, O, S, and P atoms).

Encoding: SMILES were tokenized using a 50-token dictionary with start/end symbols; sequence length was fixed at 128. Molecular attributes were concatenated to produce input vectors of 128–141 dimensions. Token frequency imbalance was mitigated by weighting tokens using log frequency with a minimum weight (freq_penalty) and clipping to a range.

Models: Three VAE encoder–decoder architectures were used: RNN (GRU-based), RNNAttn (RNN with attention over GRU layers), and TransVAE (Transformer-based). Conditional training added a property MSE term to the loss alongside the reconstruction loss and KL divergence. Data were split 80/10/10 (train/test/validation). Training used batch size 16, 100 warmup steps, and up to 200 epochs; the best epoch was selected by total loss on train/test for each model-dataset pair (R_0, R_2, R_9, R_13, T_0, T_2, T_9, T_13, RA_0, RA_2, RA_9, RA_13), e.g., RA_13 at epoch 30, RA_0 at 185, and R_0 at 95.

Condition selection via Random Forest (RF): To choose conditional inputs, RF IE-prediction models were trained with (i) RF_input_1: seven descriptors + N/O/S/P counts + Concentration_mM (train RMSE 0.097, MAE 0.066, R² 0.685; validation RMSE 0.093, MAE 0.061, R² 0.689). Top features: MolWt, Concentration_mM, LogP, TPSA, QED, S_Count, NumRotBonds, and HBA.
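The token-weighting step described above (log-frequency weights with a freq_penalty floor and range clipping) can be sketched in pure Python. The exact formula is not given in the summary, so the inverse-log-frequency form, the default freq_penalty, and the clipping range below are assumptions for illustration:

```python
import math
from collections import Counter

def token_weights(token_seqs, freq_penalty=0.5, w_min=0.1, w_max=5.0):
    # Count token occurrences across all tokenized SMILES sequences.
    counts = Counter(tok for seq in token_seqs for tok in seq)
    weights = {}
    for tok, c in counts.items():
        # Inverse log-frequency: rare tokens receive larger weights.
        # freq_penalty floors the denominator so rare tokens are not
        # weighted to infinity; the result is clipped to [w_min, w_max].
        w = 1.0 / max(freq_penalty, math.log(1.0 + c))
        weights[tok] = min(w_max, max(w_min, w))
    return weights
```

Used on tokenized SMILES such as `[["C","C","C","C","N"], ["C","C","O"]]`, the rare heteroatom tokens N and O receive higher loss weights than the frequent carbon token, which is the intended imbalance mitigation.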
(ii) RF_input_2: quantum chemical parameters (dipole moment, HOMO, LUMO, gap, ionization potential, electron affinity, electronegativity, hardness, softness, electrophilicity, electron donor/acceptor capacity, Vdw_volume, Vdw_surface_area) + N/O/S/P counts + Concentration_mM; top features: Concentration_mM, Vdw_volume, and Electronegativity. (iii) RF_input_3: the union of (i) and (ii) yielded higher RMSE (0.099), higher MAE (0.0701), and lower R² (0.484) than the descriptors alone. Given the computational cost and mixed benefit of quantum parameters, chemical descriptors were used as the conditioning properties.

Evaluation metrics: Reconstruction metrics included Reconstruction_acc (exact SMILES match), Token_acc_mean (average per-position token match rate), and Tanimoto_similarity_mean (Morgan fingerprint similarity). Generation-effectiveness metrics included Valid_ratio (fraction of RDKit-valid SMILES), Unique_ratio (non-redundant fraction), Valid_new_ratio (unique, valid molecules absent from the training set), none_CAS_number_ratio (a proxy for novelty), and property differences (MolWt, LogP, Vdw_volume, Electronegativity) between reconstructed and raw molecules. IE differences were assessed by predicting IE for reconstructed molecules with the trained RF model over common concentrations (e.g., 1, 0.5, 0.1, 0.05, 0.01, and 0.0015 mM).

Batch generation: For de novo generation, given a reference SMILES and target attributes (Concentration_mM, IE, IE difference, number of molecules, noise type), the encoder produced a latent memory vector m, to which Gaussian (N(0, σ²I)) or uniform noise was added per step (noise σ_i = base × factor). The decoder generated SMILES strings, which were validated, featurized, and scored (descriptors and RF-predicted IE); molecules meeting the target IE-difference threshold at the specified concentration were retained.
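The batch-generation perturbation step, adding step-scaled Gaussian or uniform noise to the latent memory vector m, can be sketched as follows. The summary gives σ_i = base × factor without specifying the schedule, so the geometric growth per step below, the default values, and the use of plain lists in place of tensors are assumptions:

```python
import random

def perturb_latent(m, n_samples, base=0.1, factor=1.5, noise="gaussian"):
    """Generate n_samples noisy copies of latent memory vector m.

    Assumed schedule: sigma_i = base * factor**i, so later samples
    wander further from the reference molecule's latent point.
    """
    variants = []
    for i in range(n_samples):
        sigma = base * factor ** i
        if noise == "gaussian":
            # Gaussian noise, matching the VAE's normal latent prior.
            z = [x + random.gauss(0.0, sigma) for x in m]
        else:
            # Uniform noise in [-sigma, sigma].
            z = [x + random.uniform(-sigma, sigma) for x in m]
        variants.append(z)
    return variants
```

Each perturbed vector would then be passed to the decoder, and the resulting SMILES validated and scored as described above.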
A preliminary analysis showed IE difference decreases as Tanimoto similarity increases, motivating generation of lower-similarity molecules for larger IE gains.
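The similarity analysis above uses Tanimoto similarity over Morgan fingerprints. A minimal sketch on fingerprints represented as sets of on-bit indices (RDKit's fingerprint objects are replaced by plain sets for illustration):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    on-bit indices: |A & B| / |A | B|. Two empty fingerprints are
    treated as identical (similarity 1.0)."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0
    return len(fp_a & fp_b) / union
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` gives 0.5: two shared bits over four distinct bits. Generating molecules with low similarity to the reference corresponds to small values of this ratio.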

Key Findings
  • Reconstruction performance (validation set of 137 molecules): RA_0 achieved the highest Reconstruction_acc (0.825), Token_acc_mean (0.887), and Tanimoto_similarity_mean (0.893), followed by RA_13 (0.708, 0.876, 0.888) and R_0 (0.547, 0.741, 0.852). Models including concentration/IE as conditions (e.g., RA_2) suffered accuracy drops, likely due to data imbalance in concentration and IE.
  • New molecule qualities (reconstruction-based generation): RA_13 balanced validity, uniqueness, and novelty with Valid_ratio 0.925, Unique_ratio 0.784, Valid_new_ratio 0.224, none_CAS_number_ratio 0.714. R_0 showed Valid_ratio 0.9, Unique_ratio 0.778, Valid_new_ratio 0.283. Transformer models T_9 and T_13 had lower validity and uniqueness but higher novelty (Valid_new_ratio 0.364 and 0.261, respectively).
  • Property differences: R_0, R_13, T_0, T_2, and RA_13 yielded reconstructed molecules with small mean differences (near zero) in MolWt and LogP, and also in Vdw_volume and Electronegativity, indicating property preservation. R_2, R_9, T_9, and T_13 showed larger deviations, aligning with their higher novelty.
  • IE differences (reconstructed molecules): IE_diff values fell mostly within −0.2 to 0.2 across models. Although R_2 created many new molecules, many of its gains were tied simply to increased carbon count. T_9 and T_13 more effectively identified novel molecules with IE improvements (IE_diff > 0) over the originals. For example, CN(C)c1ccc(/C=C/C2=NN(C(N)=S)C(c3ccc(N(C)C)cc3)C2)cc1 reached IE 0.939 at 0.1 mM (IE_diff 0.468 versus the original at a higher concentration), indicating potential to reduce concentration while maintaining high IE.
  • Batch generation (target IE difference 0.1 at 1 mM, 1000 samples): R_2 produced the most qualifying molecules (Gaussian 39, uniform 41), followed by R_13 (21 Gaussian, 16 uniform). Gaussian noise generally outperformed uniform, likely due to alignment with the VAE’s standard normal latent prior. Models with very high reconstruction accuracy (e.g., RA_0, RA_13) or low Valid_ratio tended to regenerate originals or invalid strings, reducing yield of qualifying novel molecules.
  • Overall, RNNAttn without conditions (RA_0) and with extended conditions (RA_13) excelled at accurate reconstruction; Transformer-based models (T_9, T_13) excelled at novelty and finding higher-IE candidates at lower concentrations during reconstruction-based screening. RNN with minimal conditions (R_2) was most effective for condition-targeted batch generation.
  • Discovery: The framework identified novel inhibitors with high IE at low concentrations, including [ethoxy(methoxy)phosphoryl]-phenylmethanol and (alpha-(methylamino)benzyl)phosphonic acid monoethyl ester.

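The validity and novelty figures reported above follow the effectiveness metrics defined in the Methodology. A minimal sketch, assuming all three ratios share the total number of generated strings as denominator and using a stand-in validity check in place of RDKit SMILES parsing:

```python
def generation_metrics(generated, is_valid, training_set):
    """Compute Valid_ratio, Unique_ratio, and Valid_new_ratio for a
    batch of generated SMILES strings. `is_valid` stands in for RDKit
    parsing; `training_set` is the set of training SMILES."""
    n = len(generated)
    if n == 0:
        return {"Valid_ratio": 0.0, "Unique_ratio": 0.0, "Valid_new_ratio": 0.0}
    valid = [s for s in generated if is_valid(s)]
    unique_valid = set(valid)                      # de-duplicated valid strings
    new = unique_valid - set(training_set)         # valid, unique, and unseen
    return {
        "Valid_ratio": len(valid) / n,
        "Unique_ratio": len(unique_valid) / n,
        "Valid_new_ratio": len(new) / n,
    }
```

For a batch of four strings with one invalid, one duplicate, and one training-set match, the sketch yields 0.75, 0.5, and 0.25 for the three ratios respectively, mirroring how the tabulated values trade off against one another.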
Discussion

The study demonstrates that a VAE-based conditional generative model can learn chemically valid latent representations of corrosion inhibitors and generate novel candidates with targeted attributes. Across architectures, RNNAttn achieved the best reconstruction fidelity, indicating strong retention of structural information, while Transformer-based models, though less accurate in reconstruction, explored chemical space more broadly, yielding higher novelty and more candidates with predicted IE improvements at reduced concentrations. Random Forest analyses showed that readily computed chemical descriptors (MolWt, LogP, TPSA, QED, etc.) and concentration strongly influence IE predictions, while adding quantum chemical features did not improve predictive accuracy and increased computational cost. Consequently, conditioning on chemical descriptors is a pragmatic and effective strategy. The observed inverse relation between structural similarity and IE difference suggests that substantial IE gains require exploring latent space away from the reference molecule, consistent with the batch-generation results, in which diversity-favoring models generated more improved candidates. Collectively, these findings support using Inhibitor_Mol_VAE both to reconstruct and to propose inhibitors with high predicted efficiency at lower dosages, addressing the need to search vast chemical spaces more efficiently than discriminative screening allows.

Conclusion

This work presents Inhibitor_Mol_VAE, a conditional VAE framework (RNN, RNNAttn, and Transformer variants) for generating corrosion inhibitor molecules with targeted inhibition efficiency. Using a curated dataset of 1368 inhibitors with descriptors and concentrations, the method encodes SMILES into a continuous latent space and decodes to valid molecules while conditioning on molecular properties. Key contributions include: (i) establishing effective tokenization and weighting for SMILES; (ii) systematically comparing architectures and conditioning levels; (iii) demonstrating that descriptor-based conditioning suffices for targeted generation; (iv) discovering candidate inhibitors with higher predicted IE at lower concentrations; and (v) providing a batch generation pipeline leveraging Gaussian noise for efficient candidate discovery. Future work will focus on synthesizability assessment and laboratory validation (surface analysis, weight-loss, electrochemical testing) of top candidates and on extending models to composite inhibitors and broader environmental conditions.

Limitations
  • Data imbalance in concentrations (primarily 0.001–5 mM with peaks at <1 mM) and IE (peak 0.8–0.9) affected conditional models (e.g., RA_2) and may bias generation toward common conditions.
  • Limited dataset size (N=1368) constrains generalization and the diversity of learnable chemistries; some classes are underrepresented.
  • IE improvements are assessed via an RF predictor trained on descriptors rather than experimental validation; prediction errors (R² ~0.69 for descriptor model) propagate to candidate ranking.
  • Quantum chemical features, while potentially informative, were not used in generation due to computational complexity and mixed predictive benefit; this may limit capturing subtle electronic effects.
  • Trade-off between reconstruction fidelity and novelty: models with high reconstruction accuracy tend to produce fewer novel candidates; Transformer models with higher novelty showed lower validity/uniqueness in some settings.
  • CAS-based novelty estimates may be incomplete; lack of CAS does not guarantee novelty or synthesizability.
  • Stereochemistry handling and SMILES tokenization choices may influence chiral correctness and chemical validity of outputs.