Inhibitor_Mol_VAE: a variational autoencoder approach for generating corrosion inhibitor molecules

H. Gong, Z. Fu, et al.

This study presents Inhibitor_Mol_VAE, a variational autoencoder model designed to generate corrosion inhibitor molecules with specified inhibition efficiencies. The model identifies new molecules that demonstrate high inhibition efficiency at low concentrations, reconstructing and generating candidates conditioned on key physicochemical properties. The work was conducted by Haiyan Gong, Zhongheng Fu, Lingwei Ma, and Dawei Zhang.
Introduction

Material corrosion causes significant damage in industrial facilities, prompting protective measures such as coatings, electroplating, and corrosion inhibitors. Organic inhibitors often contain N, O, S, and P atoms and form adherent films on metal surfaces; their inhibition efficiencies (IEs) depend on the metal and the environment. Traditional discovery relies on weight-loss and electrochemical tests, while computational methods (DFT, MD, Monte Carlo) elucidate mechanisms. Machine learning QSAR models (e.g., SVM, ANN, MLR, RF) have predicted IE from molecular features, and recent graph neural networks using SMILES and molecular graphs can predict IE across concentrations. However, these approaches are discriminative and require evaluating vast chemical spaces one molecule at a time. Generative models can accelerate discovery by proposing candidates with desired properties, and variational autoencoders (VAEs) in particular provide a smooth latent space that enables interpolation and conditional generation. This work develops Inhibitor_Mol_VAE, a VAE-based framework that reconstructs and generates corrosion inhibitor molecules conditioned on molecular descriptors and targeted IEs, aiming to discover novel inhibitors with high IE at low concentration.

Literature Review

Prior QSAR studies have related IE to physicochemical descriptors and quantum chemical properties (e.g., dipole moment, polarizability, HOMO/LUMO, ionization potential, electronegativity, hardness/softness, electrophilicity, LogP, van der Waals surface metrics) and have used linear, nonlinear, and ML models (SVM, ANN, MLR, RF). Recent datasets encoded inhibitors as SMILES and used graph neural networks for cross-category IE prediction over 100–1000 g/mol on carbon steel in 1 M HCl at varying concentrations. In molecular generation, RNNs, attention-based models, graph VAEs, adversarial learning, reinforcement learning, and VAEs have been applied to de novo design and multiproperty control, particularly in drug and protein design. VAEs, via KL-regularized latent spaces, support smooth interpolation and conditional control (CVAE), making them suited to exploring chemical spaces for corrosion inhibitors with desired properties.

Methodology

Dataset: The study uses a corrosion inhibitor dataset of 1368 entries (Inhibitor1368) for carbon steels in HCl solutions (from Ma et al.), including SMILES, inhibitor concentration in mmol L−1 (Concentration_mM), and IE. RDKit computed seven molecular descriptors: QED, LogP, MolWt, HBA, HBD, TPSA, and NumRotBonds. Four datasets were prepared: Inhibitor1368-0 (SMILES only), Inhibitor1368-2 (SMILES + Concentration_mM + IE), Inhibitor1368-9 (SMILES + seven descriptors + Concentration_mM + IE), and Inhibitor1368-13 (Inhibitor1368-9 plus counts of N, O, S, and P atoms).

Encoding: SMILES were tokenized using a 50-token dictionary with start/end symbols; sequence length was fixed at 128. Molecular attributes were concatenated to produce input vectors of 128–141 dimensions. Token frequency imbalance was mitigated by weighting tokens using log frequency with a minimum weight (freq_penalty) and clipping to a range.

Models: Three VAE encoder–decoder architectures were used: RNN (GRU-based), RNNAttn (RNN with attention over GRU layers), and TransVAE (Transformer-based). Conditional training added a property MSE term to the loss alongside the reconstruction loss and KL divergence. Data were split 80/10/10 (train/test/validation). Training used batch size 16, 100 warmup steps, and up to 200 epochs; the best epoch was selected by total loss on train/test for each model-dataset pair (R_0, R_2, R_9, R_13, T_0, T_2, T_9, T_13, RA_0, RA_2, RA_9, RA_13), e.g., RA_13 at epoch 30, RA_0 at 185, and R_0 at 95.

Condition selection via Random Forest (RF): To choose conditional inputs, RF IE-prediction models were trained with (i) RF_input_1: seven descriptors + N/O/S/P counts + Concentration_mM (train RMSE 0.097, MAE 0.066, R² 0.685; validation RMSE 0.093, MAE 0.061, R² 0.689). Top features: MolWt, Concentration_mM, LogP, TPSA, QED, S_Count, NumRotBonds, and HBA.
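The token-weighting step described above (log-frequency weights with a freq_penalty floor and range clipping) can be sketched in pure Python. The exact formula is not given in the summary, so the inverse-log-frequency form, the default freq_penalty, and the clipping range below are assumptions for illustration:

```python
import math
from collections import Counter

def token_weights(token_seqs, freq_penalty=0.5, w_min=0.1, w_max=5.0):
    # Count token occurrences across all tokenized SMILES sequences.
    counts = Counter(tok for seq in token_seqs for tok in seq)
    weights = {}
    for tok, c in counts.items():
        # Inverse log-frequency: rare tokens receive larger weights.
        # freq_penalty floors the denominator so rare tokens are not
        # weighted to infinity; the result is clipped to [w_min, w_max].
        w = 1.0 / max(freq_penalty, math.log(1.0 + c))
        weights[tok] = min(w_max, max(w_min, w))
    return weights
```

Used on tokenized SMILES such as `[["C","C","C","C","N"], ["C","C","O"]]`, the rare heteroatom tokens N and O receive higher loss weights than the frequent carbon token, which is the intended imbalance mitigation.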
(ii) RF_input_2: quantum chemical parameters (dipole moment, HOMO, LUMO, gap, ionization potential, electron affinity, electronegativity, hardness, softness, electrophilicity, electron donor/acceptor capacity, Vdw_volume, Vdw_surface_area) + N/O/S/P counts + Concentration_mM; top features: Concentration_mM, Vdw_volume, and Electronegativity. (iii) RF_input_3: the union of (i) and (ii) yielded higher RMSE (0.099), higher MAE (0.0701), and lower R² (0.484) than the descriptors alone. Given the computational cost and mixed benefit of quantum parameters, chemical descriptors were used as the conditioning properties.

Evaluation metrics: Reconstruction metrics included Reconstruction_acc (exact SMILES match), Token_acc_mean (average per-position token match rate), and Tanimoto_similarity_mean (Morgan fingerprint similarity). Generation-effectiveness metrics included Valid_ratio (fraction of RDKit-valid SMILES), Unique_ratio (non-redundant fraction), Valid_new_ratio (unique, valid molecules absent from the training set), none_CAS_number_ratio (a proxy for novelty), and property differences (MolWt, LogP, Vdw_volume, Electronegativity) between reconstructed and raw molecules. IE differences were assessed by predicting IE for reconstructed molecules with the trained RF model over common concentrations (e.g., 1, 0.5, 0.1, 0.05, 0.01, and 0.0015 mM).

Batch generation: For de novo generation, given a reference SMILES and target attributes (Concentration_mM, IE, IE difference, number of molecules, noise type), the encoder produced a latent memory vector m, to which Gaussian (N(0, σ²I)) or uniform noise was added per step (noise σ_i = base × factor). The decoder generated SMILES strings, which were validated, featurized, and scored (descriptors and RF-predicted IE); molecules meeting the target IE-difference threshold at the specified concentration were retained.
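The batch-generation perturbation step, adding step-scaled Gaussian or uniform noise to the latent memory vector m, can be sketched as follows. The summary gives σ_i = base × factor without specifying the schedule, so the geometric growth per step below, the default values, and the use of plain lists in place of tensors are assumptions:

```python
import random

def perturb_latent(m, n_samples, base=0.1, factor=1.5, noise="gaussian"):
    """Generate n_samples noisy copies of latent memory vector m.

    Assumed schedule: sigma_i = base * factor**i, so later samples
    wander further from the reference molecule's latent point.
    """
    variants = []
    for i in range(n_samples):
        sigma = base * factor ** i
        if noise == "gaussian":
            # Gaussian noise, matching the VAE's normal latent prior.
            z = [x + random.gauss(0.0, sigma) for x in m]
        else:
            # Uniform noise in [-sigma, sigma].
            z = [x + random.uniform(-sigma, sigma) for x in m]
        variants.append(z)
    return variants
```

Each perturbed vector would then be passed to the decoder, and the resulting SMILES validated and scored as described above.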
A preliminary analysis showed IE difference decreases as Tanimoto similarity increases, motivating generation of lower-similarity molecules for larger IE gains.
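The similarity analysis above uses Tanimoto similarity over Morgan fingerprints. A minimal sketch on fingerprints represented as sets of on-bit indices (RDKit's fingerprint objects are replaced by plain sets for illustration):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    on-bit indices: |A & B| / |A | B|. Two empty fingerprints are
    treated as identical (similarity 1.0)."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0
    return len(fp_a & fp_b) / union
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` gives 0.5: two shared bits over four distinct bits. Generating molecules with low similarity to the reference corresponds to small values of this ratio.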

Key Findings
  • Reconstruction performance (validation set of 137 molecules): RA_0 achieved the highest Reconstruction_acc (0.825), Token_acc_mean (0.887), and Tanimoto_similarity_mean (0.893), followed by RA_13 (0.708, 0.876, 0.888) and R_0 (0.547, 0.741, 0.852). Models including concentration/IE as conditions (e.g., RA_2) suffered accuracy drops, likely due to data imbalance in concentration and IE.
  • New molecule qualities (reconstruction-based generation): RA_13 balanced validity, uniqueness, and novelty with Valid_ratio 0.925, Unique_ratio 0.784, Valid_new_ratio 0.224, none_CAS_number_ratio 0.714. R_0 showed Valid_ratio 0.9, Unique_ratio 0.778, Valid_new_ratio 0.283. Transformer models T_9 and T_13 had lower validity and uniqueness but higher novelty (Valid_new_ratio 0.364 and 0.261, respectively).
  • Property differences: R_0, R_13, T_0, T_2, and RA_13 yielded reconstructed molecules with small mean differences (near zero) in MolWt and LogP, and also in Vdw_volume and Electronegativity, indicating property preservation. R_2, R_9, T_9, and T_13 showed larger deviations, aligning with their higher novelty.
  • IE differences (reconstructed molecules): IE_diff values fell mostly within −0.2 to 0.2 across models. Although R_2 created many new molecules, many of its gains were tied simply to increased carbon count. T_9 and T_13 more effectively identified novel molecules with IE improvements (IE_diff > 0) over the originals. For example, CN(C)c1ccc(/C=C/C2=NN(C(N)=S)C(c3ccc(N(C)C)cc3)C2)cc1 reached IE 0.939 at 0.1 mM (IE_diff 0.468 versus the original at a higher concentration), indicating potential to reduce concentration while maintaining high IE.
  • Batch generation (target IE difference 0.1 at 1 mM, 1000 samples): R_2 produced the most qualifying molecules (Gaussian 39, uniform 41), followed by R_13 (21 Gaussian, 16 uniform). Gaussian noise generally outperformed uniform, likely due to alignment with the VAE’s standard normal latent prior. Models with very high reconstruction accuracy (e.g., RA_0, RA_13) or low Valid_ratio tended to regenerate originals or invalid strings, reducing yield of qualifying novel molecules.
  • Overall, RNNAttn without conditions (RA_0) and with extended conditions (RA_13) excelled at accurate reconstruction; Transformer-based models (T_9, T_13) excelled at novelty and finding higher-IE candidates at lower concentrations during reconstruction-based screening. RNN with minimal conditions (R_2) was most effective for condition-targeted batch generation.
  • Discovery: The framework identified novel inhibitors with high IE at low concentrations, including [ethoxy(methoxy)phosphoryl]-phenylmethanol and (alpha-(methylamino)benzyl)phosphonic acid monoethyl ester.

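The validity and novelty figures reported above follow the effectiveness metrics defined in the Methodology. A minimal sketch, assuming all three ratios share the total number of generated strings as denominator and using a stand-in validity check in place of RDKit SMILES parsing:

```python
def generation_metrics(generated, is_valid, training_set):
    """Compute Valid_ratio, Unique_ratio, and Valid_new_ratio for a
    batch of generated SMILES strings. `is_valid` stands in for RDKit
    parsing; `training_set` is the set of training SMILES."""
    n = len(generated)
    if n == 0:
        return {"Valid_ratio": 0.0, "Unique_ratio": 0.0, "Valid_new_ratio": 0.0}
    valid = [s for s in generated if is_valid(s)]
    unique_valid = set(valid)                      # de-duplicated valid strings
    new = unique_valid - set(training_set)         # valid, unique, and unseen
    return {
        "Valid_ratio": len(valid) / n,
        "Unique_ratio": len(unique_valid) / n,
        "Valid_new_ratio": len(new) / n,
    }
```

For a batch of four strings with one invalid, one duplicate, and one training-set match, the sketch yields 0.75, 0.5, and 0.25 for the three ratios respectively, mirroring how the tabulated values trade off against one another.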
Discussion

The study demonstrates that a VAE-based conditional generative model can learn chemically valid latent representations of corrosion inhibitors and generate novel candidates with targeted attributes. Across architectures, RNNAttn achieved the best reconstruction fidelity, indicating strong retention of structural information, while Transformer-based models, though less accurate in reconstruction, explored chemical space more broadly, yielding higher novelty and more candidates with predicted IE improvements at reduced concentrations. Random Forest analyses showed that readily computed chemical descriptors (MolWt, LogP, TPSA, QED, etc.) and concentration strongly influence IE predictions, while adding quantum chemical features did not improve predictive accuracy and increased computational cost. Consequently, conditioning on chemical descriptors is a pragmatic and effective strategy. The observed inverse relation between structural similarity and IE difference suggests that substantial IE gains require exploring latent space away from the reference molecule, consistent with the batch-generation results, in which diversity-favoring models generated more improved candidates. Collectively, these findings support using Inhibitor_Mol_VAE both to reconstruct and to propose inhibitors with high predicted efficiency at lower dosages, addressing the need to search vast chemical spaces more efficiently than discriminative screening allows.

Conclusion

This work presents Inhibitor_Mol_VAE, a conditional VAE framework (RNN, RNNAttn, and Transformer variants) for generating corrosion inhibitor molecules with targeted inhibition efficiency. Using a curated dataset of 1368 inhibitors with descriptors and concentrations, the method encodes SMILES into a continuous latent space and decodes to valid molecules while conditioning on molecular properties. Key contributions include: (i) establishing effective tokenization and weighting for SMILES; (ii) systematically comparing architectures and conditioning levels; (iii) demonstrating that descriptor-based conditioning suffices for targeted generation; (iv) discovering candidate inhibitors with higher predicted IE at lower concentrations; and (v) providing a batch generation pipeline leveraging Gaussian noise for efficient candidate discovery. Future work will focus on synthesizability assessment and laboratory validation (surface analysis, weight-loss, electrochemical testing) of top candidates and on extending models to composite inhibitors and broader environmental conditions.

Limitations
  • Data imbalance in concentrations (primarily 0.001–5 mM with peaks at <1 mM) and IE (peak 0.8–0.9) affected conditional models (e.g., RA_2) and may bias generation toward common conditions.
  • Limited dataset size (N=1368) constrains generalization and the diversity of learnable chemistries; some classes are underrepresented.
  • IE improvements are assessed via an RF predictor trained on descriptors rather than experimental validation; prediction errors (R² ~0.69 for descriptor model) propagate to candidate ranking.
  • Quantum chemical features, while potentially informative, were not used in generation due to computational complexity and mixed predictive benefit; this may limit capturing subtle electronic effects.
  • Trade-off between reconstruction fidelity and novelty: models with high reconstruction accuracy tend to produce fewer novel candidates; Transformer models with higher novelty showed lower validity/uniqueness in some settings.
  • CAS-based novelty estimates may be incomplete; lack of CAS does not guarantee novelty or synthesizability.
  • Stereochemistry handling and SMILES tokenization choices may influence chiral correctness and chemical validity of outputs.