Introduction
Corrosion of metals and alloys poses a significant global economic problem, estimated to cost over US$1 trillion annually. Hexavalent chromium (chromate) has long been the industry standard for corrosion inhibition due to its exceptional performance in passivating various metals. However, chromate is a known human carcinogen, prompting the urgent need for effective and safe chromate-free alternatives. While numerous chromate-free strategies exist, few match chromate's cost-effectiveness, passivation capabilities, and broad applicability across different metals and alloys. Previous reviews suggest that a single, direct 'drop-in' replacement for chromate is unlikely. This research addresses the challenge of identifying chromate alternatives by leveraging natural language processing (NLP) and machine learning to analyze a vast corpus of scientific literature. The rapidly expanding body of materials science knowledge, largely published as unstructured text, presents a significant hurdle for traditional data analysis methods; NLP offers a powerful tool for extracting meaningful information from this textual data. This study investigates the application of unsupervised NLP techniques, specifically Word2Vec and BERT, to automate the search for chromate replacements without human intervention. Word2Vec, a word embedding method, creates vector representations of words, so that similar words can be identified by their proximity in the vector space. BERT (Bidirectional Encoder Representations from Transformers) tokenizes text into words and sub-words, enabling context-dependent representations and the handling of out-of-vocabulary words.
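As a concrete illustration of the word-embedding idea, the sketch below trains a small gensim Word2Vec model on a hypothetical tokenized corpus and retrieves the terms whose vectors lie closest to 'chromate'. The corpus contents and parameter values are illustrative assumptions, not the study's actual data or settings.

```python
# Illustrative sketch only: a toy tokenized corpus standing in for the
# preprocessed abstracts; contents are hypothetical.
from gensim.models import Word2Vec

toy_corpus = [
    ["chromate", "conversion", "coatings", "passivate", "aluminium", "alloys"],
    ["cerium", "nitrate", "acts", "as", "a", "corrosion", "inhibitor"],
    ["lithium", "carbonate", "coatings", "protect", "aluminium", "alloys"],
]

# Skip-gram embeddings; "similar" words are those whose vectors have the
# highest cosine similarity (smallest cosine distance) to the query word.
model = Word2Vec(sentences=toy_corpus, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("chromate", topn=5))
```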
Literature Review
Existing research on chromate alternatives has explored several promising candidates, including rare-earth-based inhibitors and coatings, vanadate-based coatings used in aerospace, lithium-containing coatings, organic coatings, nanocomposites, phosphate coatings, and metal-rich primers. However, these alternatives often lack the comprehensive performance and wide applicability of chromate. The application of text mining and NLP to materials science has shown promise in extracting structure-property relationships and predicting new materials. Unsupervised machine learning algorithms, such as clustering, have been used to create machine-learning datasets and to identify noisy data. Word2Vec, an unsupervised word embedding method, has been used to uncover structure-property relationships in materials; however, its context-independent representations and inability to handle out-of-vocabulary (OOV) words limit its capabilities. This study utilizes both Word2Vec and BERT to overcome these limitations. BERT's ability to learn deep bidirectional representations from unlabelled text and to handle OOV words makes it a powerful tool for this task.
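The sub-word mechanism that addresses the OOV limitation can be seen with a standard Hugging Face tokenizer. The general-purpose bert-base-uncased tokenizer is used below purely for illustration; the study fine-tuned a chemistry-domain model instead.

```python
# Minimal sketch of BERT-style sub-word (WordPiece) tokenization: a term the
# tokenizer has never seen as a whole word is split into known sub-word
# pieces rather than discarded, unlike in a fixed-vocabulary Word2Vec model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("praseodymium oxide conversion coating"))
```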
Methodology
The study utilized a dataset of 5990 papers on corrosion protection, extracted from a database of over 80 million records using Scopus APIs. The selection criteria focused on papers containing terms related to chromate replacement, corrosion inhibitors, and relevant substrate materials, and abstracts were filtered to include only English-language publications. Data preprocessing involved removing XML tags and irrelevant words. For the Word2Vec model, element symbols and names, numbers, and units were converted into tokens, material formulas were normalized alphabetically, and the text was converted to lowercase, except for chemical formulas and abbreviations. For the BERT model, sub-word tokenization was used, and the pre-trained chemical-bert-uncased model from Hugging Face was fine-tuned using masked language modeling. The Word2Vec model was trained with the gensim library, with hyperparameters optimized by a grid search using an analogy-score evaluation metric. The BERT model's hyperparameters (number of epochs, batch size, learning rate) were tuned using a development dataset and evaluated based on perplexity. Six masked sentences were used to probe the BERT model for chromate alternatives. The top 1000 results from both models were then analyzed to identify relevant materials (chemicals, compounds, etc.) related to corrosion protection, excluding common terms and non-material terms. The identified materials were categorized and compared to a benchmark list of chromate replacements derived from three independent expert reviews; the benchmark list covered 20 categories of chromate alternatives. The Word2Vec model's predictions were ranked by cosine distance to the 'chromate' vector, while the BERT model's predictions were obtained by filling the masked token in each of the six sentences.
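A minimal sketch of the masked-sentence probing step is shown below, assuming a fine-tuned masked-language model saved at a hypothetical local path; the probe sentence is an invented example, since the study's six masked sentences are not reproduced here. The corresponding Word2Vec query amounts to retrieving the nearest neighbours of the 'chromate' vector by cosine distance (e.g. gensim's most_similar with topn=1000), as sketched earlier.

```python
# Hedged sketch of probing a fine-tuned masked-language model for candidate
# chromate replacements. The model path and probe sentence are placeholders.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="path/to/finetuned-chemical-bert")  # hypothetical path

probe = "chromate can be replaced by [MASK] for corrosion protection of aluminium alloys."
candidates = fill_mask(probe, top_k=1000)  # rank vocabulary entries for the masked slot

# Inspect the highest-scoring suggestions; in the study, common and
# non-material terms were filtered out before comparison with the benchmark list.
for c in candidates[:10]:
    print(c["token_str"], round(c["score"], 4))
```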
Key Findings
The Word2Vec model identified 54 relevant materials as potential chromate replacements, with 45 (83.3%) matching the benchmark list. The BERT model, across the six masked sentences, identified a total of 161 relevant suggestions, with the number of relevant results varying significantly depending on the structure of the masked sentence; the number of relevant suggestions from each model is shown in Figure 2. The first three masked sentences yielded significantly more relevant results than the last three. BERT identified some low-frequency and even zero-frequency materials (out-of-vocabulary words) not present in the training dataset, demonstrating its ability to generate novel suggestions. There was a 19% overlap between the materials identified by Word2Vec and BERT. The BERT model successfully identified materials in all 20 benchmark categories, whereas the Word2Vec model missed four categories. The Word2Vec model exhibited a higher benchmark-related rate (83.3%), but the BERT model demonstrated broader coverage across the benchmark categories (Figures 3 and 4 compare the two models against the benchmark list). The input sentence structure significantly influenced the BERT model's prediction performance; sentences with nouns as the masked words generally yielded better results. Analysis of the four benchmark categories uniquely identified by BERT revealed that, although these materials did not explicitly co-occur with 'chromate' in the training data, they were contextually linked and considered relevant replacements.
Discussion
The study demonstrates the capability of NLP to automate the search for chromate replacements, eliminating the need for manual literature review. While Word2Vec offered high accuracy in identifying known alternatives through its cosine-similarity measure, BERT showed superior breadth by uncovering both low-frequency and novel candidates through contextual understanding. This difference highlights the strengths of each approach: Word2Vec excels at identifying materials similar to known chromate replacements, while BERT excels at identifying broader, contextually relevant alternatives, even those never explicitly mentioned alongside 'chromate' in the corpus. The performance difference also reflects how each model is queried: Word2Vec yields a single ranked list of nearest neighbours to the 'chromate' vector, whereas BERT can be probed repeatedly with differently phrased masked sentences, which inherently favours BERT's ability to explore the space of potential chromate alternatives more widely. This suggests that BERT, by learning from context and producing instance-specific sentence embeddings, develops a broader interpretation of which materials are relevant to corrosion protection. Future work should focus on expanding the training corpus to include patents, websites, and other data sources to further improve model performance.
Conclusion
This study successfully applied NLP to automate the search for chromate replacements in corrosion protection. Both Word2Vec and BERT effectively identified known chromate alternatives from the literature, eliminating the need for manual review. BERT demonstrated superior breadth, identifying more candidates, including low-frequency and novel suggestions. This initial demonstration highlights the potential of NLP for accelerating materials discovery and addressing complex engineering challenges. Future research could expand the training data to enhance performance and explore integrating materials properties and corrosion mechanisms into the models.
Limitations
The study's reliance on the Scopus API might limit the comprehensiveness of the dataset. The evaluation was primarily based on the match rate with a pre-defined benchmark list, which might not fully capture the potential of the models to identify novel, yet effective chromate replacements. The performance of the BERT model showed dependence on the structure of the input sentence, suggesting further improvements may be achievable through refined prompting strategies.