Searching for chromate replacements using natural language processing and machine learning algorithms


S. Zhao and N. Birbilis

This study, conducted by Shujing Zhao and Nick Birbilis, showcases how natural language processing and machine learning can uncover alternatives to chromate for corrosion protection. By analyzing an extensive database of scientific literature, the research reveals how automated interpretation can yield expert insights in materials science.

Introduction
The paper addresses the challenge of finding replacements for hexavalent chromium (chromate), a historically effective but carcinogenic corrosion inhibitor used across metals such as Al, Zn, and Mg. With corrosion costs estimated at roughly 3% of global GDP, industry requires alternatives that approach chromate's cost-effectiveness, passivation strength, and applicability across alloys. Prior reviews suggest no single drop-in replacement exists, though many systems (rare-earth, vanadate, lithium-containing, organic, phosphate, nanocomposite, and metal-rich primer technologies) show promise. Given that most materials knowledge exists as unstructured text, the authors explore whether unsupervised NLP, with no human in the loop, can automatically interpret the corrosion literature to surface chromate alternatives. They compare Word2Vec (context-independent word embeddings) with BERT (a contextual, subword-tokenized, masked language model) to test whether NLP can achieve expert-level insight in identifying viable replacement strategies.
Literature Review
The study builds on text mining in materials science and NLP methods for unstructured literature. Word2Vec has previously uncovered latent structure–property relations and predicted new materials (e.g., thermoelectrics), but it produces a single, context-independent vector per word and cannot handle out-of-vocabulary (OOV) words. BERT, pre-trained via masked language modeling on large corpora with subword (WordPiece) tokenization, yields contextual embeddings and handles OOV terms by composing known subwords. Prior work established supervised NLP methods (SVMs, Bayesian networks, maximum entropy models, conditional random fields), but most open literature lacks labels, motivating unsupervised approaches. Clustering and unsupervised word embeddings have proven useful in materials science; here, Word2Vec and BERT are applied in an unsupervised fashion to the corrosion domain to identify chromate alternatives and benchmark them against expert-curated categories.
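To make the OOV point concrete, here is a minimal sketch using the public bert-base-uncased tokenizer as a stand-in for the chemical-domain tokenizer described later; the example words are arbitrary and the exact subword splits depend on the vocabulary.

```python
# Illustration of WordPiece subword tokenization handling out-of-vocabulary
# terms; bert-base-uncased stands in for the chemical-domain tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare chemical term absent from the vocabulary is decomposed into known
# subword pieces instead of being dropped, so it still receives an embedding.
print(tokenizer.tokenize("vanadate"))
print(tokenizer.tokenize("chromate"))
```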
Methodology
Data collection and preprocessing: From 84 million Scopus records, 5990 entries related to corrosion protection were collected using title/abstract/keyword queries combining 'chrom' with 'replace*' or 'substitute', then filtered by 'alumin*', 'zinc', 'magnesium', 'alloy', 'steel', or 'iron' to restrict results to relevant substrates. Non-English, copyright-limited, missing-content, and non-peer-reviewed items were removed, yielding 1812 abstracts for training. XML formatting/tags and leading words (e.g., 'Abstract') were stripped. For Word2Vec, elements, numbers, and units were replaced with placeholder tokens (#element, #nUm, #unit). Chemical formulas were normalized alphabetically and recognized via ChemDataExtractor, pymatgen, regular expressions, and rule-based heuristics. Text was lowercased except for chemical formulas and abbreviations. For BERT, a chemical-domain tokenizer with >250k tokens performed subword tokenization; the special tokens [CLS] and [SEP], together with positional and segment embeddings, were used.

Word2Vec training: The model was implemented with gensim following Tshitoyan et al. Hyperparameters were tuned via grid search on a development set (10% of the data), evaluated by analogy score on 14,042 materials-science analogy pairs. The selected hyperparameters were a learning rate of 0.001, embedding size 300, batch size 128, and 30 epochs. Chromate alternatives were ranked by cosine similarity to the 'chromate' vector, and the top-1000 nearest terms were filtered for materials/chemicals relevance.

BERT pretraining and fine-tuning: Training started from chemical-bert-uncased, a checkpoint pretrained from SciBERT on >40,000 chemical technical documents and >13,000 chemistry-related Wikipedia articles. The model was fine-tuned with masked language modeling (MLM), an unsupervised objective, on the corrosion abstracts. Hyperparameters were tuned on a 10% development set using perplexity; the grid covered epochs {10, 20, 30}, batch sizes {16, 32}, and learning rates {1e-5, 1e-4, 1e-3}. The configuration with the lowest perplexity was selected, and full fine-tuning was then performed on the processed abstracts. Six fill-mask prompts were designed to elicit potential chromate alternatives: (1) 'hexavalent chromium can be replaced by [MASK]'; (2) 'hexavalent chromium may be replaced by [MASK]'; (3) 'chromate can be replaced by [MASK]'; (4) 'the best corrosion inhibitor is [MASK]'; (5) '[MASK] performed better than chromate'; (6) 'the best conversion coating is [MASK]'. For each prompt, the top-1000 predictions were collected and filtered to materials/chemicals relevant to corrosion.

Evaluation and benchmarking: The analogy score guided Word2Vec tuning, and perplexity guided the MLM fine-tuning. To assess chromate-replacement coverage, the filtered predictions were compared against an expert-derived benchmark list of 20 chromate-replacement categories compiled from three independent sources not used in training. Predictions were tallied per category to assess coverage against the benchmarks. Illustrative code sketches of these steps follow.
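As a concrete illustration of the preprocessing, the following sketch shows the placeholder substitution; the placeholders (#element, #nUm, #unit) are from the paper, while the regex, the element/unit subsets, and the helper name normalize_token are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the placeholder-token preprocessing for Word2Vec.
# The placeholders (#element, #nUm, #unit) follow the paper; the regex,
# the element/unit subsets, and this helper are illustrative assumptions.
import re

ELEMENTS = {"Al", "Zn", "Mg", "Cr", "Ti", "Fe", "Ce", "Zr"}  # small subset for illustration
UNITS = {"mm", "um", "nm", "ppm", "wt%", "mpa"}              # small subset for illustration

def normalize_token(token: str) -> str:
    """Map raw tokens onto shared placeholder tokens before training."""
    if token in ELEMENTS:
        return "#element"
    if re.fullmatch(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", token):
        return "#nUm"
    if token.lower() in UNITS:
        return "#unit"
    return token

# Two toy sentences standing in for the 1812 cleaned abstracts.
abstracts = [
    "chromate conversion coatings on Al were replaced by Ce coatings at 50 ppm",
    "Zn rich primers protected steel better than chromate in 3.5 wt% NaCl",
]
tokenized_abstracts = [[normalize_token(t) for t in a.split()] for a in abstracts]
```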
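The Word2Vec step might then be sketched as follows with gensim, which the paper reports using. The embedding size and epoch count are the reported values; passing the reported learning rate as gensim's alpha argument, and setting min_count=1 so the toy corpus above runs, are assumptions.

```python
# Hedged sketch of Word2Vec training and cosine-similarity ranking with
# gensim. Embedding size 300 and 30 epochs are values reported in the paper;
# mapping the reported learning rate of 0.001 onto `alpha` is an assumption.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_abstracts,  # normalized token lists from the sketch above
    vector_size=300,
    alpha=0.001,
    epochs=30,
    min_count=1,  # keep all tokens so the toy corpus runs; not a reported setting
    workers=4,
)

# Rank terms by cosine similarity to the 'chromate' vector; the paper keeps
# the top-1000 neighbors and filters them for materials/chemical relevance.
neighbors = model.wv.most_similar("chromate", topn=1000)
for term, score in neighbors[:10]:
    print(f"{term}\t{score:.3f}")
```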
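A rough sketch of the unsupervised MLM fine-tuning and perplexity-based selection, assuming the Hugging Face transformers and datasets libraries; the hub ID, input file name, and output directory are placeholders, and only the grid values come from the text.

```python
# Rough sketch of unsupervised MLM fine-tuning with perplexity-based model
# selection, assuming Hugging Face transformers/datasets. The hub ID, file
# name, and output directory are assumptions; grid values come from the text.
import math

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "recobo/chemical-bert-uncased"  # assumed hub ID for chemical-bert-uncased
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# 'abstracts.txt' (assumed name): one cleaned corrosion abstract per line.
dataset = load_dataset("text", data_files={"train": "abstracts.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True),
                      batched=True, remove_columns=["text"])
split = dataset.train_test_split(test_size=0.1)  # 10% development set, as in the paper

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="corrosion-bert",
        num_train_epochs=20,             # one grid point of {10, 20, 30}
        per_device_train_batch_size=16,  # one grid point of {16, 32}
        learning_rate=1e-5,              # one grid point of {1e-5, 1e-4, 1e-3}
    ),
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer),  # 15% masking by default
)
trainer.train()
perplexity = math.exp(trainer.evaluate()["eval_loss"])  # lower is better across the grid

trainer.save_model("corrosion-bert")         # persist the selected model...
tokenizer.save_pretrained("corrosion-bert")  # ...and tokenizer for fill-mask querying
```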
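Querying the fine-tuned model with the six fill-mask prompts could then look like the following; the prompts are quoted from the paper, while the local model path carries over from the fine-tuning sketch.

```python
# Sketch of the fill-mask querying step. The six prompts are quoted from the
# paper; the local model directory is the one saved in the sketch above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="corrosion-bert")

prompts = [
    "hexavalent chromium can be replaced by [MASK]",
    "hexavalent chromium may be replaced by [MASK]",
    "chromate can be replaced by [MASK]",
    "the best corrosion inhibitor is [MASK]",
    "[MASK] performed better than chromate",
    "the best conversion coating is [MASK]",
]

# Collect the top-1000 predictions per prompt; the paper then filters these
# manually to materials/chemicals relevant to corrosion protection.
suggestions = {}
for prompt in prompts:
    predictions = fill_mask(prompt, top_k=1000)
    suggestions[prompt] = [p["token_str"].strip() for p in predictions]
```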
Key Findings
- From each model's top-1000 outputs, Word2Vec identified 54 relevant materials as potential chromate alternatives. BERT's six masked prompts yielded between 30 and 85 relevant materials each (counts per prompt: 79, 80, 85, 36, 43, 30), totaling 161 unique relevant suggestions across all six.
- Overlap between the Word2Vec and BERT relevant suggestions was 19%.
- Word2Vec's 54 relevant suggestions included 45 that matched benchmark categories, an 83.3% benchmark-related rate. The first three BERT prompts outperformed the latter three by nearly 2× in benchmark-related outputs, with sentence structure affecting performance (prompts placing a noun at [MASK] tended to work better).
- Category coverage: Word2Vec missed four benchmark categories (trivalent chromium, titanium conversion coatings, zinc-based coatings, calcium-based systems). BERT covered all 20 benchmark categories at least once. Examples surfaced only by BERT included trivalent chromium coatings (fluoride, hydroxide, HF, ceramic; frequencies 25/52/24/291), titanium conversion coatings (titanium, Ti; 300/253), zinc-based coatings (zinc, Zn; 298/443), and calcium-based systems (calcium, Ca; 87/83).
- BERT surfaced low-frequency and zero-frequency terms (e.g., CVD, corpus frequency 1; terms like 'formaldehyde' and 'acrylate' absent from the corrosion corpus) owing to subword tokenization and transfer from domain pretraining, whereas all Word2Vec results appeared at least once in the corpus.
- Non-benchmark terms among the Word2Vec outputs included epoxy, PVD, hard chromium plating, diamond-like, sol, neodymium, lanthanum, clays, magnesium, and Nd.
- Overall, BERT provided broader discovery (more relevant suggestions and full category coverage), while Word2Vec aligned more tightly with the expert benchmarks on a per-suggestion basis.
Discussion
The central question, whether unsupervised NLP can automatically identify viable chromate-replacement strategies from the literature without expert curation, is answered in the affirmative. Word2Vec, by leveraging proximity to 'chromate' in embedding space, effectively recovered known benchmark alternatives with high precision relative to established expert lists. BERT, using contextual masked predictions, explored the corrosion domain more broadly, surfacing low/zero-frequency and indirectly associated candidates and thereby covering all 20 benchmark categories. This demonstrates complementary strengths: Word2Vec excels at retrieving established chromate-adjacent knowledge, while BERT uncovers contextually relevant, less explicitly connected, or OOV candidates. The structure of masked prompts significantly influences BERT's outputs; prompts that place the mask in a noun position reduce noise and improve material-centric predictions. Together, these findings indicate that unsupervised NLP can replicate much of the expert literature-mining process and extend it by proposing plausible, context-derived alternatives beyond direct co-occurrence with chromate, advancing automated discovery in corrosion protection.
Conclusion
The study demonstrates that unsupervised NLP can automate the search for chromate replacements in the scientific literature. Word2Vec, via cosine similarity to 'chromate', retrieved benchmark-aligned alternatives most accurately, while BERT, through masked language modeling with chemical-domain pretraining and corrosion fine-tuning, produced the broadest set of relevant suggestions, including low/zero-frequency terms and full coverage of the 20 expert-defined categories. Together, the two models captured essentially all expert-curated chromate-replacement technologies without domain-specific supervision, indicating NLP's capacity to summarize and extend expert knowledge. Future work should broaden the sources (e.g., patents, websites) and integrate mechanistic descriptors (materials properties, corrosion mechanisms) to enrich predictions and identify novel alternatives beyond current benchmarks.
Limitations
- The corpus was limited to Scopus-accessible abstracts retrieved by filtered queries; full-text content, patents, and web sources were excluded, potentially narrowing coverage.
- English-only filtering and removal of certain items (e.g., copyright-limited ones) may bias the dataset.
- Evaluation relied on comparison to a benchmark category list; the absence of ground-truth labels limits quantitative metrics beyond analogy scores and perplexity.
- Manual post-filtering to retain materials/compounds introduces some subjectivity.
- BERT prompt design affects outputs; comparing six prompts to a single Word2Vec run is not strictly like-for-like.
- The results identify candidates but do not experimentally validate their performance as chromate replacements.