Alloy synthesis and processing by semi-supervised text mining
W. Wang, X. Jiang, et al.
Weiren Wang, Xue Jiang, Shaohan Tian, Pei Liu, Turab Lookman, Yanjing Su, and Jianxin Xie present a semi-supervised text mining method that extracts synthesis and processing parameters from superalloy articles, improving predictive models and linking synthesis routes to alloy structure.
Introduction
The study addresses how to extract and leverage synthesis and processing information at scale to guide alloy design, particularly for superalloys where microstructure and properties are strongly affected by processing routes. While machine learning has accelerated materials discovery, the vast combinatorial space of compositions and processing pathways makes identifying optimal synthesis routes challenging. Prior text-mining advances show promise, but limited annotated corpora and the complexity of procedural language hinder accurate extraction of actions and parameters. The authors propose a semi-supervised text mining approach that minimizes annotation requirements to extract sequences of synthesis/processing actions and their parameters from the literature, enabling downstream modeling of structure–processing relationships such as γ′ precipitate coarsening.
Literature Review
Early materials data extraction relied on manual curation. Recent NLP pipelines automatically extract compositions and properties for various materials, including superalloys and aluminum alloys. For synthesis procedures, deep learning NER/IE approaches require substantial labeled data; e.g., Kim et al. trained a neural network on ~20 oxide-materials articles (F1 ≈ 81% for synthesis parameters), and Kononova et al. annotated 834 solid-state synthesis paragraphs to train a BiLSTM-CRF model (F1 ≈ 90% on operation entities). Huo et al. extracted qualitative topics from procedures but not detailed parameters. The authors’ prior SuperalloyDigger pipeline used rule-based NER and heuristic IE to capture compositions and properties from 14,425 superalloy articles, but extending rules to complex, flexible synthesis entities is cumbersome. Supervised DL needs extensive annotation and is domain-sensitive. Thus, a semi-supervised strategy leveraging small seed sets and large unlabeled corpora is motivated for robust extraction of sequential actions and parameters in alloy synthesis.
Methodology
Corpus acquisition and preprocessing: Full texts of 16,604 superalloy articles (published up to 2022) were collected via Elsevier Scopus/ScienceDirect APIs and ChemDataExtractor’s extended scraper. Text was parsed into paragraphs; tables were parsed via the prior SuperalloyDigger pipeline.
Paragraph classification: 1,885 paragraphs from 80 articles were labeled (synthesis-related vs. not) and split 90/10 for train/test. A logistic regression classifier (scikit-learn) using TF-IDF features (with simple section-title heuristics) achieved F1 ≈ 96.35%. Applying the model identified 14,487 paragraphs as synthesis/procedure related.
Semi-supervised action dictionary generation: Two complementary components were developed.
(1) Token-level action recommendation: Starting from 20 expert-provided seed actions, candidate entities were collected using POS and contextual syntactic templates. Word2Vec and FastText embeddings, pre-trained on the ~16k-article corpus, were used to compute cosine similarities to the seeds. A voting scheme with thresholds (optimized around w=0.46 for Word2Vec and fw=0.7 for FastText) recommended candidates; entities receiving >2 votes were manually screened and added to the seeds iteratively until convergence. This yielded 717 token-level action entities (including seeds).
(2) Chunk-level action bootstrapping: This component extends multi-level bootstrapping to noun/verb phrases, with improved scoring. From 20 seed phrases, context patterns (before/after windows matched to syntactic templates) were extracted and ranked using frequency-reliability scores. Candidate entities were scored by the number and strength of matched patterns, with hyperparameters optimized (T0=2, k=1). Iterative selection produced 1,199 chunk-level action entities.
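The token-level voting step can be illustrated with a minimal sketch. It assumes one vote per (embedding model, seed) pair whose cosine similarity reaches the model's threshold; the summary does not say exactly how votes are counted, so that rule, the function names, and the toy vectors are all hypothetical. Only the thresholds (0.46, 0.7) and the ">2 votes" cutoff come from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def count_votes(candidate, seeds, models):
    """Count votes for a candidate word.

    ASSUMPTION: one vote per (model, seed) pair whose similarity meets the
    model's threshold. `models` is a list of (word -> vector dict, threshold).
    """
    votes = 0
    for vectors, threshold in models:
        if candidate not in vectors:
            continue
        for seed in seeds:
            if seed in vectors and cosine(vectors[candidate], vectors[seed]) >= threshold:
                votes += 1
    return votes

def recommend(candidates, seeds, models, min_votes=3):
    """Return candidates with more than 2 votes, for manual screening."""
    return [c for c in candidates if count_votes(c, seeds, models) >= min_votes]

# Toy 2-D vectors standing in for Word2Vec / FastText embeddings (made up).
toy_word2vec = {"aged": (1.0, 0.1), "quenched": (0.9, 0.2),
                "annealed": (0.95, 0.15), "sample": (0.0, 1.0)}
toy_fasttext = {"aged": (0.8, 0.1), "quenched": (0.9, 0.1),
                "annealed": (0.85, 0.1), "sample": (0.1, 1.0)}
models = [(toy_word2vec, 0.46), (toy_fasttext, 0.7)]

recommended = recommend(["annealed", "sample"], ["aged", "quenched"], models)
```

In this toy setup "annealed" is close to both seeds in both embedding spaces and is recommended, while "sample" is not. In the described workflow, recommended words would still be screened by hand before joining the seed set.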
Named Entity Recognition (NER): Using the generated action dictionary, sentences underwent POS tagging, POS-based screening, and relaxed matching to recognize token- and chunk-level action mentions expressed in varied forms (e.g., aged, aging treatment, two-step aging). Parameters with units (e.g., temperatures, times) were extracted via regular expressions.
Dependency parsing and tuple construction: Stanford CoreNLP dependency parsing linked recognized actions to subjects (samples) and parameters (e.g., via nsubj and obl relations). Entities were normalized (e.g., VBD/VBN to id+Ved, VBG to id+Ving, NPs to id+NP) to enable robust pattern interpretation. Extracted tuples formed ordered action sequences within paragraphs.
Interdependency resolution: Compositions extracted from tables (20,368 instances) were associated with action sequences using rules reflecting superalloy writing conventions: link to preceding alloy names in the paragraph if present; otherwise, associate all table-listed alloys when none are explicitly mentioned in the paragraph.
Comparative DL baseline: A BiLSTM-CRF NER model trained on 1,308 annotated sentences (five-fold cross-validation) provided a baseline; the semi-supervised method was evaluated on 348 randomly selected sentences.
Data analysis and modeling: The extracted dataset was visualized for temporal/action-category coverage and action transition probabilities (Markov chains for casting, wrought, powder metallurgy). Correlations among solution/aging temperatures, γ′ solvus, and composition were analyzed. Symbolic regression (gplearn GPSR) on 137 records with γ′ size, composition (Co, Al, W, Ni, Ti, Cr, Ta, Mo, Re, Nb), and processing conditions (solution temperature/time, aging temperature/time) generated 60,000 candidate formulas; a prominent factor of the form (At·Sc^0.5)^0.5 + Ac was identified (At: aging time; Sc: solution temperature; Ac: aging temperature). ML regressors (SVR, GBR, RFR, KRR, KNN, AdaBoost, etc.) were trained on the original features and on a transformed feature space that includes the derived factor; performance was compared via repeated 5-fold cross-validation and held-out testing on newly reported (2023) alloys and three lab-synthesized alloys.
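The regular-expression step for parameters with units, mentioned under Named Entity Recognition, might look roughly like the sketch below. The pattern, the unit list, and the function name are illustrative assumptions, not the authors' actual expressions, and the example sentence is invented.

```python
import re

# ASSUMPTION: a small, hypothetical unit vocabulary; real procedures use many more.
UNIT_KINDS = {"°C": "temperature", "K": "temperature",
              "h": "time", "min": "time", "s": "time", "MPa": "pressure"}

# A number (integer or decimal), optional whitespace, then one known unit.
# The trailing \b keeps "5 samples" from being read as "5 s".
PARAM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|MPa|min|K|h|s)\b")

def extract_parameters(sentence):
    """Return (value, unit, kind) tuples in the order they appear."""
    return [(float(value), unit, UNIT_KINDS[unit])
            for value, unit in PARAM_RE.findall(sentence)]

sentence = ("The alloy was solution treated at 1250 °C for 4 h "
            "and aged at 870 °C for 24 h.")
params = extract_parameters(sentence)
# params holds two temperature/time pairs, in sentence order:
# (1250.0, "°C"), (4.0, "h"), (870.0, "°C"), (24.0, "h"), each tagged with its kind.
```

A pattern like this only finds the values. Attaching each value to the right action (here, "solution treated" versus "aged") is the job of the dependency-parsing step described above.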
Key Findings
- Scale of extraction: 55,206 actions captured from 16,604 articles; 13,211 of these actions linked to explicit parameters; 20,368 composition instances extracted. Interdependency resolution yielded 9,853 complete records combining compositions with ordered action sequences and parameters.
- Semi-supervised dictionary: 717 token-level and 1,199 chunk-level action entities generated from minimal seeds (20 each), using Word2Vec/FastText recommendation and improved multi-level bootstrapping.
- Extraction performance (validated on 30 randomly sampled articles/348 sentences):
  - Action NER (overall): Precision 90.58%, Recall 88.03%, F1 89.28% (vs BiLSTM-CRF F1 74.95%).
  - Parameters with units: Precision 98.49%, Recall 94.91%, F1 96.67% (vs BiLSTM-CRF F1 86.98%).
  - Token-level actions: Precision 92.85%, Recall 90.28%, F1 91.55%.
  - Chunk-level actions: Precision 88.23%, Recall 85.71%, F1 86.95%.
  - Action+parameter tuples: Precision 81.31%, Recall 84.09%, F1 82.68%.
- Temporal/action coverage: Heatmap shows rising activity since 2013 and increased reporting of quenching, aging, cutting, solution treatment, cooling, polishing.
- Composition–temperature relations: Alloys with low Cr and high (Ti+Ta) generally use higher solution temperatures (>1250 °C) than high-Cr, low-(Ti+Ta) alloys, consistent with γ′ solvus increasing with Ti/Ta and decreasing with Cr.
- γ′ solvus positioning: For each alloy, γ′ solvus temperatures lie between solution and aging temperatures, consistent with known rules.
- Process sequencing: Markov-chain analysis reveals common sequences and transitions across casting, wrought, and powder metallurgy routes (e.g., casting sequences often include solidification → solution → aging → quench, with bidirectional repetition among cool, solution, and aging steps).
- Synthesis factor discovery and predictive impact: Symbolic regression identified a salient factor (At·Sc^0.5)^0.5 + Ac (At: aging time; Sc: solution temperature; Ac: aging temperature) positively correlated with γ′ size. Incorporating this factor into ML models reduced average RMSE on new 2023 alloys and three in-house alloys from 143.63 (original features) to 83.00 (transformed feature space), a reduction of about 42%.
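The process-sequencing finding rests on transition probabilities between consecutive actions. A minimal sketch of a first-order Markov estimate is given below. The function name and the three toy sequences are made up for illustration; they are not records from the extracted dataset.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next action | current action) from ordered action sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent (current, next) pair once.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

# Hypothetical action sequences, loosely modelled on a casting route.
toy_sequences = [
    ["solidification", "solution", "aging", "quench"],
    ["solidification", "solution", "cool", "aging"],
    ["solution", "aging", "quench"],
]
probs = transition_probabilities(toy_sequences)
# Here "solution" is followed by "aging" in 2 of 3 cases and by "cool" in 1 of 3.
```

Fitting one such table per route (casting, wrought, powder metallurgy) would give route-specific transition maps of the kind the summary describes.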
Discussion
The semi-supervised pipeline effectively extracts complex, sequential synthesis and processing actions and their parameters from unstructured superalloy literature with minimal labeled data. By surpassing a BiLSTM-CRF baseline and achieving high precision/recall, it addresses the bottleneck of limited annotations and domain-specific variability. Linking actions to compositions enables analyses that reflect expert intuition—such as transition probabilities among processing steps and the placement of γ′ solvus between solution and aging temperatures. Crucially, the text-mined dataset supports discovery of an explicitly expressed synthesis factor, (At·Sc^0.5)^0.5 + Ac, that captures key processing influences on γ′ coarsening. Incorporating this factor enhances ML prediction accuracy for γ′ size on prospective alloys, directly addressing the research goal of connecting synthesis routes to microstructure. The approach thus complements data-driven materials design by providing scalable, structured synthesis knowledge and enabling interpretable, physics-consistent features for models.
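For concreteness, the factor (At·Sc^0.5)^0.5 + Ac can be computed and appended to a feature row as in the sketch below. The summary does not state the units used for time and temperature, so the input values are arbitrary, and the function and variable names are hypothetical.

```python
import math

def synthesis_factor(aging_time, solution_temp, aging_temp):
    """Evaluate (At * Sc**0.5)**0.5 + Ac for one record.

    ASSUMPTION: inputs are in whatever consistent units the dataset uses;
    the summary does not specify them.
    """
    return math.sqrt(aging_time * math.sqrt(solution_temp)) + aging_temp

def add_factor(features):
    """Return a copy of a feature dict with the derived factor appended."""
    augmented = dict(features)
    augmented["synthesis_factor"] = synthesis_factor(
        features["aging_time"], features["solution_temp"], features["aging_temp"]
    )
    return augmented

# Arbitrary illustrative values: sqrt(16 * sqrt(1296)) + 900 = sqrt(576) + 900 = 924.
row = add_factor({"aging_time": 16, "solution_temp": 1296, "aging_temp": 900})
```

Because the aging temperature enters additively while time and solution temperature sit under nested square roots, the factor grows linearly with aging temperature and only slowly with aging time.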
Conclusion
This work demonstrates a semi-supervised text-mining pipeline that, with minimal seeds and no extensive annotations, extracts ordered synthesis/processing actions and their parameters and links them to alloy compositions at scale. From 16,604 superalloy articles, the method produced 9,853 complete synthesis–composition records and 55,206 actions overall, achieving strong NER and linkage performance and outperforming a BiLSTM-CRF baseline. Analyses of the extracted data yield meaningful scientific insights (e.g., composition-dependent solution temperatures, γ′ solvus placement, process transition patterns) and enable identification of a synthesis factor, (At·Sc^0.5)^0.5 + Ac, that significantly improves predictive modeling of γ′ size. The open-source code and web toolkit facilitate reuse and adaptation to other alloy systems. Future work could incorporate implicit actions, cross-sentence linkage, and figure-derived parameters, and explore integration with large language models to further enhance extraction and reasoning across composition–process–structure relationships.
Limitations
- Implicitly expressed actions (e.g., elliptical or coreferential mentions like “the aged samples…”) are not captured.
- Dependency parsing (Stanford CoreNLP) can mislink subjects, actions, and parameters under certain sentence structures.
- Action–parameter tuples spanning multiple sentences are not handled.
- Processing parameters described only in figures are not incorporated.
- The BiLSTM-CRF baseline (used only for comparison) relied on a single annotator and limited labeled data, which may constrain cross-domain generalization of supervised models.