Automated pipeline for superalloy data by text mining
W. Wang, X. Jiang, et al.
This study by Weiren Wang and colleagues introduces a natural language processing pipeline that extracts composition and property data on superalloys from the scientific literature. The pipeline yields 2,531 records linking composition to properties, and these support a predictive model of γ′ solvus temperature with a mean relative error of 2.27% on 15 subsequently reported alloys.
Introduction
Artificial intelligence and machine learning are transforming materials research by shifting from trial-and-error to data-driven methodologies. High-quality, well-characterized data is crucial for AI/ML success. Besides laboratory-generated data, the scientific literature contains vast peer-reviewed and reliable data; however, manual extraction is time-consuming and impractical at the scale of tens of thousands of articles. With the rapidly growing literature, automated, rapid, and accurate data extraction has become necessary. Superalloys, critical for turbine applications, require knowledge of multicomponent compositions, phase transition temperatures, and synthesis conditions for design. Given the ~20,000 articles on superalloys, automated extraction and assimilation of existing literature data is essential to accelerate data-driven superalloy design. This work proposes an automated NLP pipeline to capture both chemical composition and property data from 14,425 journal articles, enabling global analysis. The pipeline utilizes rule-based named entity recognition and a heuristic distance-based multiple-relation extraction algorithm that requires no labeled samples, alongside a table parsing and relation extraction module. The system achieves high F1 scores and compiles extracted tuples into a structured materials database, which is then used to analyze trends and build predictive models for γ′ solvus temperature, including validation via synthesis of predicted Co-based superalloys.
Literature Review
Prior work in chemistry and materials NLP has introduced pipelines for automatic extraction of chemical substances using techniques like named entity recognition (NER) and relation extraction, employing dictionary look-ups, rule-based, and ML approaches. Kim et al. used neural and parse-based methods to extract synthesis parameters from over 640,000 articles (F1 ~81%). ChemDataExtractor recognizes chemical entities and extracts relations using dictionaries, ML, and grammar rules, and has been used with modified Snowball algorithms to extract Curie and Néel temperatures (precision ~73%) from ~68,000 articles. However, terms like Curie/Néel also appear in non-magnetic contexts, complicating extraction. Supervised deep learning methods require extensive labeled data, and even semi-supervised methods like Snowball need seed labels, making simultaneous high precision and recall challenging, particularly for smaller, specialized corpora like superalloys.
Methodology
The automated text mining pipeline comprises: (1) document retrieval and preprocessing, (2) text classification and table parsing, (3) named entity recognition (NER), (4) text and table relation extraction, and (5) interdependency resolution to merge composition and property data into complete records.
- Retrieval and preprocessing: 14,425 superalloy articles (plain text, XML, HTML) were programmatically archived primarily via Elsevier Scopus and ScienceDirect APIs, plus CrossRef and Web of Science for DOI lists. Preprocessing isolated relevant domains, converted XML/HTML to plain text paragraphs, parsed tables into nested cell lists, removed metadata, and normalized value-unit formats using regular expressions.
- Text classification: A rule-based approach assisted by a synonym dictionary identifies sentences containing target properties. A Word2Vec CBOW model trained on ~14,000 full texts generated candidate synonyms (e.g., for “solvus”), which were manually curated. Regular expressions detected numeric values and units (°C, K).
- Table parsing and classification: XML/HTML tables were extracted (using a modified table_extractor or pandas) to a list-of-rows format with captions. Tables were classified as composition or property based on caption keywords (e.g., “composition”, “solvus temperature”).
- Rule-based NER: Tailored regular expressions recognize alloy named entities (e.g., compositional forms like 8Al1W2Mo, designations like IN738LC, or generic forms like “alloy 718”), property specifiers (e.g., γ′ solvus temperature, density), and property values with units (e.g., 1050 °C, 850–950 °C). Nine alloy naming patterns and multiple value/unit patterns are used to label positive spans in text and tables.
- Text relation extraction: Relations are modeled as quaternary tuples <DOI, alloy name, property specifier, property value>. The principal challenge is extracting multiple relations from a single sentence. A distance-based algorithm that needs no labeled data is introduced. Here n, k, and p are taken to denote the numbers of alloy, value, and property-specifier entities in a sentence. (a) Shortest-distance matching applies when the counts differ (n≠k or p≠k): it greedily pairs each alloy entity with the nearest value entity and links the pair to the closest property specifier. (b) Sequential matching applies when the counts are balanced with a single property (e.g., n=k with p=1): it pairs entities in order of appearance. Distances are computed from entity positions (token indices).
- Table relation extraction: After detecting table orientation (by row or by column) using the positions of element headers or property labels, alloy names (from captions or body) are indexed and joined with element/property headers and corresponding cell values to output tuples <DOI, alloy name, property specifier (element), property value (content)>.
- Interdependency resolution: Composition tuples <DOI, alloy, element, content> and property tuples <DOI, alloy, property, value> are merged into complete 6-tuples <DOI, alloy, element, content, property_specifier, property_value> using a divide-and-conquer strategy: first intersect by DOI, then by alloy named entity (handling abbreviations via “full name(abbreviation)” patterns). Pronouns and abbreviations are resolved where possible.
- Baselines and evaluation: A BiLSTM-CRF NER model and ChemDataExtractor serve as NER baselines; Snowball and modified Snowball are relation extraction baselines. Precision, recall, and F1 scores are used for all classification/extraction tasks. The pipeline is released as open-source (GitHub) and a web toolkit.
- ML model for γ′ solvus temperature: From 743 complete records (composition + γ′ solvus), data were cleaned (preference for table over text values, experimental over calculated, mode selection for conflicting reports, averaging ranges, unit harmonization). Several regressors were compared via grid search and cross-validation; SVR with RBF kernel performed best. The model was validated on 15 later-reported superalloys and used to design and experimentally validate three new Co-based superalloys.
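The two matching strategies of the text relation extraction step can be illustrated with a minimal Python sketch. It is not the authors' released code: the function and type names, the decision to use each value at most once, and the choice to pick the property specifier closest to the value (rather than to the alloy) are all assumptions made for illustration, and the DOI and entity values are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[str, int]              # (surface text, token index in the sentence)
Relation = Tuple[str, str, str, str]  # <DOI, alloy, property specifier, value>


def position(entity: Entity) -> int:
    """Token index used as the distance metric."""
    return entity[1]


def extract_relations(doi: str,
                      alloys: List[Entity],
                      specifiers: List[Entity],
                      values: List[Entity]) -> List[Relation]:
    """Sketch of distance-based multiple-relation extraction for one sentence."""
    if not (alloys and specifiers and values):
        return []

    alloys = sorted(alloys, key=position)
    values = sorted(values, key=position)

    # (b) Sequential matching: as many alloys as values, one property.
    if len(alloys) == len(values) and len(specifiers) == 1:
        spec_text = specifiers[0][0]
        return [(doi, a[0], spec_text, v[0]) for a, v in zip(alloys, values)]

    # (a) Shortest-distance matching: greedy nearest-value pairing.
    # Assumption: a value is consumed once it has been paired.
    relations: List[Relation] = []
    unused = list(values)
    for alloy in alloys:
        if not unused:
            break
        value = min(unused, key=lambda v: abs(position(v) - position(alloy)))
        unused.remove(value)
        # Assumption: the specifier is chosen by its distance to the value.
        spec = min(specifiers, key=lambda s: abs(position(s) - position(value)))
        relations.append((doi, alloy[0], spec[0], value[0]))
    return relations


# Hypothetical sentence: "The solvus of alloy A and alloy B is 1050 C and 1100 C."
example = extract_relations(
    "10.0000/example",
    alloys=[("alloy A", 5), ("alloy B", 7)],
    specifiers=[("solvus", 1)],
    values=[("1050 C", 9), ("1100 C", 11)],
)
for relation in example:
    print(relation)
```

Because both branches rely only on token positions, no labeled training sentences are needed, which is the property the summary highlights for small, specialized corpora.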
Key Findings
- Corpus and throughput: From 14,425 articles, within ~3 hours, the pipeline extracted 12,703 composition tuples and 5,136 property instances across γ′ solvus (1,259), density (2,296), solidus (793), and liquidus (788); 2,531 records successfully matched composition with properties.
- NER performance: For alloy named entities, the rule-based NER achieved precision 90.58%, recall 93.60%, F1 92.07% (545 sentences), outperforming BiLSTM-CRF (F1 42.91%) and ChemDataExtractor (F1 24.86%). Property value recognition attained precision 85.71%, recall 99.25%, F1 91.98% (845 sentences).
- Text relation extraction: On 329 sentences, the distance-based method achieved precision 75.86%, recall 83.22%, F1 79.37%, exceeding Snowball (best F1 33.21%) and modified Snowball (best F1 43.28%). 680 γ′ solvus instances were extracted from text.
- Table extraction: From 5,441 parsed tables (5,327 composition; 114 solvus), 12,703 composition tuples and 579 property tuples were extracted. Manual validation on 45 articles yielded precision 90.89%, recall 100%, F1 95.23%.
- Interdependency resolution: For γ′ solvus, 743 complete records (composition + property) were formed from 12,703 composition and 1,259 solvus instances.
- Overall property extraction quality: For γ′ solvus, density, solidus, and liquidus, average precision 83.67%, recall 93.08%, F1 88.13%.
- Materials insights: Trends consistent with superalloy behavior were recovered: higher γ′ solvus with higher Ni and Ta; Co-9Al-xW shows increasing γ′ solvus with W; Ti addition raises γ′ solvus in Co-Ni-Al-Mo alloys; observed variability across reports for identical alloys (e.g., Co-9Al-10W: 980–1060 °C).
- Predictive modeling and validation: An SVR (RBF) model trained on cleaned data predicted γ′ solvus for 15 subsequently reported superalloys with mean relative error 2.27%. Three designed Co-based superalloys—Co-36Ni-12Al-2Ti-1W-4Ta-4Cr, Co-36Ni-12Al-2Ti-1W-4Ta-6Cr, Co-12Al-4.5Ta-35Ni-2Ti—were synthesized and characterized, showing measured γ′ solvus of 1251 °C, 1239.3 °C, and 1263 °C, with mean relative error 0.81% versus predictions.
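The F1 scores listed above are the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the short Python sketch below recomputes F1 from the reported precision and recall for four of the tasks; the function name and dictionary layout are illustrative choices, while the numbers are taken from the findings above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as listed in the key findings.
reported = {
    "alloy NER": (90.58, 93.60, 92.07),
    "property value NER": (85.71, 99.25, 91.98),
    "text relation extraction": (75.86, 83.22, 79.37),
    "table extraction": (90.89, 100.0, 95.23),
}

for task, (p, r, f1) in reported.items():
    print(f"{task}: computed F1 = {f1_score(p, r):.2f}%, reported F1 = {f1}%")
```

For each of these four tasks the recomputed value agrees with the reported F1 to two decimal places.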
Discussion
The pipeline addresses the core challenge of automatically extracting linked composition-property data from a relatively small, domain-specific corpus without relying on large labeled datasets. The rule-based NER tailored to superalloy nomenclature and the distance-based, label-free multiple-relation extraction algorithm together deliver high recall and precision, outperforming common baselines (BiLSTM-CRF, Snowball variants) under limited data conditions. Robust table parsing and relation extraction provide additional high-fidelity data, enabling the assembly of thousands of structured records. The extracted database reveals known materials trends—such as Ni/Ta increasing γ′ solvus and W enhancing γ′ solvus in Co-9Al-xW—corroborating the utility and validity of the automatically extracted information. Variations in reported values across studies highlight the importance of capturing data provenance and uncertainties. Crucially, the structured dataset enables predictive modeling: the SVR model generalizes to new literature-reported alloys with low error and successfully guides the discovery of new Co-based superalloys with high γ′ solvus, validated experimentally. This demonstrates how literature-derived data, made accessible via NLP, can inform materials design and accelerate discovery.
Conclusion
This work introduces an automated NLP pipeline that downloads and preprocesses scientific documents, classifies text and tables, performs rule-based NER, executes label-free distance-based text relation extraction and robust table relation extraction, and resolves interdependencies to produce structured 6-tuples linking composition and properties. Across 14,425 superalloy articles, the system extracted 12,703 composition tuples and 5,136 property instances, producing 2,531 matched records with strong performance (e.g., alloy NER F1 92.07%; relation extraction F1 up to 79.37%; table extraction F1 95.23%). The compiled database enabled analyses consistent with known superalloy behavior and powered a predictive SVR model for γ′ solvus with low errors on subsequent literature and on newly designed Co-based alloys validated via synthesis and differential scanning calorimetry (DSC). The open-source code and web toolkit provide a generalizable framework for alloy data extraction. Future work should incorporate synthesis/processing parameters, handle cross-sentence and ranged value expressions more accurately, and scale to additional properties and materials classes to further enhance predictive modeling and materials discovery.
Limitations
- Residual errors and duplicate records may occur; manual cleaning is still required for downstream use.
- Property values expressed as ranges (e.g., “between ... and ...”) are not fully captured; current handling averages explicit numeric ranges but misses some linguistic forms.
- Cross-sentence relations where complete tuple information spans multiple sentences are not yet resolved.
- Pronouns and abbreviations (e.g., “this alloy”, short forms) impede accurate linkage between property mentions and full composition records.
- Some compositions appear only in text (not tables), limiting successful matching.
- Synthesis/processing conditions and measurement uncertainties are not incorporated, affecting interpretability and generalizability of property comparisons.
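One way the range limitation might be narrowed is to add a pattern for the linguistic form "between X and Y" alongside the explicit numeric form. The Python sketch below is a hypothetical extension, not part of the released pipeline: the pattern names, the supported separators, and the restriction to °C and K are assumptions. It returns the midpoint, mirroring the averaging of ranges described in the data cleaning step.

```python
import re
from typing import Optional, Tuple

NUM = r"(\d+(?:\.\d+)?)"   # integer or decimal number
UNIT = r"(°C|K)"           # assumed set of temperature units

RANGE_PATTERNS = [
    # Explicit numeric range, e.g. "850–950 °C" or "850 to 950 °C".
    re.compile(rf"{NUM}\s*(?:-|–|~|to)\s*{NUM}\s*{UNIT}"),
    # Linguistic range, e.g. "between 980 and 1060 °C"
    # (an optional unit after the first number is tolerated).
    re.compile(rf"[Bb]etween\s+{NUM}\s*(?:(?:°C|K)\s+)?and\s+{NUM}\s*{UNIT}"),
]


def range_midpoint(text: str) -> Optional[Tuple[float, str]]:
    """Return (midpoint, unit) of the first temperature range found, else None."""
    for pattern in RANGE_PATTERNS:
        match = pattern.search(text)
        if match:
            low, high = float(match.group(1)), float(match.group(2))
            return ((low + high) / 2, match.group(3))
    return None


print(range_midpoint("aged at 850–950 °C"))
print(range_midpoint("a solvus between 980 and 1060 °C was reported"))
```

A sketch like this still misses open-ended forms such as "above 1000 °C" and ranges split across sentences, so it would reduce rather than remove the limitation.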