Automated pipeline for superalloy data by text mining

W. Wang, X. Jiang, et al.

Weiren Wang and colleagues introduce a natural language processing pipeline that extracts chemical composition and property data for superalloys from the scientific literature. The pipeline yields 2531 records, which are then used to build a predictive model of γ' solvus temperature and to guide the design of new alloys.

Introduction

Artificial intelligence (AI) and machine learning (ML) are transforming materials research by shifting the paradigm from trial-and-error to a data-driven approach, thereby accelerating the discovery of new materials. High-quality data is crucial for the success of AI/ML methods. Currently, data sources include experimental and calculated results from researchers' own labs and data extracted from published papers. The scientific literature is a rich source of peer-reviewed, reliable data, but manual extraction is time-consuming and labor-intensive, especially considering the tens of thousands of articles written in natural language. The ever-increasing volume of publications makes manual database maintenance challenging for individual researchers. Automated and accurate data extraction methods are therefore becoming increasingly necessary.

Recent research has introduced pipelines for automated data extraction of organic and inorganic chemical substances using natural language processing (NLP) techniques. Named entity recognition (NER) and relation extraction are crucial components of this process. NER methods range from dictionary lookups and rule-based approaches to machine-learned approaches, the latter requiring substantial annotated data for training. Existing tools like ChemDataExtractor have been developed to extract chemical information, but applying them to specific materials like superalloys requires adaptation and may not achieve optimal performance.

Superalloys, widely used in advanced aero engines and industrial gas turbines, require detailed knowledge of their properties (including transition temperatures) and chemical composition for effective alloy design. With approximately 20,000 articles on superalloys, automated data extraction is crucial to accelerate data-driven superalloy design. Supervised deep learning methods for NER or relation extraction need large hand-labeled datasets, while semi-supervised methods like Snowball require labeled seed samples, which makes it difficult to achieve high precision and high recall at the same time.

This paper presents an automated NLP pipeline to capture chemical composition and property data of superalloys. The pipeline uses a rule-based NER method and a heuristic, distance-based algorithm for extracting multiple relations from text, and it requires no labeled samples for a small corpus. A general table parsing and relation extraction algorithm is also developed. The study aims to demonstrate the effectiveness of this approach in extracting relevant data from a corpus of 14,425 journal articles and to utilize the extracted data for predictive modeling and materials design.

Literature Review

Several studies have explored the use of NLP for data extraction from scientific literature. Swain and Cole developed ChemDataExtractor, a toolkit for automated extraction of chemical information. Kim et al. used neural network-based methods to extract synthesis parameters. Court et al. used ChemDataExtractor with a modified Snowball algorithm to extract Curie and Néel temperatures for magnetic materials. However, these studies may not directly address the challenges of extracting complex, multi-property data from superalloy literature, such as handling multiple relations within a sentence, dealing with the different ways information is expressed in text and in tables, and coping with the limited availability of labeled data.

Methodology

The automated text mining pipeline consists of five stages:

1. **Scientific Documents Download and Preprocessing:** The pipeline starts by programmatically downloading scientific articles (in XML, HTML, or plain text format) and preprocessing the raw text to remove irrelevant information such as metadata and to standardize the format of values and units. Elsevier's Scopus and Science Direct APIs were utilized for data retrieval and parsing.
2. **Table Parsing and Text Classification:** Tables and sentences containing relevant information are identified using a combination of rule-based methods, keyword matching, and a pre-trained word embedding model (Word2Vec CBOW) to identify synonyms for target properties like "solvus temperature".
3. **Named Entity Recognition (NER):** This stage recognizes alloy named entities, property specifiers, and property values. A rule-based approach using regular expressions in Python is implemented to identify these entities. The performance of this rule-based NER method was significantly better than that achieved using a BiLSTM-CRF model or ChemDataExtractor, especially given the limited size of the labeled corpus.
4. **Relation Extraction:** This stage aims to identify relationships between the entities recognized in the NER stage. For text, a distance-based algorithm, which does not require labeled data, is used to handle multiple-relation extraction from a single sentence. This algorithm outperforms the Snowball and modified Snowball semi-supervised algorithms in terms of recall and F1 score. For tables, a dedicated algorithm extracts relationships, handling both row-wise and column-wise data organization. The F1 score for table relation extraction was 95.23%.
5. **Interdependency Resolution:** This final stage links chemical composition and property data fragments associated with a specific superalloy. A divide-and-conquer strategy is employed to merge information from different sources (text and tables). The algorithm handles potential difficulties stemming from the use of pronouns or abbreviations as alloy named entities in different parts of the text.

The performance of each stage was evaluated using precision, recall, and F1 score. The BiLSTM-CRF model, ChemDataExtractor, Snowball, and modified Snowball algorithms were used for comparative evaluation.
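
As a rough illustration of the NER and text relation extraction stages, the sketch below combines regular-expression entity recognition with a greedy, one-to-one pairing of alloys and values by character distance within a single sentence. The patterns, the `extract_relations` function, the character-gap distance, and the example sentence (whose numbers are placeholders) are assumptions made for this illustration; they are not the authors' published rules or their exact algorithm.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    text: str
    start: int
    end: int


# Hypothetical patterns, written only for this illustration.
# Alloy names such as "Co-9Al-9W": hyphen-separated element symbols,
# each optionally preceded by a numeric amount.
ALLOY_RE = re.compile(
    r"\b(?:\d+(?:\.\d+)?)?[A-Z][a-z]?(?:-(?:\d+(?:\.\d+)?)?[A-Z][a-z]?)+\b"
)
# Property specifiers for the four properties named in the text.
PROPERTY_RE = re.compile(
    r"γ['′]\s*solvus(?:\s+temperature)?"
    r"|solidus(?:\s+temperature)?"
    r"|liquidus(?:\s+temperature)?"
    r"|density"
)
# Temperature values; the lookbehind keeps digits inside alloy names out.
VALUE_RE = re.compile(r"(?<![\w.-])\d+(?:\.\d+)?\s*(?:°C|K)(?!\w)")


def find_spans(pattern: "re.Pattern[str]", sentence: str) -> List[Span]:
    """Return every match of `pattern` together with its character offsets."""
    return [Span(m.group(0), m.start(), m.end()) for m in pattern.finditer(sentence)]


def gap(a: Span, b: Span) -> int:
    """Number of characters separating two non-overlapping spans."""
    return max(a.start, b.start) - min(a.end, b.end)


def extract_relations(sentence: str) -> List[Tuple[str, str, str]]:
    """Pair each alloy with at most one value, closest pairs first."""
    alloys = find_spans(ALLOY_RE, sentence)
    props = find_spans(PROPERTY_RE, sentence)
    values = find_spans(VALUE_RE, sentence)
    if not props:
        return []

    # All alloy/value candidates, sorted by increasing character gap.
    candidates = sorted(
        (gap(a, v), i, j) for i, a in enumerate(alloys) for j, v in enumerate(values)
    )
    used_alloys, used_values = set(), set()
    found = []
    for _, i, j in candidates:
        if i in used_alloys or j in used_values:
            continue  # each alloy and each value is used only once
        used_alloys.add(i)
        used_values.add(j)
        # Attach the property specifier nearest to the value.
        prop = min(props, key=lambda p: gap(p, values[j]))
        found.append((i, alloys[i].text, prop.text, values[j].text))

    # Report relations in the order the alloys appear in the sentence.
    return [(alloy, prop, value) for _, alloy, prop, value in sorted(found)]


if __name__ == "__main__":
    # Made-up sentence; the temperatures are placeholders, not extracted data.
    example = (
        "Co-9Al-9W has a γ' solvus temperature of 1000 °C, "
        "whereas Co-9Al-10W shows 1050 °C."
    )
    for relation in extract_relations(example):
        print(relation)
```

The one-to-one constraint is what lets a single sentence yield several relations: assigning every value to its nearest alloy independently would attach both temperatures in the example to the second alloy.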

Key Findings

The automated pipeline successfully extracted 2531 records with both chemical composition and physical properties (γ' solvus temperature, density, solidus temperature, and liquidus temperature) from 14,425 articles.

* **NER Performance:** The rule-based NER method achieved an F1 score of 92.07% for alloy named entity recognition, significantly outperforming the BiLSTM-CRF model (42.91%) and ChemDataExtractor (24.86%).
* **Relation Extraction Performance:** The distance-based algorithm for text relation extraction achieved an F1 score of 79.37% for γ' solvus temperature, exceeding the performance of the Snowball algorithm (33.21%) and modified Snowball (43.28%). The table relation extraction tool showed a high F1 score of 95.23%.
* **Predictive Modeling:** A data-driven model was built using the extracted data to predict the γ' solvus temperature. The model accurately predicted the temperatures of 15 superalloys not included in the training data, with a mean relative error of 2.27%. Furthermore, the model was used to design three unexplored Co-based superalloys with predicted γ' solvus temperatures exceeding 1250 °C. Experimental synthesis and characterization confirmed the predictions with a mean relative error of 0.81%.
* **Materials Insights:** Analysis of the extracted data revealed trends consistent with the known behavior of superalloys. The dataset illustrated the effects of different alloying elements on γ' solvus temperature and density. For example, the addition of W in Co-9Al-xW ternary alloys increased the γ' solvus temperature, and Ta additions were shown to improve γ' solvus temperature and density in Co-Ni-Al-Mo based alloys.
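
For readers less familiar with the metrics quoted above, the following minimal sketch shows how precision, recall, F1 score, and mean relative error are conventionally computed. It assumes exact-match scoring of extracted items against a hand-checked reference set and the usual definition of relative error; the function names and the toy inputs are hypothetical, and the authors' evaluation protocol may differ in detail.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def precision_recall_f1(
    extracted: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Score extracted items against a reference set using exact matching."""
    extracted_set, gold_set = set(extracted), set(gold)
    true_positives = len(extracted_set & gold_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


def mean_relative_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Average of |predicted - measured| / |measured| over all samples."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured)) / len(predicted)


if __name__ == "__main__":
    # Toy inputs only; these are not results from the study.
    print(precision_recall_f1({"a", "b", "x"}, {"a", "b", "c", "d"}))
    print(mean_relative_error([1010.0, 990.0], [1000.0, 1000.0]))
```

Multiplying the returned fractions by 100 gives percentages in the form reported above.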

Discussion

The results demonstrate the effectiveness of the proposed NLP pipeline for accurate and efficient data extraction from a relatively small corpus of superalloy literature. The rule-based methods used effectively address the challenges associated with limited labeled data, avoiding the overfitting problems commonly associated with supervised and semi-supervised learning approaches. The high accuracy of the extracted data enabled the development of a predictive model for γ' solvus temperature, showing the potential of this pipeline for accelerating materials discovery. The experimental validation of the model's predictions for previously unexplored Co-based superalloys further supports the reliability and utility of the extracted data. The trends observed in the extracted data are consistent with established understanding of superalloy behavior, indicating the quality of the data and the successful capture of relevant knowledge from the literature.

Conclusion

This work presents a novel automated pipeline for extracting chemical composition and properties of superalloys from scientific literature using NLP techniques. The pipeline achieves high accuracy in NER and relation extraction, even with limited labeled training data. The extracted data is used to build an accurate predictive model for γ' solvus temperature, successfully predicting the properties of new, unexplored alloys. This demonstrates the potential of text mining to accelerate materials discovery by unlocking latent knowledge within existing publications. The open-source availability of the code and web-based toolkit promotes broader adoption and adaptation to other alloy systems. Future work could incorporate additional properties and experimental parameters, and address the current limitations in handling data expressed as ranges or distributed across multiple sentences.

Limitations

Despite the success of the pipeline, some limitations exist. The pipeline does not fully handle property values presented as ranges or those distributed across multiple sentences. The data extracted requires manual cleaning to remove errors and duplicates. Furthermore, synthesis and processing conditions, measurement uncertainties, and other experimental parameters are not currently incorporated, which would enhance the database's completeness and the predictive capabilities of the machine learning models. The accuracy of the word embedding model could also be improved by training with a larger corpus.
