Introduction
The discovery of new materials with targeted properties requires an integrated approach combining experiments, theory, and computation. The Composition-Synthesis/Processing-Structure-Property-Performance paradigm guides this exploration. However, the vast materials space, encompassing innumerable combinations of components and processing routes, poses a significant challenge. Machine learning (ML) has emerged as a powerful tool for materials discovery, improving synthesis efficiencies through integration with robotic control and automated experiments. Yet, designing alloys with desired properties requires considering the myriad of processing routes in addition to chemical and structural complexities. This research focuses on using data science to identify optimal synthesis-design routes for producing a desired alloy from specified starting materials.
Early approaches to materials data extraction from scientific articles were primarily manual. Advances in text mining and natural language processing (NLP) enable the conversion of scientific text into machine-learning-oriented data collections. NLP pipelines have been developed for extracting chemical composition and property data from various alloys. However, the complexity and diversity of synthesis and processing descriptions, along with the lack of annotated corpora, hinder accurate and efficient extraction. Deep learning (DL) offers powerful capabilities for analyzing unstructured data, but training DL-based models requires extensive high-quality annotations, which are often lacking for alloy synthesis and processing. Previous studies have demonstrated limited success in extracting detailed processing parameters using DL models.
This paper addresses these challenges by employing a semi-supervised text mining method. This approach leverages a relatively small amount of labeled data and a large amount of unlabeled data to improve model performance, making it possible to achieve high information extraction (IE) performance for superalloy synthesis and processing with less domain-specific expertise and corpus annotation. Specifically, a semi-supervised recommendation algorithm is developed for token-level actions and a multi-level bootstrapping algorithm for chunk-level actions.
Literature Review
Existing literature highlights the challenges of extracting materials synthesis and processing information from scientific text. Manual extraction is time-consuming and inefficient, while deep learning approaches require large, labeled datasets that are often unavailable. Rule-based methods can be effective for specific entity types, but are cumbersome and expensive for the complex and flexible expressions found in alloy synthesis descriptions. Previous work has explored different approaches to text mining for materials science, including using neural networks for parameter recognition and qualitative topic extraction. However, these methods often achieve limited accuracy and generalizability or necessitate extensive manual labeling. This work builds on previous efforts to extract superalloy chemical compositions and properties, addressing the need for a more efficient and accurate method to extract the complex sequential data describing synthesis and processing.
Methodology
The proposed automated text mining pipeline consists of several key stages: article retrieval and preprocessing, paragraph classification, action dictionary generation, named entity recognition (NER), dependency parsing, and interdependency resolution.
**Article Retrieval and Preprocessing:** The raw corpus of superalloy articles (16,604 articles published before 2022) was parsed and organized into paragraphs. Existing tools were used for corpus preprocessing and table parsing.
**Paragraph Classification:** A binary logistic regression classifier was trained on manually labeled paragraphs to identify paragraphs containing synthesis and processing information. Different feature extraction methods (Bag of Words, TF-IDF, BERT) were compared, with TF-IDF yielding the highest F1 score (96.35%).
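The TF-IDF plus logistic regression combination can be illustrated with a dependency-free sketch. This is not the authors' implementation: the six toy paragraphs, the whitespace tokenizer, the smoothed IDF formula, and the training settings (`epochs`, `lr`) are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(paragraphs):
    """Smoothed inverse document frequency for every training token."""
    n = len(paragraphs)
    df = Counter(tok for p in paragraphs for tok in set(tokenize(p)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(paragraph, idf):
    """L2-normalised sparse TF-IDF vector; unseen tokens are ignored."""
    counts = Counter(t for t in tokenize(paragraph) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def score(vec, weights, bias):
    """Logistic (sigmoid) probability that the paragraph is positive."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(vectors, labels, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            grad = score(vec, weights, bias) - y
            bias -= lr * grad
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * grad * v
    return weights, bias

# Toy labeled paragraphs (hypothetical): 1 = synthesis/processing, 0 = other.
labeled = [
    ("the alloy was solution treated at 1200 c for 2 h and aged at 850 c", 1),
    ("ingots were cast by vacuum induction melting and then hot rolled", 1),
    ("samples were aged at 760 c for 16 h followed by air cooling", 1),
    ("the creep resistance of superalloys is important for turbine blades", 0),
    ("figure 3 shows the stress strain curves of the tested samples", 0),
    ("previous studies reported the oxidation behavior of these materials", 0),
]
texts = [t for t, _ in labeled]
labels = [y for _, y in labeled]
idf = fit_idf(texts)
weights, bias = train([tfidf(t, idf) for t in texts], labels)

def is_synthesis(paragraph):
    return score(tfidf(paragraph, idf), weights, bias) >= 0.5

train_predictions = [is_synthesis(t) for t in texts]
print(train_predictions)
print(is_synthesis("the alloy was aged at 800 c for 10 h"))
```

In practice a library implementation of TF-IDF and logistic regression would replace the hand-written functions; the sketch only shows how sparse term weights feed a linear binary classifier.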
**Semi-supervised Action Dictionary Generation:** A semi-supervised method was developed to generate a comprehensive action dictionary from only a small number of seed actions. It combined a token-level action recommendation algorithm with a multi-level bootstrapping algorithm for chunk-level actions. The token-level algorithm used Word2Vec and FastText embeddings to recommend words similar to the seeds, followed by a voting process to select high-confidence actions. The chunk-level step used a modified multi-level bootstrapping algorithm to generate noun and verb phrases for synthesis and processing actions. The generated dictionaries were then used for NER.
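The token-level recommendation and voting idea can be sketched as follows. The two small dictionaries stand in for trained Word2Vec and FastText models, and the 2-D vectors, `top_k`, and `min_votes` values are hypothetical; the exact voting rule used in the study is not specified here, so a simple "recommended by at least `min_votes` models" rule is assumed.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(seed, embeddings, top_k):
    """Return the top_k words most similar to the seed in one embedding model."""
    scored = sorted(
        ((cosine(embeddings[seed], vec), word)
         for word, vec in embeddings.items() if word != seed),
        reverse=True,
    )
    return [word for _, word in scored[:top_k]]

def vote(seeds, models, top_k=2, min_votes=2):
    """Keep candidates recommended by at least min_votes embedding models."""
    votes = Counter()
    for model in models:
        recommended = set()
        for seed in seeds:
            if seed in model:
                recommended.update(recommend(seed, model, top_k))
        for word in recommended - set(seeds):
            votes[word] += 1
    return sorted(word for word, n in votes.items() if n >= min_votes)

# Toy stand-ins for two embedding models (hypothetical vectors).
model_a = {"annealed": (1.0, 0.1), "aged": (0.9, 0.2), "quenched": (0.8, 0.3),
           "observed": (0.1, 1.0), "reported": (0.2, 0.9)}
model_b = {"annealed": (1.0, 0.0), "aged": (0.9, 0.1), "quenched": (0.3, 0.8),
           "observed": (0.0, 1.0), "reported": (0.7, 0.6)}

new_actions = vote(["annealed"], [model_a, model_b])
print(new_actions)  # ['aged']
```

Here "quenched" and "reported" each receive a single vote and are rejected, which shows how agreement between models filters out candidates that only one embedding space considers similar.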
**Named Entity Recognition (NER):** The generated action dictionary was used for NER, employing POS tagging, POS screening, and relaxed matching to recognize action entities in various formats. Regular expressions were used to recognize synthesis parameters. The proposed method outperformed the BiLSTM-CRF model in terms of precision, recall, and F1 score.
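A minimal sketch of dictionary-based action matching and regex-based parameter recognition is given below. The patterns, the three-entry `ACTIONS` list, and the example sentence are assumptions for illustration; "relaxed matching" is approximated by ignoring case and allowing either a space or a hyphen between words, and the POS tagging and screening steps are omitted.

```python
import re

# Hypothetical parameter patterns: a number followed by a unit.
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(°C|K)(?![A-Za-z])")
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(h|min|s)\b")

# Hypothetical action dictionary entries.
ACTIONS = ["solution treated", "aged", "air cooled"]

def find_actions(text, dictionary):
    """Match dictionary entries, tolerating case and space/hyphen variants."""
    hits = []
    for action in dictionary:
        pattern = r"\b" + r"[\s-]+".join(map(re.escape, action.split())) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), action))
    return [action for _, action in sorted(hits)]  # in order of appearance

def find_parameters(text):
    """Extract (value, unit) pairs for temperatures and durations."""
    return {
        "temperature": [(float(v), u) for v, u in TEMPERATURE.findall(text)],
        "duration": [(float(v), u) for v, u in DURATION.findall(text)],
    }

sentence = ("The alloy was solution-treated at 1200 °C for 2 h, "
            "then Aged at 850 °C for 24 h.")
print(find_actions(sentence, ACTIONS))
print(find_parameters(sentence))
```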
**Dependency Parsing:** Dependency parsing was used to establish the relationships between action entities and their parameters. The Stanford CoreNLP package was employed to construct dependency parse trees and extract three-tuple relations (subject, action, parameter).
**Interdependency Resolution:** Rules were developed to link the extracted chemical composition information with the synthesis and processing routines. This involved matching alloy names and compositions from composition tables with corresponding actions in the text.
**Symbolic Regression (SR):** Symbolic regression was used to identify a synthesis factor that correlates with γ′ phase coarsening. Genetic programming SR (GPSR) was employed to generate candidate formulas, with the best formula selected based on mean absolute error (MAE) and complexity. The resulting factor, (A*S^0.5)^0.5 + Ac, was found to significantly improve the performance of the γ′ size prediction model.
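One way to weigh MAE against complexity is to keep only candidates that are not dominated on both criteria (a Pareto front). The sketch below assumes that selection rule; the candidate formulas, the hand-assigned complexity scores, and the toy data are hypothetical and do not come from the study.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pareto_front(candidates):
    """candidates: list of (name, error, complexity).
    A candidate is dropped if another one is at least as good on both
    criteria and strictly better on at least one."""
    front = []
    for name, err, size in candidates:
        dominated = any(
            e2 <= err and s2 <= size and (e2 < err or s2 < size)
            for _, e2, s2 in candidates
        )
        if not dominated:
            front.append(name)
    return front

# Toy data where the target follows sqrt(x) exactly.
x = [1.0, 4.0, 9.0, 16.0]
y = [1.0, 2.0, 3.0, 4.0]

# (name, callable, hand-assigned complexity) for each hypothetical candidate.
formulas = [
    ("x", lambda v: v, 1),
    ("sqrt(x)", lambda v: math.sqrt(v), 2),
    ("0.25*x", lambda v: 0.25 * v, 3),
    ("sqrt(x) + x - x", lambda v: math.sqrt(v) + v - v, 6),
]

scored = [(name, mae(y, [f(v) for v in x]), size) for name, f, size in formulas]
front = pareto_front(scored)
print(front)  # ['x', 'sqrt(x)']
```

The bloated but equally accurate `sqrt(x) + x - x` is discarded in favour of `sqrt(x)`, which is the behaviour a complexity criterion is meant to enforce.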
**Prediction Model for γ′ Size:** Various machine learning regression models were trained to predict γ′ size, using the extracted data and the derived synthesis factor. The models were evaluated using root mean square error (RMSE).
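For reference, RMSE is the square root of the mean squared difference between measured and predicted values. A minimal sketch with made-up sizes in nm:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical measured and predicted sizes (nm), for illustration only.
measured = [100.0, 200.0]
predicted = [110.0, 190.0]
print(rmse(measured, predicted))  # 10.0
```

Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does.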
Key Findings
The semi-supervised text mining method successfully extracted 9,853 superalloy synthesis and processing actions with chemical compositions from a corpus of 16,604 articles. The method achieved high accuracy in action entity recognition (F1 score of 89.28%), outperforming the BiLSTM-CRF model. Data visualization revealed trends in the use of different synthesis actions over time, as well as correlations between solution and aging temperatures and alloy composition. A Markov chain representation captured the transition probabilities between actions in different synthesis processes (casting, wrought, powder metallurgy). Symbolic regression identified a significant synthesis factor, (A*S^0.5)^0.5 + Ac, where A represents aging temperature and time, and S represents solution temperature. This synthesis factor significantly improved the performance of the data-driven γ′ size prediction model, reducing the RMSE from 143.63 nm to 83.00 nm. The improved model accurately predicted γ′ size for newly reported and synthesized superalloys.
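The Markov chain representation can be illustrated by estimating first-order transition probabilities from ordered action sequences. The three routes below are invented for the example, and the counting-based estimator is an assumption about how such probabilities might be computed, not a description of the study's code.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next action | current action) from ordered action lists."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

# Hypothetical extracted routes, each an ordered list of actions.
routes = [
    ["melting", "casting", "solution", "aging"],
    ["melting", "casting", "aging"],
    ["melting", "forging", "solution", "aging"],
]

probs = transition_probabilities(routes)
print(probs["casting"])  # {'solution': 0.5, 'aging': 0.5}
```

Each row of the resulting table sums to one, so it can be read directly as "given this action, how often is each other action the next step".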
Discussion
The findings demonstrate the effectiveness of the proposed semi-supervised text mining approach for extracting detailed and complex synthesis and processing information from scientific literature. The high accuracy achieved in extracting action entities and parameters, along with the identification of a key synthesis factor influencing γ′ phase coarsening, highlights the value of this method for materials science research. The use of semi-supervised learning makes the method efficient and cost-effective, requiring only limited manual annotation. The extracted dataset and derived synthesis factor significantly improved the performance of data-driven models for predicting γ′ size, demonstrating the potential for this method to accelerate materials design and discovery. The visualizations of the extracted data provide valuable insights into the trends and relationships in superalloy synthesis and processing techniques.
Conclusion
This research successfully developed and validated a semi-supervised text mining pipeline for extracting alloy synthesis and processing information. The method's high accuracy and efficiency, coupled with the identification of a key synthesis factor, offer significant advantages for materials science research. Future work could focus on addressing the limitations of the current method, including handling implicitly expressed actions and improving dependency parsing accuracy. The open-source code and web-based toolkit provide valuable tools for the materials science community. The approach used here can be generalized to other alloy systems, enabling the extraction and analysis of large datasets from the scientific literature.
Limitations
The current method has some limitations. It primarily focuses on explicitly expressed actions and parameters, ignoring implicitly expressed information and parameters described in figures. The dependency parser may struggle with complex sentence structures or relationships spanning multiple sentences. The accuracy of the symbolic regression model, while showing a significant improvement, is limited by the quality and size of the dataset and the inherent complexity of predicting the γ′ size. Further improvements could involve addressing the limitations of dependency parsing, incorporating information from figures, and exploring more advanced NLP techniques for handling nuanced language.