Engineering and Technology
A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing
P. Shetty, A. C. Rajan, et al.
Discover how a team of researchers, including Pranav Shetty, developed an automated pipeline that extracts valuable material property data from the ever-growing polymer literature. Using a domain-adapted language model, they processed roughly 130,000 abstracts in about 60 hours, yielding insights relevant to applications such as fuel cells and solar technology.
~3 min • Beginner • English
Introduction
The volume of materials science literature is growing at ~6% annually, and much of its valuable quantitative and qualitative property information remains locked in text that is not machine readable. Manually identifying trends or searching for materials with desired properties across large corpora is challenging and contributes to data scarcity in materials informatics. The authors adapt NLP information extraction techniques, including named entity recognition (NER), relation extraction, co-reference resolution, and entity normalization, to automatically recover material property information from text. Transformer-based, self-supervised language models (e.g., BERT) are effective backbones for such tasks, producing contextual embeddings that support downstream NER and relation extraction with supervised labels. Extending NLP methods to new domains requires tailored ontologies and labeled datasets. The work focuses on polymers, which present challenges such as numerous naming variations and the difficulty of mapping names to structures (e.g., SMILES strings are typically inferred from figures rather than text). The authors propose and demonstrate a general-purpose pipeline that starts with a large materials science corpus, uses an ontology-driven NER model powered by a domain-adapted language model (MaterialsBERT), and combines predictions with heuristic rules to extract material property records from abstracts at scale. They analyze the resulting data across diverse applications (polymer solar cells, fuel cells, supercapacitors), show that known trends emerge from the automatically extracted data, and train an example predictor (glass transition temperature) from the extracted records.
Literature Review
Prior literature provides chemical NER tools and domain-specific NLP systems, including ChemDataExtractor, ChemSpot, and ChemicalTagger, which have enabled automatic construction of databases (e.g., Néel/Curie temperatures, magnetocaloric materials) and semi-automated extraction workflows (e.g., Flory-Huggins parameter in polymers). Word-embedding methods have supported creation of entity-rich documents for annotation and subsequent NER training. Most prior NLP efforts center on inorganic materials and small organic molecules; polymers are less explored and pose unique challenges: extensive name variations requiring normalization and frequent absence of text-to-SMILES convertibility (structures often appear only in figures). Earlier property mining typically targeted specific properties via keyword/regex search rather than general-purpose extraction. Domain-specific BERT variants (e.g., MatSciBERT, MaterialBERT, BioBERT, ChemBERT, MatBERT) have shown benefits on materials NLP tasks; however, few works demonstrate end-to-end, general-purpose extraction of polymer property records at large scale and analysis of the resultant datasets.
Methodology
Corpus and scope: A corpus of ~2.4 million materials science journal articles (HTML) was compiled from major publishers. Abstracts were the focus for extraction due to tractable association of materials and properties. Preprocessing removed HTML tags, mapped superscripts/subscripts to LaTeX-like text, normalized Unicode variants, and later converted scientific notation to floats.
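A minimal sketch of the kind of preprocessing described above (HTML stripping, Unicode normalization, and scientific-notation conversion); the helper names and the regex are illustrative assumptions, not the authors' code:

```python
import re
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content while discarding HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw_html: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw_html)
    return " ".join(parser.parts)

def normalize_text(text: str) -> str:
    # Fold Unicode variants (full-width digits, compatibility characters, etc.).
    return unicodedata.normalize("NFKC", text)

def scientific_to_float(token: str) -> float:
    # Convert notations such as "1.2 x 10^-3" (or plain "1.2e-3") to a float.
    m = re.match(r"([\d.]+)\s*[x×]\s*10\^?(-?\d+)", token)
    if m:
        return float(m.group(1)) * 10 ** int(m.group(2))
    return float(token)

print(scientific_to_float("1.2 x 10^-3"))  # 0.0012
```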
Ontology and annotation: An 8-entity ontology was defined: POLYMER, POLYMER_CLASS, PROPERTY_NAME, PROPERTY_VALUE, MONOMER, ORGANIC_MATERIAL, INORGANIC_MATERIAL, MATERIAL_AMOUNT (all other tokens labeled OTHER). 750 polymer-relevant abstracts (filtered by 'poly' and presence of numerics) were annotated with IO tagging using Prodigy over three rounds with refined guidelines. Pre-annotation dictionaries accelerated labeling. The dataset (PolymerAbstracts) was split 85/5/10 for train/val/test. Inter-annotator agreement on 10 shared abstracts: Fleiss’ κ = 0.885; pairwise Cohen’s κ = 0.906, 0.864, 0.887.
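For illustration, an IO-tagged sentence under this ontology might look like the following (the sentence and labels are invented, not drawn from PolymerAbstracts):

```python
# Under IO tagging each token carries either an entity label or OTHER ("O");
# there are no B-/I- prefixes, so adjacent entities of the same type merge.
tagged = [
    ("Polystyrene", "POLYMER"),
    ("exhibits", "O"),
    ("a", "O"),
    ("glass", "PROPERTY_NAME"),
    ("transition", "PROPERTY_NAME"),
    ("temperature", "PROPERTY_NAME"),
    ("of", "O"),
    ("100", "PROPERTY_VALUE"),
    ("°C", "PROPERTY_VALUE"),
    (".", "O"),
]
```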
Tokenization and model: WordPiece tokenization was used; only the first subword’s embedding per word was fed to the tagger, as standard. The NER model used a BERT-base encoder to produce contextual token embeddings, followed by a single linear layer with softmax; dropout 0.2; cross-entropy loss; max sequence length 512 (longer sequences truncated). Training used Adam (initial lr 5e-5 with linear decay) and early stopping based on validation F1.
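A minimal sketch of this tagging architecture using HuggingFace Transformers and PyTorch; the encoder checkpoint and label count are placeholders (in the paper the encoder is MaterialsBERT and the label set follows the 8-entity ontology):

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 9  # assumption: 8 entity types plus OTHER

class BertTagger(nn.Module):
    """BERT encoder -> dropout -> single linear layer over token embeddings."""
    def __init__(self, encoder_name: str, num_labels: int = NUM_LABELS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))
        loss = None
        if labels is not None:
            # Cross-entropy over tokens; padding and non-first subwords can be
            # excluded by setting their label to -100.
            loss = nn.CrossEntropyLoss(ignore_index=-100)(
                logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertTagger("bert-base-uncased")
enc = tokenizer("Polystyrene has a Tg of 100 C", return_tensors="pt",
                truncation=True, max_length=512)
_, logits = model(enc["input_ids"], enc["attention_mask"])
```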
MaterialsBERT pretraining: Starting from PubMedBERT (trained from scratch on ~14M PubMed abstracts with a biomedical vocabulary), the authors continued pretraining on 2.4M materials science abstracts (spanning many subdomains) using the same masked language modeling and next-sentence prediction objectives and hyperparameters. This continued pretraining ran for 3 epochs (~90 hours on four RTX6000 16GB GPUs). The resulting encoder is MaterialsBERT.
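A sketch of what such continued pretraining looks like with the HuggingFace Trainer, simplified to masked language modeling only (the paper also retains next-sentence prediction); the checkpoint identifier, batch size, and toy corpus are assumptions:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed PubMedBERT checkpoint identifier.
BASE = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)  # NSP head omitted for brevity

# In practice this would be the ~2.4M materials science abstracts.
abstracts = ["Polyethylene films were extruded and annealed at 120 C."]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

ds = Dataset.from_dict({"text": abstracts}).map(
    tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="materialsbert", num_train_epochs=3,
                           learning_rate=5e-5, per_device_train_batch_size=16),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```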
Evaluation: NER performance was assessed via precision, recall, F1 at the entity level (correct only if the entire entity span is correctly labeled).
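A sketch of entity-level scoring under IO tags, where a predicted entity counts only if its full span and type match the gold annotation; this helper is hypothetical, not the authors' evaluation code:

```python
def spans(tags):
    """Collect (start, end, type) spans of contiguous identical non-O tags."""
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if start is not None and (tag == "O" or tag != tags[start]):
            out.append((start, i, tags[start]))
            start = None
        if tag != "O" and start is None:
            start = i
    return set(out)

def entity_f1(gold_tags, pred_tags):
    gold, pred = spans(gold_tags), spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = ["POLYMER", "O", "PROPERTY_NAME", "PROPERTY_NAME", "O", "PROPERTY_VALUE"]
pred = ["POLYMER", "O", "PROPERTY_NAME", "O", "O", "PROPERTY_VALUE"]
print(entity_f1(gold, pred))  # 2 of 3 gold entities matched exactly
```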
End-to-end extraction pipeline (abstract-level):
- Select polymer-relevant documents by searching for 'poly'.
- Run the trained NER model to label entities.
- Filter abstracts that contain at least one of POLYMER/POLYMER_CLASS/MONOMER and both PROPERTY_NAME and PROPERTY_VALUE.
- Entity processing: extract materials, (PROPERTY_NAME, PROPERTY_VALUE) pairs, MATERIAL_AMOUNT separately.
- Co-reference resolution: detect abbreviations via ChemDataExtractor’s abbreviation tool and merge near-duplicate mentions using Levenshtein distance ≤ 1 (suitable for abstracts without long-range anaphora).
- Polymer name normalization: map variants (e.g., poly(ethylene), poly-ethylene) to a canonical name (e.g., polyethylene) using a dictionary of normalized polymer clusters (most common polymers with known name variants).
- Property parsing: associate PROPERTY_NAME and PROPERTY_VALUE by co-occurrence within a context window; parse numbers/units with regex; convert values to standard units; capture reported standard deviations.
- Material amounts: associate MATERIAL_AMOUNT to the nearest material within a context window.
- Relation extraction (heuristic): associate a property pair with the closest material entity in the same sentence; if none is found, associate all materials in the abstract with the property pair (reasonable for compact abstracts), yielding material property records. A condensed sketch of these heuristics follows this list.
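The sketch below illustrates edit-distance merging of near-duplicate mentions, value–unit parsing, and nearest-material association; the unit list, regex, and function names are illustrative assumptions, not the authors' implementation:

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_mentions(mentions, max_dist=1):
    """Cluster near-duplicate mentions (edit distance <= max_dist) under one name."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if levenshtein(m.lower(), cluster[0].lower()) <= max_dist:
                cluster.append(m)
                break
        else:
            clusters.append([m])
    return clusters

# Value/unit parsing for a few example units; real unit handling is broader.
VALUE_UNIT = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|K|MPa|GPa|S/cm|g/mol|%)")

def parse_value(value_text):
    m = VALUE_UNIT.search(value_text)
    return (float(m.group(1)), m.group(2)) if m else None

def nearest_material(materials, prop_index):
    """Associate a property pair with the closest material mention by token offset."""
    return min(materials, key=lambda m: abs(m[1] - prop_index))[0] if materials else None

print(merge_mentions(["PVDF", "PVdF", "polystyrene"]))   # [['PVDF', 'PVdF'], ['polystyrene']]
print(parse_value("165 °C"))                             # (165.0, '°C')
print(nearest_material([("PVDF", 2), ("polystyrene", 20)], prop_index=6))  # 'PVDF'
```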
Code and data availability: MaterialsBERT is hosted on HuggingFace (pranav-s/MaterialsBERT). PolymerAbstracts and pipeline code are available on GitHub (Ramprasad-Group/polymer_information_extraction). Extracted property data can be explored at polymerscholar.org.
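Since the encoder is public on the HuggingFace Hub, it can be loaded directly; a minimal usage sketch (the example sentence is invented):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pranav-s/MaterialsBERT")
encoder = AutoModel.from_pretrained("pranav-s/MaterialsBERT")

inputs = tokenizer("The glass transition temperature of polystyrene is about 100 °C.",
                   return_tensors="pt")
embeddings = encoder(**inputs).last_hidden_state  # contextual token embeddings
```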
Key Findings
Scale and throughput: From ~2.4M articles, ~650k abstracts were polymer-relevant; ~130k of these contained material property data. The pipeline extracted ~300,000 material property records from ~130,000 abstracts in ~60 hours on a single Quadro 16 GB GPU. For context, PoLyInfo contains 492,645 property records curated over years.
NER performance: On PolymerAbstracts, MaterialsBERT achieved precision 62.5%, recall 70.6%, and F1 66.4%, outperforming PubMedBERT (F1 65.8%), MatBERT (65.2%), BioBERT (62.6%), ChemBERT (57.0%), and BERT-base (56.2%). Across four public materials NER datasets, MaterialsBERT was top or competitive, and overall it outperformed the other BERT variants on three of the five datasets considered (PolymerAbstracts plus the four public sets); PubMedBERT was best on ChemDNER, and MatBERT was slightly higher on Inorganic Abstracts.
Dataset composition (examples from Table 4):
- Molecular weight: 9,053 datapoints (neat polymers 9,053), estimated 2,623 unique neat polymers.
- Glass transition temperature (Tg): 6,155 datapoints (neat/blends/composites 4,612/1,036/507), ~1,732 unique neat polymers.
- Electrical conductivity: 6,030 datapoints (3,202/606/2,222), ~1,017 unique neat polymers.
- Tensile strength: 4,382 datapoints (2,679/651/1,052), ~926 unique neat polymers.
- Elongation at break: 1,499 datapoints (954/234/311), ~413 unique neat polymers.
- Application metrics: polymer solar cells (PCE 3,595; Voc 1,386; Jsc 1,049; FF 966); fuel cells (proton conductivity 1,359; areal power density 1,235; areal current density 295; methanol permeability 174); supercapacitors (gravimetric energy density 1,131; gravimetric power density 898).
Recovered scientific trends and insights:
- General properties by polymer class: Polyamides and polyimides tend to have higher Tg. Polyimides show low electrical conductivity (consistent with insulating use) and higher tensile strengths than other classes. Molecular weight reports skew toward lower values, resembling a power-law distribution.
- Mechanical trade-off: Films exhibit the known strength–ductility trade-off: higher tensile strength correlates with lower elongation at break, and vice versa.
- Polymer solar cells: NLP-extracted PCE correlates linearly with short-circuit current; peak PCE ~16.71% (near state-of-the-art). Open-circuit voltages cluster around ~0.85 V with a Gaussian-like distribution. Trends match those from a manually curated dataset.
- Fuel cells: Areal power density vs. current density aligns along a slope ~0.42 V, reflecting typical operating voltage at max current densities. For direct methanol fuel cells, the desirable region of low methanol permeability and high proton conductivity is identifiable, facilitating candidate selection.
- Supercapacitors (Ragone plot): An up-to-date energy vs. power density landscape shows many data points exceeding 10 Wh/kg, with values up to ~100 Wh/kg predominantly from 2020–2021, reflecting recent advances.
- Temporal trends in solar cells: Fullerene-acceptor papers dominate earlier years and decline recently; non-fullerene acceptors increase over time and achieve higher PCEs in recent years, reproducing known field trends.
Annotation reliability: High inter-annotator agreement (Fleiss’ κ 0.885; Cohen’s κ 0.864–0.906).
Discussion
The study demonstrates that a domain-adapted language model (MaterialsBERT) combined with an ontology-driven NER model and lightweight relation heuristics can automatically extract polymer material property records from abstracts at scale, addressing the challenge of non-machine-readable literature. The pipeline not only achieves competitive NER performance but also yields a large, structured dataset that reproduces known physics and device trends—validating the approach and supporting literature-scale insight generation. Compared to manual curation (e.g., PoLyInfo), the automated pipeline rapidly amasses a comparable volume of records, reducing human effort to targeted curation and quality control. The extracted data enable more granular search and discovery (e.g., identifying desirable property regions for devices) and can seed machine learning property predictors, as shown by an example Tg predictor trained on extracted data. The web interface (polymerscholar.org) further translates the pipeline’s outputs into a practical tool for researchers.
Conclusion
This work introduces a general-purpose, end-to-end pipeline for extracting material property records from polymer literature abstracts. Key contributions include: (1) MaterialsBERT, a domain-adapted BERT encoder trained on 2.4M materials abstracts, outperforming baselines on multiple NER datasets; (2) PolymerAbstracts, an ontology and annotated dataset enabling polymer-focused NER; (3) a scalable extraction system that produced ~300k records from ~130k abstracts in ~60 hours; and (4) a demonstration that the extracted data capture meaningful materials trends across general properties and applications (polymer solar cells, fuel cells, supercapacitors). The data are accessible via polymerscholar.org, and code/models are openly released. Future work includes extending extraction to full text (body text, tables, figures) with robust cross-sentence and cross-modal co-reference and relation extraction; enriching the ontology to capture processing and measurement metadata; automated conversion of polymer structures from figures to SMILES for structure-aware modeling; and integration of continuously extracted data with machine learning pipelines for property prediction and materials design.
Limitations
- Current extraction is limited to abstracts; co-referencing across long spans, figures, and tables and cross-sentence relation extraction remain challenging for full-text.
- The ontology focuses on core entities (materials and property name/value), omitting important metadata such as processing conditions, measurement methods, and measurement conditions that can influence reported values.
- Relation extraction uses heuristics rather than supervised relation models; while effective for abstracts, it may misattribute properties in more complex texts.
- Polymer name normalization relies on a dictionary covering common polymers; less common names may remain unnormalized.
- Mapping polymer names to chemical structures (e.g., SMILES) typically requires manual effort; automated, robust image-to-structure conversion for polymers is an open challenge.
- Despite automation, human curation is still needed for quality assurance and to handle edge cases and noise in text-mined data.