Engineering and Technology
A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing
P. Shetty, A. C. Rajan, et al.
Discover how a team of researchers, including Pranav Shetty, developed an automated pipeline that extracts valuable material property data from the ever-growing polymer literature. Using a domain-adapted language model, they processed roughly 130,000 abstracts in about 60 hours, yielding insights relevant to applications such as fuel cells and solar technology.
~3 min • Beginner • English
Introduction
The volume of materials science literature is growing at ~6% annually, and much of its valuable quantitative and qualitative property information remains locked in text that is not machine readable. Manually identifying trends or searching for materials with desired properties across large corpora is challenging and contributes to data scarcity in materials informatics. The authors adapt NLP information extraction techniques, including named entity recognition (NER), relation extraction, co-reference resolution, and entity normalization, to automatically recover material property information from text. Transformer-based, self-supervised language models (e.g., BERT) are effective backbones for such tasks, producing contextual embeddings that support downstream NER and relation extraction with supervised labels. Extending NLP methods to new domains requires tailored ontologies and labeled datasets. The work focuses on polymers, which present challenges such as numerous naming variations and the difficulty of mapping names to structures (e.g., SMILES strings are typically inferred from figures rather than text). The authors propose and demonstrate a general-purpose pipeline that starts with a large materials science corpus, uses an ontology-driven NER model powered by a domain-adapted language model (MaterialsBERT), and combines predictions with heuristic rules to extract material property records from abstracts at scale. They analyze the resulting data across diverse applications (polymer solar cells, fuel cells, supercapacitors), show that known trends emerge from the automatically extracted data, and train an example predictor (glass transition temperature) from the extracted records.
Literature Review
Prior literature provides chemical NER tools and domain-specific NLP systems, including ChemDataExtractor, ChemSpot, and ChemicalTagger, which have enabled automatic construction of databases (e.g., Néel/Curie temperatures, magnetocaloric materials) and semi-automated extraction workflows (e.g., Flory-Huggins parameter in polymers). Word-embedding methods have supported creation of entity-rich documents for annotation and subsequent NER training. Most prior NLP efforts center on inorganic materials and small organic molecules; polymers are less explored and pose unique challenges: extensive name variations requiring normalization and frequent absence of text-to-SMILES convertibility (structures often appear only in figures). Earlier property mining typically targeted specific properties via keyword/regex search rather than general-purpose extraction. Domain-specific BERT variants (e.g., MatSciBERT, MaterialBERT, BioBERT, ChemBERT, MatBERT) have shown benefits on materials NLP tasks; however, few works demonstrate end-to-end, general-purpose extraction of polymer property records at large scale and analysis of the resultant datasets.
Methodology
Corpus and scope: A corpus of ~2.4 million materials science journal articles (HTML) was compiled from major publishers. Abstracts were the focus for extraction due to tractable association of materials and properties. Preprocessing removed HTML tags, mapped superscripts/subscripts to LaTeX-like text, normalized Unicode variants, and later converted scientific notation to floats.
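A minimal sketch of the kind of preprocessing described above (HTML stripping, Unicode normalization, and scientific-notation conversion); the helper names and the regex are illustrative assumptions, not the authors' code:

```python
import re
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content while discarding HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw_html: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw_html)
    return " ".join(parser.parts)

def normalize_text(text: str) -> str:
    # Fold Unicode variants (full-width digits, compatibility characters, etc.).
    return unicodedata.normalize("NFKC", text)

def scientific_to_float(token: str) -> float:
    # Convert notations such as "1.2 x 10^-3" (or plain "1.2e-3") to a float.
    m = re.match(r"([\d.]+)\s*[x×]\s*10\^?(-?\d+)", token)
    if m:
        return float(m.group(1)) * 10 ** int(m.group(2))
    return float(token)

print(scientific_to_float("1.2 x 10^-3"))  # 0.0012
```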
Ontology and annotation: An 8-entity ontology was defined: POLYMER, POLYMER_CLASS, PROPERTY_NAME, PROPERTY_VALUE, MONOMER, ORGANIC_MATERIAL, INORGANIC_MATERIAL, MATERIAL_AMOUNT (all other tokens labeled OTHER). 750 polymer-relevant abstracts (filtered by 'poly' and presence of numerics) were annotated with IO tagging using Prodigy over three rounds with refined guidelines. Pre-annotation dictionaries accelerated labeling. The dataset (PolymerAbstracts) was split 85/5/10 for train/val/test. Inter-annotator agreement on 10 shared abstracts: Fleiss’ κ = 0.885; pairwise Cohen’s κ = 0.906, 0.864, 0.887.
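For illustration, an IO-tagged sentence under this ontology might look like the following (the sentence and labels are invented, not drawn from PolymerAbstracts):

```python
# Under IO tagging each token carries either an entity label or OTHER ("O");
# there are no B-/I- prefixes, so adjacent entities of the same type merge.
tagged = [
    ("Polystyrene", "POLYMER"),
    ("exhibits", "O"),
    ("a", "O"),
    ("glass", "PROPERTY_NAME"),
    ("transition", "PROPERTY_NAME"),
    ("temperature", "PROPERTY_NAME"),
    ("of", "O"),
    ("100", "PROPERTY_VALUE"),
    ("°C", "PROPERTY_VALUE"),
    (".", "O"),
]
```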
Tokenization and model: WordPiece tokenization was used; only the first subword’s embedding per word was fed to the tagger, as standard. The NER model used a BERT-base encoder to produce contextual token embeddings, followed by a single linear layer with softmax; dropout 0.2; cross-entropy loss; max sequence length 512 (longer sequences truncated). Training used Adam (initial lr 5e-5 with linear decay) and early stopping based on validation F1.
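A minimal sketch of this tagging architecture using HuggingFace Transformers and PyTorch; the encoder checkpoint and label count are placeholders (in the paper the encoder is MaterialsBERT and the label set follows the 8-entity ontology):

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 9  # assumption: 8 entity types plus OTHER

class BertTagger(nn.Module):
    """BERT encoder -> dropout -> single linear layer over token embeddings."""
    def __init__(self, encoder_name: str, num_labels: int = NUM_LABELS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))
        loss = None
        if labels is not None:
            # Cross-entropy over tokens; padding and non-first subwords can be
            # excluded by setting their label to -100.
            loss = nn.CrossEntropyLoss(ignore_index=-100)(
                logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertTagger("bert-base-uncased")
enc = tokenizer("Polystyrene has a Tg of 100 C", return_tensors="pt",
                truncation=True, max_length=512)
_, logits = model(enc["input_ids"], enc["attention_mask"])
```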
MaterialsBERT pretraining: Starting from PubMedBERT (trained from scratch on ~14M PubMed abstracts with a biomedical vocabulary), the authors continued pretraining on 2.4M materials science abstracts (spanning many subdomains) using the same masked language modeling and next-sentence prediction objectives and hyperparameters. This continued pretraining ran for 3 epochs (~90 hours on four RTX6000 16GB GPUs). The resulting encoder is MaterialsBERT.
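A sketch of what such continued pretraining looks like with the HuggingFace Trainer, simplified to masked language modeling only (the paper also retains next-sentence prediction); the checkpoint identifier, batch size, and toy corpus are assumptions:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed PubMedBERT checkpoint identifier.
BASE = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)  # NSP head omitted for brevity

# In practice this would be the ~2.4M materials science abstracts.
abstracts = ["Polyethylene films were extruded and annealed at 120 C."]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

ds = Dataset.from_dict({"text": abstracts}).map(
    tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="materialsbert", num_train_epochs=3,
                           learning_rate=5e-5, per_device_train_batch_size=16),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```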
Evaluation: NER performance was assessed via precision, recall, F1 at the entity level (correct only if the entire entity span is correctly labeled).
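A sketch of entity-level scoring under IO tags, where a predicted entity counts only if its full span and type match the gold annotation; this helper is hypothetical, not the authors' evaluation code:

```python
def spans(tags):
    """Collect (start, end, type) spans of contiguous identical non-O tags."""
    out, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if start is not None and (tag == "O" or tag != tags[start]):
            out.append((start, i, tags[start]))
            start = None
        if tag != "O" and start is None:
            start = i
    return set(out)

def entity_f1(gold_tags, pred_tags):
    gold, pred = spans(gold_tags), spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold = ["POLYMER", "O", "PROPERTY_NAME", "PROPERTY_NAME", "O", "PROPERTY_VALUE"]
pred = ["POLYMER", "O", "PROPERTY_NAME", "O", "O", "PROPERTY_VALUE"]
print(entity_f1(gold, pred))  # 2 of 3 gold entities matched exactly
```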
End-to-end extraction pipeline (abstract-level):
- Select polymer-relevant documents by searching for 'poly'.
- Run the trained NER model to label entities.
- Filter abstracts that contain at least one of POLYMER/POLYMER_CLASS/MONOMER and both PROPERTY_NAME and PROPERTY_VALUE.
- Entity processing: extract materials, (PROPERTY_NAME, PROPERTY_VALUE) pairs, MATERIAL_AMOUNT separately.
- Co-reference resolution: detect abbreviations via ChemDataExtractor’s abbreviation tool and merge near-duplicate mentions using Levenshtein distance ≤ 1 (suitable for abstracts without long-range anaphora).
- Polymer name normalization: map variants (e.g., poly(ethylene), poly-ethylene) to a canonical name (e.g., polyethylene) using a dictionary of normalized polymer clusters (most common polymers with known name variants).
- Property parsing: associate PROPERTY_NAME and PROPERTY_VALUE by co-occurrence within a context window; parse numbers/units with regex; convert values to standard units; capture reported standard deviations.
- Material amounts: associate MATERIAL_AMOUNT to the nearest material within a context window.
- Relation extraction (heuristic): associate a property pair with the closest material entity in the same sentence; if none is found, associate all materials in the abstract with the property pair (reasonable for compact abstracts), yielding material property records. A condensed sketch of these heuristics follows this list.
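The sketch below illustrates edit-distance merging of near-duplicate mentions, value–unit parsing, and nearest-material association; the unit list, regex, and function names are illustrative assumptions, not the authors' implementation:

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_mentions(mentions, max_dist=1):
    """Cluster near-duplicate mentions (edit distance <= max_dist) under one name."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if levenshtein(m.lower(), cluster[0].lower()) <= max_dist:
                cluster.append(m)
                break
        else:
            clusters.append([m])
    return clusters

# Value/unit parsing for a few example units; real unit handling is broader.
VALUE_UNIT = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|K|MPa|GPa|S/cm|g/mol|%)")

def parse_value(value_text):
    m = VALUE_UNIT.search(value_text)
    return (float(m.group(1)), m.group(2)) if m else None

def nearest_material(materials, prop_index):
    """Associate a property pair with the closest material mention by token offset."""
    return min(materials, key=lambda m: abs(m[1] - prop_index))[0] if materials else None

print(merge_mentions(["PVDF", "PVdF", "polystyrene"]))   # [['PVDF', 'PVdF'], ['polystyrene']]
print(parse_value("165 °C"))                             # (165.0, '°C')
print(nearest_material([("PVDF", 2), ("polystyrene", 20)], prop_index=6))  # 'PVDF'
```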
Code and data availability: MaterialsBERT is hosted on HuggingFace (pranav-s/MaterialsBERT). PolymerAbstracts and pipeline code are available on GitHub (Ramprasad-Group/polymer_information_extraction). Extracted property data can be explored at polymerscholar.org.
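Since the encoder is public on the HuggingFace Hub, it can be loaded directly; a minimal usage sketch (the example sentence is invented):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pranav-s/MaterialsBERT")
encoder = AutoModel.from_pretrained("pranav-s/MaterialsBERT")

inputs = tokenizer("The glass transition temperature of polystyrene is about 100 °C.",
                   return_tensors="pt")
embeddings = encoder(**inputs).last_hidden_state  # contextual token embeddings
```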
Key Findings
Scale and throughput: From ~2.4M articles, ~650k abstracts were polymer-relevant; ~130k of these contained material property data. The pipeline extracted ~300,000 material property records from ~130,000 abstracts in ~60 hours on a single Quadro 16 GB GPU. For context, PoLyInfo contains 492,645 property records curated over years.
NER performance: On PolymerAbstracts, MaterialsBERT achieved precision 62.5%, recall 70.6%, and F1 66.4%, outperforming PubMedBERT (F1 65.8%), MatBERT (65.2%), BioBERT (62.6%), ChemBERT (57.0%), and BERT-base (56.2%). Across four public materials NER datasets, MaterialsBERT was top or competitive, and overall it outperformed the other BERT variants on three of the five datasets considered (PolymerAbstracts plus the four public sets); PubMedBERT was best on ChemDNER, and MatBERT was slightly higher on Inorganic Abstracts.
Dataset composition (examples from Table 4):
- Molecular weight: 9,053 datapoints (neat polymers 9,053), estimated 2,623 unique neat polymers.
- Glass transition temperature (Tg): 6,155 datapoints (neat/blends/composites 4,612/1,036/507), ~1,732 unique neat polymers.
- Electrical conductivity: 6,030 datapoints (3,202/606/2,222), ~1,017 unique neat polymers.
- Tensile strength: 4,382 datapoints (2,679/651/1,052), ~926 unique neat polymers.
- Elongation at break: 1,499 datapoints (954/234/311), ~413 unique neat polymers.
- Application metrics: polymer solar cells (PCE 3,595; Voc 1,386; Jsc 1,049; FF 966); fuel cells (proton conductivity 1,359; areal power density 1,235; areal current density 295; methanol permeability 174); supercapacitors (gravimetric energy density 1,131; gravimetric power density 898).
Recovered scientific trends and insights:
- General properties by polymer class: Polyamides and polyimides tend to have higher Tg. Polyimides show low electrical conductivity (consistent with insulating use) and higher tensile strengths than other classes. Molecular weight reports skew toward lower values, resembling a power-law distribution.
- Mechanical trade-off: Films exhibit the known strength–ductility trade-off: higher tensile strength correlates with lower elongation at break, and vice versa.
- Polymer solar cells: NLP-extracted PCE correlates linearly with short-circuit current; peak PCE ~16.71% (near state-of-the-art). Open-circuit voltages cluster around ~0.85 V with a Gaussian-like distribution. Trends match those from a manually curated dataset.
- Fuel cells: Areal power density vs. current density aligns along a slope ~0.42 V, reflecting typical operating voltage at max current densities. For direct methanol fuel cells, the desirable region of low methanol permeability and high proton conductivity is identifiable, facilitating candidate selection.
- Supercapacitors (Ragone plot): An up-to-date energy vs. power density landscape shows many data points exceeding 10 Wh/kg, with values up to ~100 Wh/kg predominantly from 2020–2021, reflecting recent advances.
- Temporal trends in solar cells: Fullerene-acceptor papers dominate earlier years and decline recently; non-fullerene acceptors increase over time and achieve higher PCEs in recent years, reproducing known field trends.
Annotation reliability: High inter-annotator agreement (Fleiss’ κ 0.885; Cohen’s κ 0.864–0.906).
Discussion
The study demonstrates that a domain-adapted language model (MaterialsBERT) combined with an ontology-driven NER model and lightweight relation heuristics can automatically extract polymer material property records from abstracts at scale, addressing the challenge of non-machine-readable literature. The pipeline not only achieves competitive NER performance but also yields a large, structured dataset that reproduces known physics and device trends—validating the approach and supporting literature-scale insight generation. Compared to manual curation (e.g., PoLyInfo), the automated pipeline rapidly amasses a comparable volume of records, reducing human effort to targeted curation and quality control. The extracted data enable more granular search and discovery (e.g., identifying desirable property regions for devices) and can seed machine learning property predictors, as shown by an example Tg predictor trained on extracted data. The web interface (polymerscholar.org) further translates the pipeline’s outputs into a practical tool for researchers.
Conclusion
This work introduces a general-purpose, end-to-end pipeline for extracting material property records from polymer literature abstracts. Key contributions include: (1) MaterialsBERT, a domain-adapted BERT encoder trained on 2.4M materials abstracts, outperforming baselines on multiple NER datasets; (2) PolymerAbstracts, an ontology and annotated dataset enabling polymer-focused NER; (3) a scalable extraction system that produced ~300k records from ~130k abstracts in ~60 hours; and (4) a demonstration that the extracted data capture meaningful materials trends across general properties and applications (polymer solar cells, fuel cells, supercapacitors). The data are accessible via polymerscholar.org, and code/models are openly released. Future work includes extending extraction to full text (body text, tables, figures) with robust cross-sentence and cross-modal co-reference and relation extraction; enriching the ontology to capture processing and measurement metadata; automated conversion of polymer structures from figures to SMILES for structure-aware modeling; and integration of continuously extracted data with machine learning pipelines for property prediction and materials design.
Limitations
- Current extraction is limited to abstracts; co-referencing across long spans, figures, and tables and cross-sentence relation extraction remain challenging for full-text.
- The ontology focuses on core entities (materials and property name/value), omitting important metadata such as processing conditions, measurement methods, and measurement conditions that can influence reported values.
- Relation extraction uses heuristics rather than supervised relation models; while effective for abstracts, it may misattribute properties in more complex texts.
- Polymer name normalization relies on a dictionary covering common polymers; less common names may remain unnormalized.
- Mapping polymer names to chemical structures (e.g., SMILES) typically requires manual effort; automated, robust image-to-structure conversion for polymers is an open challenge.
- Despite automation, human curation is still needed for quality assurance and to handle edge cases and noise in text-mined data.