Chemistry
Structured information extraction from scientific text with large language models
J. Dagdelen, A. Dunn, et al.
Unlock the potential of complex scientific knowledge extraction with a simple approach to joint named entity recognition and relation extraction, developed by John Dagdelen and colleagues. Explore how pre-trained language models can be fine-tuned to build extensive databases from research papers in materials chemistry.
~3 min • Beginner • English
Introduction
The paper addresses the challenge of transforming unstructured scientific text into structured knowledge suitable for downstream use (e.g., databases, supervised learning, knowledge graphs). Much of materials science knowledge is dispersed across text, tables, and figures, making it hard to leverage systematically. Prior work has focused largely on named entity recognition (NER), which identifies entities (e.g., materials, properties) but is rarely paired with robust relation extraction (RE) to link those entities. Traditional pipeline methods (NER followed by relation classification) can work on general corpora but struggle to model the complex, hierarchical relationships prevalent in materials science, where properties depend on composition, structure, morphology, processing history, and environment. Enumerating all possible n-ary relations is impractical. Large language models (LLMs) excel at sequence-to-sequence tasks and could enable end-to-end extraction of entities and their interrelations in flexible output formats. The study's purpose is to fine-tune LLMs to jointly extract named entities and their relationships from scientific text into predefined structured schemas (e.g., JSON), enabling accessible, accurate creation of specialized knowledge bases without extensive NLP expertise.
Literature Review
The authors survey advances in materials-focused NLP, noting effective NER for synthesis parameters, experimental results, and spectral data using methods ranging from regular expressions to BiLSTM and BERT models. However, RE, especially for complex, hierarchical relations, remains underexplored in materials science compared to general domains (e.g., people-organization-location-date linking). Prior pipeline RE methods perform document-level RE on general and biomedical corpora but are limited for intricate n-ary relations. Recent joint NERRE seq2seq approaches output tuples with predefined relations but still face limitations for complex hierarchies. In materials science, QA-based extraction (e.g., BatteryBERT) and prompt engineering with ChatGPT have shown promise but are constrained by device-level assumptions, pretraining scale, or a lack of structured hierarchical outputs. Domain-specific LLMs (e.g., the Darwin series) demonstrate broad capability but do not specifically target hierarchical relation extraction beyond their pretraining corpora. This work positions fine-tuned LLMs (GPT-3, Llama-2) as end-to-end NERRE engines that output structured, flexible representations (e.g., JSON) and integrate normalization and error correction, filling a gap in practical, generalizable scientific information extraction.
Methodology
The authors propose LLM-NERRE, a sequence-to-sequence approach that fine-tunes pretrained LLMs (GPT-3 and Llama-2) to output structured summaries in user-defined schemas from scientific text passages. Three tasks are designed: (1) solid-state impurity doping (sentence-level), (2) metal-organic framework (MOF) information extraction (abstract-level), and (3) general materials information extraction (abstract-level). Output schemas include JSON (for all tasks) and English-sentence formats (for doping). For doping, entities are hosts and dopants linked by host–dopant relations; an extended schema (DopingExtra-English) also captures modifiers and results. For the general task, each JSON object represents a material with the fields formula, name, acronym, descriptions, structure_or_phase, and applications. For MOFs, each JSON object uses the MOF name (or formula) as the root entity, plus guest species, applications, and descriptions.

Training data consist of roughly 400–650 manually annotated prompt–completion pairs per task: 413 doping sentences (regex-filtered from 1,215) with a 77-sentence test set, ~634 general-task abstracts, and 507 MOF abstracts, the latter two evaluated over five random train/test splits. A human-in-the-loop process accelerates annotation: partially fine-tuned models prefill outputs that annotators correct, iteratively expanding the dataset.

GPT-3 (davinci) is fine-tuned via the OpenAI API for 7 epochs (doping) or 4 epochs (general/MOF), with a learning rate multiplier of 0.1 and prompt loss weight of 0.01; inference uses temperature 0 with a maximum of 512 output tokens for doping and 1,024 for the general and MOF tasks. Llama-2 70B is fine-tuned with LoRA (r=8, α=32, dropout 0.05) for the same numbers of epochs on a single A100 80 GB GPU, using greedy decoding and the same token limits. Evaluation employs two metrics: (1) strict exact word-match relation F1 for selected relations, and (2) manual scoring by a domain expert that credits normalization and error correction and checks correct grouping with root entities. Baselines are seq2rel (PubMedBERT-based) and MatBERT-Proximity (NER plus co-sentence linking).
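To make the general-materials schema concrete, below is a minimal sketch of a prompt–completion pair in the General-JSON format. The field names follow the schema described above; the passage and the extracted values are hypothetical and not taken from the paper's dataset.

```python
import json

# Hypothetical abstract excerpt (not from the paper's training data).
prompt = (
    "LiCoO2 (LCO) thin films with a layered structure were grown and "
    "evaluated as cathodes for lithium-ion batteries."
)

# Target completion: one JSON object per material, using the General-JSON fields.
completion = [
    {
        "formula": "LiCoO2",
        "name": "lithium cobalt oxide",
        "acronym": "LCO",
        "descriptions": ["thin film"],
        "structure_or_phase": ["layered"],
        "applications": ["lithium-ion battery cathode"],
    }
]

print(prompt)
print(json.dumps(completion, indent=2))
```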
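The reported Llama-2 LoRA settings (r=8, α=32, dropout 0.05) map directly onto the Hugging Face PEFT library. The sketch below only builds the adapter configuration; the base-model identifier and the choice of target modules are illustrative assumptions, not details confirmed by the paper.

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters as reported in the methodology: r=8, alpha=32, dropout=0.05.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed target modules; attention projections are a common default for Llama-style models.
    target_modules=["q_proj", "v_proj"],
)

# Wrapping a base model would look roughly like this (requires the gated
# Llama-2 weights, so it is left commented out):
# from transformers import AutoModelForCausalLM
# from peft import get_peft_model
# base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
# model = get_peft_model(base, lora_config)
print(lora_config)
```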
Key Findings
- On strict exact-match relation F1 (Table 2; a minimal scoring sketch follows this list):
- Doping (host–dopant): Llama-2/Doping-JSON achieved precision 0.836, recall 0.807, F1 0.821 (best overall for the task); GPT-3 scored F1 0.726.
- General task (formula relations, GPT-3): best F1s were formula–application 0.537, formula–structure/phase 0.482, formula–name 0.456, formula–acronym 0.333, formula–description 0.354. Llama-2 was typically 20–30% lower than GPT-3 on these relations.
- MOF task (name relations, GPT-3): name–guest species F1 0.616, name–application F1 0.573, name–formula F1 0.483, name–description F1 0.404.
- Manual scoring for the General-JSON task (10% random test subset, GPT-3) shows substantially higher performance (Table 3): formula F1 0.943; name F1 0.818; applications F1 0.832; structure/phase F1 0.829; description F1 0.704; acronym F1 0.500, indicating exact-match metrics are conservative lower bounds.
- Schema comparison for doping (Table 4): All LLM-NERRE models outperform baselines. Best exact-match F1s: Llama-2/Doping-JSON 0.821; Llama-2/Doping-English 0.814; GPT-3/DopingExtra-English 0.809; GPT-3/Doping-English 0.778; GPT-3/Doping-JSON 0.725. Baselines: MatBERT-Proximity F1 0.390; seq2rel F1 0.496.
- Human-in-the-loop annotation: prefilling annotations with intermediate models trained on n samples reduces annotation effort; at n=300, average annotation time per abstract decreased by 57% versus n=1, and per-token annotation time fell by ~60%, approaching verification-only time.
- Learning curve (GPT-3/Doping-English): Output structure becomes reliably learned around ~20 training samples; below ~10 samples, outputs are often unstructured, yielding near-zero scores.
- Normalization/error correction: The General-JSON model learns to correct and normalize entities (e.g., fixing spaced formulae like "Li Co O2" to "LiCoO2"; standardizing descriptors such as "Pt-functionalized"). Precision for most categories (except acronyms) is ~0.87 or higher in manual scoring, indicating extracted relations are largely correct when present.
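For reference, the strict exact-match relation F1 reported above can be read as simple set overlap between predicted and gold relation tuples, with no credit for normalization or paraphrase. A minimal sketch with hypothetical host–dopant pairs:

```python
def relation_f1(predicted, gold):
    """Strict exact-match precision/recall/F1 over relation pairs.

    Each relation is a (host, dopant) tuple; a prediction counts only if
    both strings match the gold annotation exactly, which makes this a
    conservative lower bound on true performance.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: one correct pair, one spurious, one missed.
gold = [("GaN", "Mg"), ("ZnO", "Al")]
predicted = [("GaN", "Mg"), ("GaN", "Si")]
print(relation_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```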
Discussion
The findings demonstrate that fine-tuned LLMs can accurately and flexibly extract complex, hierarchical scientific knowledge directly into user-defined schemas, addressing the limitations of traditional pipeline or tuple-encoding RE methods. High manual scores, especially for formula, applications, structure/phase, and descriptions, show that the models capture correct relations even when exact phrasing differs, validating the approach for practical database construction. Performance surpasses strong baselines while requiring only hundreds of annotated examples and minimal NLP expertise. The method also embeds normalization and error correction within the extraction step, reducing post-processing. Human-in-the-loop annotation with partially trained models significantly accelerates dataset creation, making large-scale curation feasible. Model comparisons suggest GPT-3 performs better with natural-language schemas while Llama-2 performs better with JSON on the doping task, and GPT-3 holds an advantage on the more complex abstract-level tasks. Overall, the approach provides an accessible route for domain experts to transform literature into structured, machine-usable knowledge across materials chemistry and potentially other scientific domains.
Conclusion
Fine-tuning large language models to output structured schemas enables joint extraction of entities and their relationships from scientific text, producing accurate, normalized, and parseable records with relatively small, task-specific training sets. The approach outperforms pipeline and seq2seq baselines on diverse materials tasks (doping, MOFs, general materials) and supports human-in-the-loop annotation to rapidly expand datasets. By treating the LLM as a black box that maps text to structured outputs, domain specialists can build large relational databases without extensive NLP expertise. Open-source weights and code (e.g., the Llama-2 LoRA variant) further support reproducibility and enable self-hosting. Future work includes mitigating remaining hallucinations and formatting failures, increasing token limits for longer passages, expanding domain coverage and schemas (e.g., additional entity types), and advancing open, cost-effective deployment options to balance ease-of-use and control.
Limitations
- Output formatting is not strictly enforced, occasionally yielding unparsable sequences (e.g., truncated JSON), especially with long inputs near token limits (512–1,024 tokens used here); a minimal defensive-parsing sketch follows this list.
- Token limits can truncate complex or dense abstracts, reducing parse success; larger context windows help but may still constrain some documents.
- Hallucinations: Models sometimes infer names or formulae not explicitly present in the text. While normalization can be beneficial, strict extraction may require enforcing word-for-word occurrence.
- Choice of LLM involves tradeoffs: proprietary APIs (ease-of-use) vs self-hosting (control, reproducibility, cost). Running and fine-tuning large open models demands substantial expertise and resources.
- Acronym extraction is weaker due to rarity and ambiguity (overlap with valid chemical formulae), suggesting a need for more training examples or disambiguation strategies.
- Exact-match evaluation underestimates true performance when normalization or equivalent phrasings occur, complicating automated benchmarking.
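The formatting and truncation issues above suggest guarding the parsing step when building a database from model completions. This is not part of the paper's pipeline; it is a minimal defensive-parsing sketch under the assumption that unparsable completions are simply flagged and skipped rather than repaired.

```python
import json


def parse_completion(text):
    """Try to parse a model completion as JSON; return None if unparsable.

    Truncated completions (e.g., cut off at the max-token limit) typically
    fail json.loads, so callers can log and skip them rather than crash.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical completions: the second is truncated mid-object.
ok = parse_completion('[{"formula": "LiCoO2", "applications": ["cathode"]}]')
bad = parse_completion('[{"formula": "LiCoO2", "applications": ["catho')
print(ok is not None, bad is None)  # True True
```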