Chemistry
Structured information extraction from scientific text with large language models
J. Dagdelen, A. Dunn, et al.
Unlock the potential of complex scientific knowledge extraction with a simple approach to joint named entity recognition and relation extraction, developed by John Dagdelen and colleagues. Explore how pre-trained language models can be fine-tuned to build extensive databases from research papers in materials chemistry.
~3 min • Beginner • English
Introduction
The paper addresses the challenge of transforming unstructured scientific text into structured knowledge suitable for downstream use (e.g., databases, supervised learning, knowledge graphs). Much of materials science knowledge is dispersed across text, tables, and figures, making it hard to leverage systematically. Prior work has focused largely on named entity recognition (NER), which identifies entities (e.g., materials, properties) but is rarely paired with robust relation extraction (RE) to link those entities. Traditional pipeline methods (NER followed by relation classification) can work on general corpora but struggle to model the complex, hierarchical relationships prevalent in materials science, where properties depend on composition, structure, morphology, processing history, and environment. Enumerating all possible n-ary relations is impractical. Large language models (LLMs) excel at sequence-to-sequence tasks and could enable end-to-end extraction of entities and their interrelations in flexible output formats. The study's purpose is to fine-tune LLMs to jointly extract named entities and their relationships from scientific text into predefined structured schemas (e.g., JSON), enabling accessible, accurate creation of specialized knowledge bases without extensive NLP expertise.
Literature Review
The authors survey advances in materials-focused NLP, noting effective NER for synthesis parameters, experimental results, and spectral data using methods ranging from regular expressions to BiLSTM and BERT models. However, RE, especially for complex, hierarchical relations, remains underexplored in materials science compared to general domains (e.g., people-organization-location-date linking). Prior pipeline RE methods perform document-level RE on general and biomedical corpora but are limited for intricate n-ary relations. Recent joint NERRE seq2seq approaches output tuples with predefined relations but still face limitations for complex hierarchies. In materials science, QA-based extraction (e.g., BatteryBERT) and prompt engineering with ChatGPT have shown promise but are constrained by device-level assumptions, pretraining scale, or a lack of structured hierarchical outputs. Domain-specific LLMs (e.g., the Darwin series) demonstrate broad capability but do not specifically target hierarchical relation extraction beyond their pretraining corpora. This work positions fine-tuned LLMs (GPT-3, Llama-2) as end-to-end NERRE engines that output structured, flexible representations (e.g., JSON) and integrate normalization and error correction, filling a gap in practical, generalizable scientific information extraction.
Methodology
The authors propose LLM-NERRE, a sequence-to-sequence approach that fine-tunes pretrained LLMs (GPT-3 and Llama-2) to output structured summaries in user-defined schemas from scientific text passages. Three tasks are designed: (1) solid-state impurity doping (sentence-level), (2) metal-organic framework (MOF) information extraction (abstract-level), and (3) general materials information extraction (abstract-level). Output schemas include JSON (for all tasks) and English-sentence formats (for doping). For doping, entities are hosts and dopants linked by host–dopant relations; an extended schema (DopingExtra-English) also captures modifiers and results. For the general task, each JSON object represents a material with the fields formula, name, acronym, descriptions, structure_or_phase, and applications. For MOFs, each JSON object uses the MOF name (or formula) as the root entity, plus guest species, applications, and descriptions.

Training data consist of roughly 400–650 manually annotated prompt–completion pairs per task: 413 doping sentences (regex-filtered from 1,215) with a 77-sentence test set, ~634 general-task abstracts, and 507 MOF abstracts, the latter two evaluated over five random train/test splits. A human-in-the-loop process accelerates annotation: partially fine-tuned models prefill outputs that annotators correct, iteratively expanding the dataset.

GPT-3 (davinci) is fine-tuned via the OpenAI API for 7 epochs (doping) or 4 epochs (general/MOF), with a learning rate multiplier of 0.1 and prompt loss weight of 0.01; inference uses temperature 0 with a maximum of 512 output tokens for doping and 1,024 for the general and MOF tasks. Llama-2 70B is fine-tuned with LoRA (r=8, α=32, dropout 0.05) for the same numbers of epochs on a single A100 80 GB GPU, using greedy decoding and the same token limits. Evaluation employs two metrics: (1) strict exact word-match relation F1 for selected relations, and (2) manual scoring by a domain expert that credits normalization and error correction and checks correct grouping with root entities. Baselines are seq2rel (PubMedBERT-based) and MatBERT-Proximity (NER plus co-sentence linking).
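To make the general-materials schema concrete, below is a minimal sketch of a prompt–completion pair in the General-JSON format. The field names follow the schema described above; the passage and the extracted values are hypothetical and not taken from the paper's dataset.

```python
import json

# Hypothetical abstract excerpt (not from the paper's training data).
prompt = (
    "LiCoO2 (LCO) thin films with a layered structure were grown and "
    "evaluated as cathodes for lithium-ion batteries."
)

# Target completion: one JSON object per material, using the General-JSON fields.
completion = [
    {
        "formula": "LiCoO2",
        "name": "lithium cobalt oxide",
        "acronym": "LCO",
        "descriptions": ["thin film"],
        "structure_or_phase": ["layered"],
        "applications": ["lithium-ion battery cathode"],
    }
]

print(prompt)
print(json.dumps(completion, indent=2))
```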
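The reported Llama-2 LoRA settings (r=8, α=32, dropout 0.05) map directly onto the Hugging Face PEFT library. The sketch below only builds the adapter configuration; the base-model identifier and the choice of target modules are illustrative assumptions, not details confirmed by the paper.

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters as reported in the methodology: r=8, alpha=32, dropout=0.05.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed target modules; attention projections are a common default for Llama-style models.
    target_modules=["q_proj", "v_proj"],
)

# Wrapping a base model would look roughly like this (requires the gated
# Llama-2 weights, so it is left commented out):
# from transformers import AutoModelForCausalLM
# from peft import get_peft_model
# base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
# model = get_peft_model(base, lora_config)
print(lora_config)
```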
Key Findings
- On strict exact-match relation F1 (Table 2; a minimal scoring sketch follows this list):
- Doping (host–dopant): Llama-2/Doping-JSON achieved precision 0.836, recall 0.807, F1 0.821 (best overall for the task); GPT-3 scored F1 0.726.
- General task (formula relations, GPT-3): best F1s were formula–application 0.537, formula–structure/phase 0.482, formula–name 0.456, formula–acronym 0.333, formula–description 0.354. Llama-2 was typically 20–30% lower than GPT-3 on these relations.
- MOF task (name relations, GPT-3): name–guest species F1 0.616, name–application F1 0.573, name–formula F1 0.483, name–description F1 0.404.
- Manual scoring for the General-JSON task (10% random test subset, GPT-3) shows substantially higher performance (Table 3): formula F1 0.943; name F1 0.818; applications F1 0.832; structure/phase F1 0.829; description F1 0.704; acronym F1 0.500, indicating exact-match metrics are conservative lower bounds.
- Schema comparison for doping (Table 4): All LLM-NERRE models outperform baselines. Best exact-match F1s: Llama-2/Doping-JSON 0.821; Llama-2/Doping-English 0.814; GPT-3/DopingExtra-English 0.809; GPT-3/Doping-English 0.778; GPT-3/Doping-JSON 0.725. Baselines: MatBERT-Proximity F1 0.390; seq2rel F1 0.496.
- Human-in-the-loop annotation: prefilling annotations with intermediate models trained on n samples reduces annotation effort; at n=300, average annotation time per abstract decreased by 57% versus n=1, and per-token annotation time fell by ~60%, approaching verification-only time.
- Learning curve (GPT-3/Doping-English): Output structure becomes reliably learned around ~20 training samples; below ~10 samples, outputs are often unstructured, yielding near-zero scores.
- Normalization/error correction: The General-JSON model learns to correct and normalize entities (e.g., fixing spaced formulae like "Li Co O2" to "LiCoO2"; standardizing descriptors such as "Pt-functionalized"). Precision for most categories (except acronyms) is ~0.87 or higher in manual scoring, indicating extracted relations are largely correct when present.
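For reference, the strict exact-match relation F1 reported above can be read as simple set overlap between predicted and gold relation tuples, with no credit for normalization or paraphrase. A minimal sketch with hypothetical host–dopant pairs:

```python
def relation_f1(predicted, gold):
    """Strict exact-match precision/recall/F1 over relation pairs.

    Each relation is a (host, dopant) tuple; a prediction counts only if
    both strings match the gold annotation exactly, which makes this a
    conservative lower bound on true performance.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: one correct pair, one spurious, one missed.
gold = [("GaN", "Mg"), ("ZnO", "Al")]
predicted = [("GaN", "Mg"), ("GaN", "Si")]
print(relation_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```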
Discussion
The findings demonstrate that fine-tuned LLMs can accurately and flexibly extract complex, hierarchical scientific knowledge directly into user-defined schemas, addressing the limitations of traditional pipeline or tuple-encoding RE methods. High manual scores, especially for formula, applications, structure/phase, and descriptions, show that the models capture correct relations even when exact phrasing differs, validating the approach for practical database construction. Performance surpasses strong baselines while requiring only hundreds of annotated examples and minimal NLP expertise. The method also embeds normalization and error correction within the extraction step, reducing post-processing. Human-in-the-loop annotation with partially trained models significantly accelerates dataset creation, making large-scale curation feasible. Model comparisons suggest GPT-3 performs better with natural-language schemas while Llama-2 performs better with JSON on the doping task, and GPT-3 holds an advantage on the more complex abstract-level tasks. Overall, the approach provides an accessible route for domain experts to transform literature into structured, machine-usable knowledge across materials chemistry and potentially other scientific domains.
Conclusion
Fine-tuning large language models to output structured schemas enables joint extraction of entities and their relationships from scientific text, producing accurate, normalized, and parseable records with relatively small, task-specific training sets. The approach outperforms pipeline and seq2seq baselines on diverse materials tasks (doping, MOFs, general materials) and supports human-in-the-loop annotation to rapidly expand datasets. By treating the LLM as a black box that maps text to structured outputs, domain specialists can build large relational databases without extensive NLP expertise. Open-source weights and code (e.g., the Llama-2 LoRA variant) further support reproducibility and enable self-hosting. Future work includes mitigating remaining hallucinations and formatting failures, increasing token limits for longer passages, expanding domain coverage and schemas (e.g., additional entity types), and advancing open, cost-effective deployment options to balance ease-of-use and control.
Limitations
- Output formatting is not strictly enforced, occasionally yielding unparsable sequences (e.g., truncated JSON), especially with long inputs near token limits (512–1,024 tokens used here); a minimal defensive-parsing sketch follows this list.
- Token limits can truncate complex or dense abstracts, reducing parse success; larger context windows help but may still constrain some documents.
- Hallucinations: Models sometimes infer names or formulae not explicitly present in the text. While normalization can be beneficial, strict extraction may require enforcing word-for-word occurrence.
- Choice of LLM involves tradeoffs: proprietary APIs (ease-of-use) vs self-hosting (control, reproducibility, cost). Running and fine-tuning large open models demands substantial expertise and resources.
- Acronym extraction is weaker due to rarity and ambiguity (overlap with valid chemical formulae), suggesting a need for more training examples or disambiguation strategies.
- Exact-match evaluation underestimates true performance when normalization or equivalent phrasings occur, complicating automated benchmarking.
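The formatting and truncation issues above suggest guarding the parsing step when building a database from model completions. This is not part of the paper's pipeline; it is a minimal defensive-parsing sketch under the assumption that unparsable completions are simply flagged and skipped rather than repaired.

```python
import json


def parse_completion(text):
    """Try to parse a model completion as JSON; return None if unparsable.

    Truncated completions (e.g., cut off at the max-token limit) typically
    fail json.loads, so callers can log and skip them rather than crash.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical completions: the second is truncated mid-object.
ok = parse_completion('[{"formula": "LiCoO2", "applications": ["cathode"]}]')
bad = parse_completion('[{"formula": "LiCoO2", "applications": ["catho')
print(ok is not None, bad is None)  # True True
```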