Chemistry

Nature Communications

Structured information extraction from scientific text with large language models

J. Dagdelen, A. Dunn, et al.

Unlock the potential of complex scientific knowledge extraction with a simple approach to joint named entity recognition and relation extraction, developed by John Dagdelen and colleagues. Explore how pre-trained language models can be fine-tuned to catalog extensive databases from research papers in materials chemistry.

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pre-trained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
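To make the "list of JSON objects" output format concrete, here is a minimal sketch of how such a completion might be parsed and checked for the dopant/host task. The key names `host` and `dopants`, the helper `parse_records`, and the sample completion string are all assumptions made for illustration; the abstract does not specify the exact schema the authors use.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema for the dopant/host linking task. These key names are
# assumptions for illustration, not the paper's actual output format.
REQUIRED_KEYS = {"host", "dopants"}


def parse_records(completion: str) -> List[Dict[str, Any]]:
    """Parse a model completion that should contain a JSON list of objects.

    Returns only the well-formed records. Returns an empty list if the text
    is not valid JSON or the top-level value is not a list.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    records = []
    for item in data:
        # Keep an item only if it is an object with every required key
        # and its "dopants" value is a list.
        if (
            isinstance(item, dict)
            and REQUIRED_KEYS.issubset(item)
            and isinstance(item["dopants"], list)
        ):
            records.append({"host": item["host"], "dopants": list(item["dopants"])})
    return records


# Made-up completion text for illustration (not output from the paper's models).
# The second object lacks "dopants", so it is dropped.
example_completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}, {"host": "TiO2"}]'
records = parse_records(example_completion)
```

Grouping each host with its own dopant list inside one object is what lets a single record carry both the entities and the relation between them, which is the point of doing the two tasks jointly. Validating each object before storing it matters in practice because a generative model can return malformed or incomplete JSON.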

Publisher

Nature Communications

Published On

Feb 15, 2024

Authors

John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, Anubhav Jain

DOI

https://doi.org/10.1038/s41467-024-45563-x

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Environmental Studies and Forestry

Testing the reliability of an AI-based large language model to extract ecological information from the scientific literature

A. V. Gougherty and H. L. Clipp

Medicine and Health

A scoping review of large language model based approaches for information extraction from radiology reports

D. Reichenpfader, H. Müller, et al.

Computer Science

Accelerating materials language processing with large language models

J. Choi and B. Lee

Computer Science

A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Y. Zhang, X. Chen, et al.
