Engineering and Technology
Extracting accurate materials data from research papers with conversational language models and prompt engineering
M. P. Polak and D. Morgan
ChatExtract, developed by Maciej P. Polak and Dane Morgan, automates accurate extraction of materials data from research papers using conversational large language models and engineered prompts. The approach achieves high precision and recall without fine-tuning and demonstrates how LLMs can streamline database construction in materials science.
Introduction
Automated data extraction is increasingly used to develop databases in materials science and other fields. Prior approaches based on NLP and earlier LMs often require extensive setup such as parsing rules, fine-tuning or retraining, and significant domain expertise. The emergence of broadly capable conversational LLMs (e.g., ChatGPT) creates opportunities for accurate, zero-shot extraction with minimal initial effort. This paper asks whether engineered conversational prompting can enable flexible, accurate, and efficient extraction of materials property triplets (Material, Value, Unit) while overcoming common LLM issues like hallucination and relation errors. The work introduces ChatExtract, a prompt-engineered workflow leveraging conversational memory and redundancy to achieve high precision and recall, and demonstrates its utility by constructing databases for critical cooling rates of metallic glasses and yield strengths of high-entropy alloys.
Literature Review
Many databases in materials science have been created via NLP and LMs, with recent LLMs enabling more accurate extraction of complex data. However, traditional methods often demand considerable setup, fine-tuning, and training data preparation. Prompt engineering is established in image generation and has been shown to improve LLM reasoning accuracy. Prior materials text-mining tools such as ChemDataExtractor2 (CDE2) provide rule-based/specifier approaches but can suffer lower recall and precision for certain properties. Recent works also explore fine-tuned LLMs and model-agnostic extraction pipelines. This study builds on these advances by using zero-shot conversational LLMs with engineered prompts and follow-up questions, emphasizing uncertainty induction and redundancy to reduce hallucinations and improve factual correctness without fine-tuning.
Methodology
ChatExtract is a zero-shot, prompt-engineered workflow for extracting (Material, Value, Unit) triplets from research articles with a conversational LLM. The pipeline consists of:
- Input and preprocessing: gather papers, strip metadata and HTML/XML markup, and split the text into sentences.
- Stage A (relevancy): apply a binary Yes/No prompt to each sentence to determine whether it contains a value and unit of the target property. Because relevant sentences are rare (roughly 1 in 100), early filtering is crucial.
- Passage expansion: for sentences classified as relevant, expand the context into a three-part passage consisting of the article title, the preceding sentence, and the sentence itself. This captures material names mentioned just before or in the title while keeping the passage short to preserve precision.
- Stage B (extraction): first determine whether the passage contains more than one value of the property. For single-valued passages, ask three direct questions to extract the value (number only), the unit (unit only), and the material name (name only), explicitly allowing a "None" answer to discourage hallucination; if any response is "None", discard the passage. For multi-valued passages, ask the model to produce a structured (Material, Value, Unit) table using only the given text, again explicitly allowing for absent data, then apply a series of strict Yes/No follow-up validation prompts to each row and field to verify that the extracted material is valid and present, that the unit is correct for the nth item, and that the value matches the nth material; if any answer is "No", discard that row.
The workflow enforces: splitting single-valued from multi-valued texts; explicit allowance for missing data to reduce hallucination; uncertainty-inducing, redundant follow-up questions; embedding all questions in a single conversation while repeating the analyzed text in each prompt to keep the model's attention on the details; and strict Yes/No answer formats to enable automation. Additional modules handle tables and figures: tables are classified for relevance and relevant columns are extracted directly into Material, Value, Unit without follow-ups (tables are already structured); figure captions are classified for relevance and the corresponding figures are collected for later manual extraction. Experiments used fixed model versions (e.g., gpt-3.5-turbo-0301, gpt-4-0314) with temperature=0 and top_p=1. Performance was assessed against hand-labeled ground truth using stringent equivalence criteria: identical values and units, and material names that uniquely identify the same system. A minimal sketch of the prompting loop follows below.
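The sketch below illustrates the Stage B loop for a single-valued passage against the OpenAI chat API. It is a minimal illustration, not the authors' released code: the prompt wording paraphrases the workflow described above rather than reproducing the paper's exact prompts, and the ask and extract_triplet helpers and the model name are assumptions made for this example.

```python
# Minimal sketch of the ChatExtract Stage B loop for a single-valued passage.
# Prompt wording paraphrases the workflow described above; it is not the
# authors' exact prompt text. Model name and helper functions are assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # the paper pinned fixed versions such as gpt-4-0314


def ask(messages):
    """Send the running conversation, append and return the model's reply."""
    resp = client.chat.completions.create(
        model=MODEL, messages=messages, temperature=0, top_p=1
    )
    reply = resp.choices[0].message.content.strip()
    messages.append({"role": "assistant", "content": reply})
    return reply


def extract_triplet(passage, prop="bulk modulus"):
    """Ask for value, unit, and material, then a redundant Yes/No check.

    Returns a dict with the extracted fields, or None if any step fails
    (missing data or a failed validation), mirroring the discard rules above.
    """
    msgs = [{
        "role": "user",
        "content": f'Consider the text: "{passage}"\n'
                   f"Does it report more than one value of the {prop}? Answer Yes or No.",
    }]
    if ask(msgs).lower().startswith("yes"):
        return None  # multi-valued passages go through the tabular path (not shown)

    fields = {}
    questions = [
        ("value", f"Give the number only of the {prop} value, or None if absent."),
        ("unit", f"Give the unit only of the {prop} value, or None if absent."),
        ("material", "Give the material name only, or None if absent."),
    ]
    for name, question in questions:
        # Repeat the analyzed text in every prompt to keep attention on it.
        msgs.append({"role": "user",
                     "content": f'Consider the text: "{passage}"\n{question}'})
        answer = ask(msgs)
        if answer.lower().startswith("none"):
            return None  # explicit allowance for missing data suppresses hallucination
        fields[name] = answer

    # Uncertainty-inducing follow-up asked as a strict Yes/No validation.
    msgs.append({
        "role": "user",
        "content": f'Consider the text: "{passage}"\n'
                   f'Is {fields["value"]} {fields["unit"]} the {prop} of '
                   f'{fields["material"]}? Answer Yes or No.',
    })
    if not ask(msgs).lower().startswith("yes"):
        return None
    return fields
```

Stage A relevancy filtering uses an analogous single Yes/No prompt per sentence and is omitted here for brevity.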
Key Findings
Bulk modulus benchmark (100 relevant sentences; 1912 irrelevant sentences containing numbers; 179 ground-truth triplets):
- ChatGPT-4 (gpt-4-0314): Overall P=90.8%, R=87.7%; Single-valued P=100%, R=100%; Multi-valued P=100%, R=82.7%.
- ChatGPT-3.5 (gpt-3.5-turbo-0301): Overall P=70.1%, R=65.4%; Single-valued P=100%, R=88.5%; Multi-valued P=97.3%, R=55.9%.
- LLaMA2-chat (70B): Overall P=61.5%, R=62.9%; Single-valued P=74.1%, R=87.7%; Multi-valued P=87.3%, R=53.5%.
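For reference, precision and recall throughout these results are standard definitions computed over (Material, Value, Unit) triplets, with a match defined by the equivalence criteria described in the Methodology:

```latex
P = \frac{\text{correctly extracted triplets}}{\text{all extracted triplets}}, \qquad
R = \frac{\text{correctly extracted triplets}}{\text{all ground-truth triplets}}
```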
Ablations:
- No follow-up prompts: GPT-4 Overall P=42.7%, R=98.9%; GPT-3.5 Overall P=26.5%, R=78.2% (precision collapses due to hallucinated data in non-relevant passages).
- No chat (new conversation each prompt, GPT-3.5): Overall P=70.0%, R=54.7% (recall drops significantly without conversational memory).
- CDE2 on bulk modulus passages: Overall P=57%, R=31%.
Tables and figures:
- Tables: From 163 tables, 58 manually classified as relevant, yielding 500 manual datapoints. Table classification: P=95%, R=98%. Per-datapoint extraction: P=91%, R=89%. Some errors due to malformed XML-to-text conversions.
- Figures: From 436 figures, 45 manually classified relevant. Figure caption classification: P=80%, R=82% (some misses where bulk modulus appears as fitted parameter not explicitly stated in captions).
Critical cooling rate (CCR) for metallic glasses:
- Manual ground-truth raw database (Rc1): 721 entries. ChatExtract raw database (Rc2): 634 entries; raw P=76.9%, R=63.1%. Cleaned: P=78.1%, R=64.3% (637 vs 553 entries compared). Standardized: P=91.9%, R=84.2% (313 vs 286 entries compared). Tables contributed 348 raw datapoints, adding 277 new standardized points and 97 new unique compositions. 208 figures were classified as relevant.
- Final standardized database: 557 datapoints; 309 unique triplets across papers; 222 unique compositions (many duplicates for Zr41.2Ti13.9Cu12.5Ni10Be22.5, Vit1). Values ranged from 10³ K s⁻¹ (Ni40P20Zr40) to 4.619×10⁻³ K s⁻¹ (CuZr2). After removing non-metallic oxides: 298 unique datapoints for 217 unique compositions. Significantly larger than prior hand-curated datasets (e.g., 77 entries).
- CDE2 on CCR: Overall P=49.2%, R=35.1%.
Yield strength of high-entropy alloys (HEAs):
- Search returned 4029 papers (840,431 sentences). Extracted 10,269 raw datapoints; 8,900 cleaned; 4,275 standardized. Restricting to HEAs (≥5 elements) yields 2,442 standardized HEA datapoints with 636 unique compositions. Values range from 12 MPa (Al0.4Co1Cu0.6Ni0.2Si0.2) to 19.16 GPa (Fe-Cr31Ni23Co34Mn5). The distribution peaks near 400 MPa, which is physically reasonable. Tables contributed 2,456 raw datapoints, adding 195 new unique HEA compositions. 1,848 figures were classified as relevant.
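The ≥5-element restriction above can be reproduced with a simple formula parser. The snippet below is a hypothetical sketch (count_elements and filter_heas are not from the paper) that counts distinct element symbols in a composition string and keeps only entries qualifying as HEAs.

```python
import re

# Hypothetical helper (not from the paper): count distinct chemical element
# symbols in a composition string such as "Al0.4Co1Cu0.6Ni0.2Si0.2".
# The bare [A-Z][a-z]? pattern is a simplification; it ignores parentheses,
# hydrates, and other notation that real composition strings may contain.
def count_elements(formula: str) -> int:
    symbols = re.findall(r"[A-Z][a-z]?", formula)
    return len(set(symbols))

# Keep only standardized entries that qualify as HEAs (>= 5 distinct elements).
def filter_heas(entries: list[dict]) -> list[dict]:
    return [e for e in entries if count_elements(e["material"]) >= 5]

if __name__ == "__main__":
    print(count_elements("Al0.4Co1Cu0.6Ni0.2Si0.2"))  # -> 5
```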
Discussion
The results show that a carefully engineered conversational prompting workflow can achieve high precision and recall for zero-shot extraction of Material–Value–Unit triplets. Redundant, uncertainty-inducing follow-up questions are critical to suppress hallucinations, especially in passages without data, markedly improving overall precision. The conversational memory of chat models helps maintain context across prompts, enhancing recall; removing chat history degrades performance. The approach transfers to different properties and data sources (text, tables, figures) and different models, with best performance from GPT-4 and reasonable results from open models like LLaMA2-chat. The method scales to real-world database construction, producing larger and physically sensible datasets than prior hand-curated efforts for CCR and HEA yield strength. While designed for triplets, the workflow can be generalized to extract multiplets (e.g., adding temperature/pressure) by extending prompts and validations, though accuracy of such generalizations was not assessed here. Overall, ChatExtract addresses the need for low-effort, accurate data extraction and is likely to improve further as LLMs advance.
Conclusion
Conversational LLMs combined with prompt engineering and strict follow-up verification (ChatExtract) can provide high-quality materials data extraction with no fine-tuning and minimal coding. On a challenging bulk modulus benchmark, GPT-4 achieved 90.8% precision at 87.7% recall. On a full CCR database construction, precision and recall reached 91.6% and 83.6%, respectively. Redundant follow-up questions and conversational information retention are key to performance gains. The method enabled construction of substantial databases for CCR of metallic glasses and yield strength of HEAs, including the largest HEA yield strength dataset to date. Given its model-agnostic nature, simplicity, and accuracy, ChatExtract and similar approaches are poised to replace more labor-intensive methods and will likely improve as LLM capabilities grow.
Limitations
Key limitations include:
- Ambiguities in material naming (families/groups vs exact compositions) and in value types (ranges/limits vs discrete values) reduce precision and recall in raw extraction; standardized databases require post-processing and sometimes manual intervention.
- The method relies on information present within short text passages (title, preceding sentence, target sentence); required context (e.g., material identity, conditions) may fall outside this window, lowering recall.
- The prompts were optimized primarily for GPT models; other LLMs (e.g., LLaMA2-chat) may need further prompt tuning.
- Removing follow-ups or conversational memory severely degrades performance, indicating sensitivity to workflow design.
- Figure data are only identified via captions; full figure data extraction remains an open challenge.
- Some errors stem from source formatting (e.g., malformed XML-to-text tables) beyond the method's control.
- Developing standardized databases from raw outputs currently requires additional prompt engineering, regex processing, and domain tools, which may introduce errors and require effort.
- Use of proprietary LLMs (e.g., GPT-4) raises concerns about access, version stability, and computational cost compared with lighter rule-based tools.