Computer Science
Accelerating materials language processing with large language models
J. Choi and B. Lee
Jaewoong Choi and Byungju Lee apply generative pre-trained transformers to materials language processing, showing that text classification, named entity recognition, and extractive question answering can match or exceed BERT-based baselines while requiring far less labeled data.
Introduction
The study addresses the challenge of extracting structured, reliable knowledge from the rapidly growing materials science literature using materials language processing (MLP). Traditional deep learning approaches for MLP often require large, carefully annotated datasets and extensive model fine-tuning, which are costly and time-consuming. The authors propose leveraging large language models (LLMs), specifically GPT-3/3.5/4, to perform core MLP tasks with minimal labeled data via embeddings, zero-shot and few-shot prompting, and targeted fine-tuning. The work focuses on text classification and information extraction (named entity recognition and extractive question answering), aiming to demonstrate performance, reliability, and practicality, and to provide guidelines that enable materials scientists to apply LLMs effectively without deep ML expertise.
Literature Review
Prior MLP efforts have shown that text embeddings can capture domain knowledge such as periodic table relationships. In materials science, text classification has been used to filter relevant documents and identify paragraphs of interest, with methods ranging from topic models and classical machine learning to BERT-based classifiers. Information extraction, especially NER, has been advanced by domain-adapted language models including SciBERT, MatBERT, MatSciBERT, and MaterialsBERT, enabling extraction of entities related to solid-state materials, doping, nanoparticles, polymers, electrocatalysis, and fuel cells. Extractive QA has seen comparatively less attention in materials literature, though battery-related QA models have been explored. These prior works highlight effectiveness but also dependence on large labeled corpora and complex architectures, motivating exploration of GPT-based approaches that can operate with limited supervision.
Methodology
Overview and workflow: The MLP workflow comprises data collection, preprocessing, text classification, information extraction, and downstream data mining. This work concentrates on text classification and information extraction using GPT models via zero-shot, few-shot, and fine-tuning strategies.
Text classification dataset and models: A binary classification dataset of 46,663 papers labeled battery vs non-battery (split 70/20/10) was used, where previous BERT-based models achieved state-of-the-art performance. The authors evaluated: (1) zero-shot classification using GPT embeddings (text-embedding-ada-002) by computing similarity between text and label descriptors in embedding space; (2) few-shot prompting with GPT-3.5 (text-davinci-003) and GPT-4 (gpt-4-0613) using 2-way 1-shot and 2-way 5-shot examples; and (3) fine-tuning GPT-3 (davinci) with 1000 prompt-completion examples for binary classification. They also assessed model calibration via expected calibration error (ECE) using log probabilities.
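As a concrete illustration, the sketch below shows how the 2-way 1-shot prompting setup might look. It assumes the legacy OpenAI Python interface (v0.x); the prompt wording, example abstracts, and hyperparameters are illustrative, not the authors' exact templates.

```python
# Sketch of 2-way 1-shot classification with GPT-3.5 (text-davinci-003).
# Assumes the legacy OpenAI Python library (v0.x); prompt wording is illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def classify_one_shot(abstract: str) -> str:
    prompt = (
        "Classify each paper abstract as 'battery' or 'non-battery'.\n\n"
        "Abstract: We study LiFePO4 cathodes for lithium-ion cells...\n"
        "Label: battery\n\n"
        "Abstract: We report a psychological survey of sleep quality...\n"
        "Label: non-battery\n\n"
        f"Abstract: {abstract}\n"
        "Label:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,   # deterministic output
        max_tokens=3,
        logprobs=1,      # token log probabilities, usable for calibration (ECE)
    )
    return resp["choices"][0]["text"].strip()
```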
Zero-shot label engineering: Multiple negative label formulations were tested to optimize separation in embedding space: original labels battery vs non-battery; crude labels battery materials vs diverse domains; designated labels battery materials vs medical and psychological research; and verbose labels papers related to battery energy materials vs medical and psychological research.
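The label formulations can be compared with a simple embedding-plus-cosine-similarity routine. The sketch below assumes the legacy OpenAI Python interface (v0.x) and paraphrases the four label sets described above; the exact descriptor wording used by the authors is an assumption.

```python
# Sketch of embedding-based zero-shot classification with engineered labels.
# Label descriptors paraphrase the four formulations above; wording is an assumption.
import numpy as np
import openai

LABEL_SETS = {
    "original":   ("battery", "non-battery"),
    "crude":      ("battery materials", "diverse domains"),
    "designated": ("battery materials", "medical and psychological research"),
    "verbose":    ("papers related to battery energy materials",
                   "medical and psychological research"),
}

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp["data"][0]["embedding"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(abstract: str, label_set: str = "verbose") -> str:
    pos_label, neg_label = LABEL_SETS[label_set]
    doc = embed(abstract)
    # The label whose descriptor lies closest in embedding space wins.
    pos_sim = cosine(doc, embed(pos_label))
    neg_sim = cosine(doc, embed(neg_label))
    return "battery" if pos_sim > neg_sim else "non-battery"
```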
NER datasets and GPT setups: Three datasets were used. (1) Solid-state materials: 800 annotated abstracts with categories MAT, SPL, DSC, PRO, APL, SMT, CMT. Each category was handled by fine-tuning GPT-3 (davinci) using prompt-completion pairs created by inserting special tokens around target entities in completions while keeping prompts as raw text. (2) Doped materials: 450 annotations labeled BASEMAT, DOPANT, DOPMODQ, split 8:1:1; fine-tuned GPT-3 (davinci) similarly. (3) AuNPs morphologies: small dataset (N=85) with DES and MOR categories; applied few-shot prompting with GPT-3.5 (text-davinci-003) using three retrieval strategies for example selection: random retrieval, task-informed random retrieval (explicit instruction in prompt), and kNN retrieval of similar examples based on document embeddings or TF-IDF similarity; also evaluated GPT-4 prompts.
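For the fine-tuned NER setups, one plausible way to build prompt-completion pairs is to keep the abstract as the raw prompt and wrap gold entity spans with marker tokens in the completion. The marker strings, suffix delimiter, and JSONL layout below are assumptions for illustration, not the authors' exact format.

```python
# Sketch of building a GPT-3 fine-tuning record for one NER category (e.g., MAT).
# The "@@ ... ##" markers and the "\n\n###\n\n" suffix are illustrative choices.
import json

def make_record(abstract: str, entities: list[tuple[int, int]]) -> dict:
    """entities: character (start, end) offsets of gold spans in the abstract."""
    completion = ""
    cursor = 0
    for start, end in sorted(entities):
        # Copy text up to the entity, then wrap the entity in marker tokens.
        completion += abstract[cursor:start] + " @@" + abstract[start:end] + "## "
        cursor = end
    completion += abstract[cursor:]
    return {
        "prompt": abstract + "\n\n###\n\n",       # raw text plus a stop marker
        "completion": " " + completion + " END",  # leading space aids tokenization
    }

record = make_record("LiCoO2 thin films were annealed at 700 C.", [(0, 6)])
with open("mat_train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```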
Extractive QA dataset and GPT setups: A battery device QA dataset with questions about anode, cathode, and electrolyte and associated contexts was curated. The public set (427 annotations) was refined to 331 QA items after removing duplicates and incorrect/unanswerable annotations, guided by GPT outputs (final counts: anode 90, cathode 161, electrolyte 80). Models evaluated included zero-shot GPT-3.5 (text-davinci-003), task-informed zero-shot with explicit extract instruction, fine-tuned GPT-3 (davinci), and reproduced BERT-family baselines for comparison. QA performance was measured via token-level precision, recall, and F1.
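The contrast between plain zero-shot and task-informed zero-shot QA can be captured by toggling an explicit extraction instruction in the prompt, as in this sketch (legacy OpenAI Python interface, illustrative wording):

```python
# Sketch contrasting plain and task-informed zero-shot extractive QA prompts.
# Assumes the legacy OpenAI Python library (v0.x); instruction text is illustrative.
import openai

def answer(question: str, context: str, task_informed: bool = True) -> str:
    instruction = (
        "Extract the answer verbatim from the context; "
        "if the context does not contain the answer, reply 'unanswerable'.\n\n"
        if task_informed
        else ""
    )
    prompt = f"{instruction}Context: {context}\nQuestion: {question}\nAnswer:"
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=64,
    )
    return resp["choices"][0]["text"].strip()

print(answer("What is the cathode?",
             "The cell pairs a LiNi0.8Mn0.1Co0.1O2 cathode with a graphite anode."))
```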
Implementation details: Experiments used the OpenAI Python library. For fine-tuning, base GPT-3 models (ada, babbage, curie, davinci) were available; davinci was used. Prompts used suffix markers (e.g., a delimiter) to signal completion start/end; completions start with a whitespace token for efficient tokenization. Few-shot prompts combined a task description, a handful of labeled examples, and the input. kNN retrieval of few-shot examples used TF-IDF similarity. Inference hyperparameters were configured for deterministic extraction: temperature 0, max tokens sized to training completions (e.g., 256), and appropriate top-p settings. Calibration used ECE with 10 bins, comparing average confidence to empirical accuracy across bins.
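The kNN example retrieval step could be implemented with a standard TF-IDF vectorizer; this scikit-learn sketch (function name hypothetical) selects the k labeled examples most similar to a query paragraph:

```python
# Sketch of kNN retrieval of few-shot examples via TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(query: str, pool_texts: list[str], k: int = 5) -> list[int]:
    """Return indices of the k labeled examples most similar to the query text."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(pool_texts + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sims.argsort()[::-1][:k].tolist()

# The selected examples are then pasted into the few-shot prompt ahead of the query.
```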
Evaluation: Text classification metrics included accuracy, precision, and recall. NER used exact or boundary-relaxed matching to compute precision, recall, and F1 by comparing predicted and gold entity spans. QA used token-level precision/recall averaged across samples. Calibration reported ECE; lower values indicate better alignment between confidence and accuracy.
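A minimal implementation of ECE with 10 equal-width confidence bins, consistent with the description above (function name hypothetical):

```python
# Expected calibration error (ECE): weighted gap between confidence and accuracy per bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """confidences: predicted probability of the chosen label; correct: 0/1 outcomes."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece
```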
Key Findings
Text classification:
- Zero-shot classification with GPT embeddings improved markedly as label engineering made the label pair more semantically distinct:
  - original labels (battery vs non-battery): 63.2% accuracy and precision;
  - crude labels (battery materials vs diverse domains): 87.3% accuracy, 84.5% precision, 97.9% recall;
  - designated labels (battery materials vs medical and psychological research): 91.0% accuracy, 98.3% precision, 88.6% recall;
  - verbose labels (papers related to battery energy materials vs medical and psychological research): 93.0% accuracy, 90.8% precision, 98.9% recall.
- Few-shot GPT-3.5 (text-davinci-003) 2-way 1-shot achieved 95.7% accuracy; 2-way 5-shot achieved 96.1% accuracy, 95.0% precision, 99.1% recall. GPT-4 (gpt-4-0613) with 5-shot prompting slightly improved precision (95.4%) and accuracy (96.3%) over GPT-3.5.
- Fine-tuned GPT-3 (davinci) reached 96.6% accuracy, 95.8% precision, 98.9% recall, approaching prior SOTA BERT results, with far less labeled data.
- Calibration (ECE, lower is better): SOTA BatteryBERT-cased had 0.03; GPT 2-way 1-shot 0.05; 2-way 5-shot 0.07; fine-tuned GPT 0.07, indicating good calibration for GPT approaches (<0.1).
NER:
- Solid-state materials dataset: GPT-3 (davinci) fine-tuned per category achieved higher precision than the SOTA (MatBERT-uncased) in every category (MAT, SPL, DSC, PRO, APL, SMT, CMT) and higher F1 across all of them. GPT recall was higher for CMT, SMT, and SPL and slightly lower for DSC, MAT, and PRO, but the strong precision carried overall F1 above the baseline.
- Doped materials dataset: GPT-3 (davinci) fine-tuning produced high precision but moderate recall. Precision: BASEMAT 93.4%, DOPANT 95.6%, DOPMODQ 92.7%; recall: BASEMAT 62.0%, DOPANT 64.4%, DOPMODQ 59.4%; F1: BASEMAT 74.6%, DOPANT 77.0%, DOPMODQ 72.4%, surpassing or approaching the SOTA MatBERT-uncased F1 scores (72%, 82%, 62%).
- AuNPs dataset (N=85): Few-shot GPT-3.5 with random retrieval yielded high recall (DES 63%, MOR 97%) and improved over prior recalls (DES 52%, MOR 64%). Adding task instructions increased precision by roughly 6–11% and recall by 1–2%. kNN retrieval of similar examples maintained recall while boosting precision by 15–20%. Overall, few-shot GPT-3.5 achieved comparable MOR F1 and improved DES F1 relative to SOTA; applying GPT-4 further improved DES and MOR extraction.
Extractive QA:
- Re-evaluated BERT models confirmed BatteryBERT (cased) as prior SOTA with F1 74.48% (precision 77.49%, recall 71.69%).
- Zero-shot GPT-3.5 achieved precision 60.92%, recall 79.96%, F1 69.15%.
- Task-informed zero-shot improved to precision 72.89%, recall 80.65%, F1 76.57%.
- Fine-tuned GPT-3 (davinci) achieved precision 88.07%, recall 88.35%, F1 88.21%, outperforming SOTA.
- GPT models identified annotation errors and unanswerable questions, enabling curation of the QA dataset down to 331 high-quality items (anode 90, cathode 161, electrolyte 80).
Discussion
The results demonstrate that GPT-based strategies can deliver high-performance MLP with minimal labeled data. For text classification, careful label engineering in the embedding space enables strong zero-shot performance, and adding a handful of examples yields near-SOTA accuracy and robust calibration, making GPT approaches practical for document filtering and paragraph selection without extensive annotation. For NER, fine-tuned GPT models achieve higher precision and improved F1 compared to BERT-based SOTA across solid-state categories, while few-shot prompting with similar example retrieval substantially boosts performance on small datasets such as AuNPs. The precision–recall trade-off observed in doped materials highlights generative GPT strengths in precise extraction but some difficulty with long or ambiguous entities, consistent with architectural differences from CRF-based sequence taggers. For extractive QA, explicit task-oriented prompting and modest fine-tuning allow GPT to surpass specialized BERT baselines, and importantly, the generative reasoning helps detect and correct misannotations, improving data quality. Collectively, these findings address the core goal of reducing annotation burdens and complexity in MLP while maintaining accuracy and reliability, thereby broadening accessibility of advanced text mining to materials scientists.
Conclusion
The study presents a GPT-enabled pipeline for materials language processing that effectively tackles text classification, named entity recognition, and extractive question answering with limited supervision. Key contributions include: (1) zero-shot and few-shot text classification with high accuracy and good calibration via embedding-based label design and example prompting; (2) NER methods using fine-tuned GPT and few-shot prompts with similar example retrieval that match or exceed BERT-based SOTA F1, especially improving precision; and (3) extractive QA that, with task-informed prompts and fine-tuning, outperforms prior specialized models and aids in correcting mislabeled data. These outcomes can reduce manual labeling workloads and lower barriers for domain researchers.
Future directions include extending GPT-based approaches to more complex, relation- and event-centric IE tasks, developing procedures to ensure faithfulness to source text, and refining prompt design and example retrieval strategies. Ongoing benchmarking and monitoring will be important as proprietary LLMs evolve.
Limitations
GPT models are generative and may produce confident but unfaithful outputs, necessitating post hoc verification against source text, particularly in information extraction. As closed, evolving systems, GPT model performance and behavior may drift over time, requiring continuous monitoring and recalibration. GPT approaches showed relatively lower recall on long or ambiguous entities compared to CRF-augmented BERT models. Extending to more complex, domain-specific tasks such as relation extraction and event detection is challenging because conveying intricate patterns and constraints purely via prompts is difficult. Results may also reflect biases present in training data.
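A simple post hoc faithfulness check, in the spirit of the verification described here (not the authors' procedure), is to keep only outputs that appear verbatim in the source text:

```python
# Minimal faithfulness filter: retain only outputs that occur verbatim
# (case-insensitively, whitespace-normalized) in the source passage.
import re

def _norm(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

def is_faithful(extracted: str, source: str) -> bool:
    return _norm(extracted) in _norm(source)

def filter_faithful(outputs: list[str], source: str) -> list[str]:
    return [o for o in outputs if is_faithful(o, source)]
```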