Opportunities for Retrieval and Tool Augmented Large Language Models in Scientific Facilities

M. H. Prince, H. Chan, et al.

Discover how advanced scientific user facilities are growing more complex, making experiments increasingly challenging to plan and run. Learn about the Context-Aware Language Model for Science (CALMS), developed by Michael H. Prince, Henry Chan, and colleagues, which uses large language models to enhance instrument operations and scientific workflows.

Introduction
The study investigates how large language models (LLMs) augmented with retrieval and tool-use capabilities can assist users of advanced scientific facilities in planning and conducting experiments. The context is the growing complexity of instruments at user facilities (e.g., x-ray light sources, nanoscience centers, and neutron sources), which raises barriers for users to design, configure, and operate experiments effectively. The purpose is to develop and demonstrate a context-aware system (CALMS) that combines LLMs with document retrieval and integration with scientific software/hardware tools to reduce hallucinations, provide accurate facility-specific guidance, and enable semi-autonomous instrument operation. The importance lies in broadening and accelerating scientific discovery by lowering the operational burden on users, aiding experimental design, and improving access to accurate, facility-specific knowledge while acknowledging risks such as hallucinations without proper context.
Literature Review
The paper situates its work within the rapid adoption of LLMs across sectors and their emerging roles in science for literature search, experimental design, data summarization, and writing/editing. Prior studies highlight LLMs’ capability for few-shot learning, materials property prediction, and inverse design. It contrasts closed-source, high-performing models (e.g., GPT-3.5/4) with open-source alternatives (e.g., Vicuna, Llama-family models), noting issues of transparency, ethics, compute requirements, and adaptability via fine-tuning. Community platforms like HuggingFace provide datasets, models, and evaluation leaderboards for open LLMs. Retrieval-Augmented Generation (RAG) is emphasized as a practical approach for injecting domain knowledge without expensive fine-tuning, with surveys and prior work showing gains in knowledge-intensive tasks. Tool-augmented LLMs (e.g., Toolformer, ReAct-style approaches) and applications in robotics demonstrate that structured tool-use and prompting can extend LLMs beyond text reasoning to real-world actions. Benchmarks such as MMLU, HellaSwag, WinoGrande, GSM-8K, and others are referenced to contextualize model capabilities, with reported performance ranges indicating that closed-source models often outperform open-source counterparts, though the gap is narrowing with improved training, instruction tuning, RLHF, and chain-of-thought prompting.
Methodology
The authors develop CALMS (Context-Aware Language Model for Science), comprising: (1) an LLM backend (tested with OpenAI GPT-3.5 Turbo and the open-source Vicuna); (2) a memory component to maintain conversational state; (3) a document store for facility and instrument documentation; and (4) an experiment-planning assistant for user-specific guidance. The framework is model-agnostic and supports swapping LLMs.
- Context retrieval (RAG): Facility documentation is preprocessed into chunks, embedded with lightweight embedding models (OpenAI embeddings for the OpenAI tests; all-mpnet-base-v2 for the open-source tests), and stored in a vector database (e.g., ChromaDB). User queries are embedded at runtime; the top-k nearest chunks are retrieved and injected into the LLM prompt to work around context-window limits and reduce hallucinations (see the retrieval sketch after this list).
- Conversational memory: To mitigate limited context windows, CALMS applies a moving window over the conversation history, replaying the last K=6 user–assistant exchanges as short-term memory (sketched below).
- Tool augmentation: CALMS exposes structured tools to the LLM, including a Materials Project API call (GetLatticeConstants) and an instrument-control interface via SPEC (SetDiffractometer). The system uses Chain-of-Thought and ReAct prompting to select and sequence tool calls: a parser interprets tool-call intents in the LLM output, executes the tool, and feeds the result back into the evolving prompt until a final response is produced (see the loop sketched below). The implementation uses LangChain's structured-input ReAct agent, which requires valid JSON tool-call arguments.
- Model comparison: The team compares GPT-3.5 Turbo and Vicuna on experimental-planning assistance, operational guidance, and instrument driving. Responses are graded for relevance (Rel), presence of hallucinations (Hal), and completeness (Com, 0–5 scale). Tests run with and without context retrieval to assess the impact of RAG.
- Demonstrations: (i) experimental-planning Q&A at facilities such as APS and CNM; (ii) operational assistance for starting tomography scans (GUI and CLI paths via tomoscan); (iii) automated execution: a real diffractometer move on APS beamline 34, where CALMS parses a user query (material + Bragg peak), retrieves lattice constants via the Materials Project, computes motor positions via SPEC, and moves the diffractometer accordingly.
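To make the retrieval step concrete, here is a minimal Python sketch assuming ChromaDB and the all-mpnet-base-v2 embedder used for the open-source tests; the collection name, sample chunks, prompt template, and helper function are illustrative, not taken from the paper.

```python
# Minimal RAG sketch: embed facility docs, retrieve the top-k chunks for a
# query, and inject them into the LLM prompt. Collection name and chunks are
# made up for illustration.
import chromadb
from chromadb.utils import embedding_functions

# all-mpnet-base-v2 via sentence-transformers, as in the paper's open-source tests
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)

client = chromadb.Client()  # in-memory vector store
docs = client.create_collection("facility_docs", embedding_function=embed_fn)

# Facility documentation, pre-split into chunks upstream of this step
docs.add(
    ids=["tomoscan-gui", "tomoscan-cli"],
    documents=[
        "To start a tomography scan at the APS, launch the tomoscan GUI and ...",
        "tomoscan can also be driven from the command line with a scan config ...",
    ],
)

def build_prompt(question: str, k: int = 2) -> str:
    """Embed the query, pull the k nearest chunks, and build the final prompt."""
    hits = docs.query(query_texts=[question], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    return (
        "Answer using only the facility documentation below.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How do I start a tomography scan?"))
```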
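The moving-window memory can be as simple as a bounded queue of exchanges. The deque-based helper below sketches that behavior; the chat-message format is an assumption, not quoted from the paper.

```python
# Moving-window conversational memory: only the last K user-assistant exchanges
# are replayed into each new prompt, so older turns fall out of scope.
from collections import deque

K = 6  # window size used by CALMS

class WindowMemory:
    def __init__(self, k: int = K):
        self.exchanges = deque(maxlen=k)  # oldest exchange is evicted automatically

    def record(self, user_msg: str, assistant_msg: str) -> None:
        self.exchanges.append((user_msg, assistant_msg))

    def as_messages(self) -> list:
        """Flatten the window into chat messages for the next LLM call."""
        messages = []
        for user_msg, assistant_msg in self.exchanges:
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": assistant_msg})
        return messages

memory = WindowMemory()
memory.record("What is tomoscan?", "tomoscan runs tomography scans at the APS ...")
turn = memory.as_messages() + [{"role": "user", "content": "How do I launch its GUI?"}]
```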
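The tool-augmentation loop follows the parse-execute-observe cycle described above. The paper implements it with LangChain's structured-input ReAct agent; the hand-rolled loop below is only an illustrative stand-in, with dummy tool bodies and a placeholder llm callable in place of the real Materials Project API, SPEC interface, and model.

```python
# Illustrative ReAct-style loop: the model emits one JSON action per step, the
# parser runs the matching tool, and the observation is appended to the prompt
# until a "Final Answer" action appears. Tool bodies and `llm` are stand-ins.
import json

def get_lattice_constants(material: str) -> dict:
    # stand-in for the Materials Project lookup behind GetLatticeConstants
    return {"a": 5.43, "b": 5.43, "c": 5.43}  # illustrative values (Angstrom)

def set_diffractometer(h: int, k: int, l: int, a: float) -> str:
    # stand-in for the SPEC-driven motor move behind SetDiffractometer
    return f"diffractometer positioned for the ({h}{k}{l}) reflection, a={a}"

TOOLS = {
    "GetLatticeConstants": get_lattice_constants,
    "SetDiffractometer": set_diffractometer,
}

def run_agent(llm, user_query: str, max_steps: int = 5) -> str:
    """Drive the LLM until it returns a final answer or exhausts the budget."""
    transcript = f"Question: {user_query}\n"
    for _ in range(max_steps):
        reply = llm(transcript)  # model must emit one valid JSON object per step
        step = json.loads(reply)
        if step["action"] == "Final Answer":
            return step["action_input"]
        observation = TOOLS[step["action"]](**step["action_input"])
        transcript += f"{reply}\nObservation: {observation}\n"
    raise RuntimeError("no final answer within the step budget")
```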
Key Findings
- Context-aware responses: With accurate retrieved context, CALMS provided facility-specific, relevant, and largely hallucination-free answers. GPT-3.5 Turbo consistently achieved higher completeness (often Com=5) than Vicuna.
- Without context, hallucinations and irrelevancies increased: examples include fabricated or misattributed tools for CNM image simulation (e.g., "CNM-ImageSim," "ImageWS," an incorrect reference to "ImageNet") and off-domain answers (e.g., medical CT procedures instead of APS tomography operations).
- Operational guidance: With context, both models gave correct APS tomoscan procedures; GPT-3.5 provided more complete answers, including GUI and CLI usage and pointers to documentation.
- Tool use and automation: Using GPT-3.5 with ReAct and structured JSON tool calls, CALMS executed a diffractometer move on APS beamline 34 from a single user prompt by chaining Materials Project lattice retrieval and SPEC diffractometer control. Open-source models struggled to reliably produce valid structured tool calls.
- Benchmarking context: Reported ranges from Table 1 include MMLU ≈56.67–70.00%, HellaSwag ≈81.24–85.5%, ARC Challenge ≈57.09–85.2%, WinoGrande ≈74.66–81.6%, and GSM-8K ≈11.39–51.7%, reflecting stronger performance by the closed-source model across several benchmarks.
- User-facility impact: CALMS can help users navigate proposal processes, safety and operations, experiment design (instrument, modality, sample preparation), and real-time instrument operation, potentially broadening access and improving throughput.
Discussion
The findings demonstrate that retrieval- and tool-augmented LLMs can meaningfully assist users at scientific facilities. By injecting accurate, facility-specific documentation into prompts, CALMS reduces hallucinations and increases response relevance and completeness, directly addressing the core challenge of operating complex instruments. The ability to interface with scientific software and hardware enables LLMs not only to advise but to act, as shown by the diffractometer demonstration. GPT-3.5 outperformed the open-source Vicuna in completeness and reliability of structured tool use, highlighting current capability gaps. Nonetheless, advances in instruction tuning, RLHF, and prompting suggest the gap may narrow, enabling broader adoption of open models. The approach supports knowledge transfer, reduces training overhead for new users, and could accelerate experimental planning and execution. The work underscores the importance of high-quality documentation, effective RAG pipelines, and robust tool schemas to ensure faithful, safe, and reproducible assistance.
Conclusion
The paper introduces CALMS, a context- and tool-aware LLM framework for scientific facilities, and demonstrates its utility in experimental planning, operational guidance, and autonomous execution of instrument tasks. With appropriate context retrieval, CALMS provides relevant, truthful, and complete answers, and, when paired with structured tool interfaces, can execute end-to-end workflows (e.g., diffractometer alignment). GPT-3.5 exhibited stronger completeness and tool-use reliability than the tested open-source model. Future work includes improving open-source model adherence to structured tool calls, expanding context windows and retrieval quality, integrating e-log data, enhancing safety and provenance, and progressing toward fully autonomous, robust experimental workflows across diverse instruments and facilities.
Limitations
- Dependence on accurate, comprehensive documentation; limited context windows prevent including all relevant material at once.
- Without retrieved context, models hallucinate or give off-domain answers that may be truthful but unhelpful.
- The open-source models tested (e.g., Vicuna) struggled with strict JSON/structured tool-call formats, limiting autonomous execution.
- Reliance on a proprietary model (GPT-3.5) introduces cost, privacy, and transparency constraints; cloud inference may add operational overhead.
- Evaluation non-determinism and the limited task scope (specific facilities and instruments) may affect generalizability.
- Fine-tuning requires substantial compute and expertise; RAG quality depends on the embedding models, chunking strategy, and vector-store retrieval parameters.