Chemistry
Autonomous chemical research with large language models
D. A. Boiko, R. MacKnight, et al.
Discover Coscientist, an innovative AI system powered by GPT-4 that is capable of autonomously designing, planning, and executing complex chemical experiments. Witness how it optimizes palladium-catalyzed cross-couplings and accelerates scientific research, as developed by Daniil A. Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes.
~3 min • Beginner • English
Introduction
Large language models (LLMs), especially transformer-based systems such as GPT-4, have rapidly advanced and demonstrated strong problem-solving abilities across domains. In parallel, laboratory automation has progressed in autonomous discovery, optimization of reactions, flow systems, and mobile platforms. Combining these trends motivates a system that can autonomously design and execute experiments. The authors pose three questions: What are the capabilities of LLMs in the scientific process? What degree of autonomy can be achieved? How can the decisions made by autonomous agents be understood? To address these, they introduce Coscientist, a multi-LLM agent capable of autonomous design, planning, and execution of complex experiments by leveraging internet and documentation search, code execution, and robotic experimentation APIs, and they evaluate it on six representative tasks.
Literature Review
The paper situates Coscientist within rapid advances in LLMs and their applications to natural language, biology, chemistry, and code generation, noting improvements from scaling and techniques such as reinforcement learning from human feedback. GPT-4 exhibits strong performance on standardized tests and problem-solving (including chemistry-related tasks). Concurrently, chemical research automation has advanced in autonomous reaction discovery and optimization, automated flow platforms, and mobile robotic chemists. The work relates to contemporaneous autonomous agent frameworks (Auto-GPT, BabyAGI, LangChain) and chemistry-focused agents (ChemCrow). For information retrieval and documentation use, the paper discusses traditional inverted indices, vector databases with neural embeddings, and approximate nearest neighbor search, highlighting modern transformer-based retrieval that better handles synonyms. The paper also references prior datasets and studies for reaction optimization and chemoinformatics benchmarks used for evaluating reasoning and optimization behavior.
Methodology
System architecture: Coscientist is organized around a GPT-4 "Planner" that orchestrates tools via four commands defining its action space: GOOGLE (web search and browsing through a Web Searcher LLM leveraging Google Search API), DOCUMENTATION (retrieval-augmented reading and summarization of API documentation), PYTHON (code execution in an isolated Docker container with self-correction of errors), and EXPERIMENT (automation via robotic APIs or output of procedures for manual execution). The Planner receives user prompts and command outputs as chat messages and composes plans accordingly.
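To make the command-dispatch architecture concrete, a minimal sketch of such a planner loop is shown below. All names (run_planner_llm, HANDLERS, the "COMMAND: payload" syntax) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a planner-dispatch loop in the spirit of the architecture
# described above. All names and the command syntax are illustrative
# assumptions, not the authors' code.

def run_planner_llm(messages):
    """Call the Planner LLM (e.g., GPT-4) on the chat history; stubbed here."""
    raise NotImplementedError("connect a chat-completion API of your choice")

HANDLERS = {
    "GOOGLE": lambda q: f"[web search results for: {q}]",
    "DOCUMENTATION": lambda q: f"[relevant API documentation for: {q}]",
    "PYTHON": lambda code: "[stdout/stderr from running the code in a sandbox]",
    "EXPERIMENT": lambda spec: f"[robot/API response for: {spec}]",
}

def coscientist_loop(user_prompt: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = run_planner_llm(messages)            # Planner proposes the next action
        messages.append({"role": "assistant", "content": reply})
        command, _, payload = reply.partition(":")   # e.g. "GOOGLE: Suzuki coupling conditions"
        handler = HANDLERS.get(command.strip().upper())
        if handler is None:                          # no recognized command -> treat as final answer
            return reply
        # Feed the tool output back to the Planner as the next chat message.
        messages.append({"role": "user", "content": handler(payload.strip())})
    return messages[-1]["content"]
```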
Web search module evaluation: A benchmark of seven compounds (acetaminophen, aspirin, benzoic acid, ethyl acetate, ibuprofen, nitroaniline, phenolphthalein) was used to assess synthesis-planning quality across models: search-gpt-4, search-gpt-3.5-turbo, GPT-4, GPT-4-0314, GPT-3.5-turbo, Claude-1.3, and Falcon-40B-Instruct. Outputs were labeled on a 1–5 scale (5: detailed, chemically accurate procedure; 4: chemically accurate without quantities; 3: correct chemistry without step-by-step; 2: vague/unfeasible; 1: incorrect/failure). Scores below 3 indicate failure. Models were prompted to provide detailed syntheses; browsing models could ground answers via web search.
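The scoring rubric lends itself to simple per-model aggregation (mean score and fraction of acceptable outputs). The sketch below uses placeholder scores, not the paper's data; only the ≥3 acceptability threshold follows the rubric above.

```python
# Sketch of aggregating the 1-5 synthesis-planning labels per model.
# Scores and the selection of models are placeholders, not the paper's data;
# only the acceptability threshold (score >= 3) follows the rubric above.
from statistics import mean

labels = {  # model -> scores across the seven benchmark compounds (dummy values)
    "search-gpt-4":  [5, 5, 3, 2, 3, 5, 5],
    "gpt-3.5-turbo": [3, 4, 2, 2, 1, 3, 3],
}
for model, scores in labels.items():
    acceptable = sum(s >= 3 for s in scores) / len(scores)
    print(f"{model}: mean score = {mean(scores):.2f}, acceptable = {acceptable:.0%}")
```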
Documentation search module: Documentation sections for the Opentrons OT-2 Python API were embedded using OpenAI ada embeddings. At inference, Planner queries were embedded and relevant sections retrieved via distance-based vector search to guide correct API usage (e.g., heater–shaker module methods). For Emerald Cloud Lab (ECL) Symbolic Lab Language (SLL), three investigations were performed: (1) prompt-to-function (mapping user intent to the correct SLL function), (2) prompt-to-SLL (summarizing full function documentation via GPT-4 and generating valid SLL code), and (3) prompt-to-samples (retrieving relevant stock solutions from a catalogue of 1,110 model samples). Generated SLL code (e.g., ExperimentHPLC) was executed at ECL on a caffeine standard; instrument parameters (column, mobile phases, gradients) were estimated by ECL software.
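A distance-based retrieval step of this kind can be sketched as follows. The embedding function is a placeholder (the paper reports using OpenAI ada embeddings), and the documentation snippets are illustrative.

```python
# Sketch of distance-based retrieval over embedded documentation sections.
# embed() is a placeholder for a real embedding API call; the section texts
# and the vector dimension are illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; replace with a real embedding API call."""
    rng = np.random.default_rng(sum(map(ord, text)))  # dummy, text-derived vector
    return rng.normal(size=1536)

sections = [
    "heater_shaker.set_and_wait_for_shake_speed(rpm): starts shaking ...",
    "pipette.transfer(volume, source, dest): moves liquid between wells ...",
    "plate_reader.read(): returns absorbance spectra for each well ...",
]
section_vecs = np.stack([embed(s) for s in sections])
section_vecs /= np.linalg.norm(section_vecs, axis=1, keepdims=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = section_vecs @ q                 # cosine similarity to each section
    return [sections[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("how do I shake the target plate at 500 rpm?"))
```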
Hardware control: With no internet access, Coscientist controlled an Opentrons OT-2 liquid handler using documentation supplied via the system prompt and vectorized API pages. It executed layout tasks (drawing shapes, coloring rows/diagonals) in a 96-well plate. Integration with a UV–Vis plate reader was added via a "UVVIS" command. In a toy task with three colored wells (red, yellow, blue), Coscientist prepared samples, requested UV–Vis reads, processed the returned NumPy array of spectra via generated Python code to identify peak wavelengths, and inferred colors after a guiding prompt to reason about absorbance.
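The spectrum-processing step amounts to locating each well's absorbance maximum and mapping it to a color. The sketch below assumes a wavelength grid, array layout, and color-assignment cutoffs for illustration; the paper only states that peak wavelengths were identified and colors inferred.

```python
# Sketch of inferring well colors from a UV-Vis absorbance array.
# The wavelength grid, array layout, and color cutoffs are assumptions made
# for illustration, not values reported in the paper.
import numpy as np

wavelengths = np.arange(350, 751, 10)              # nm, illustrative grid
spectra = np.random.rand(3, wavelengths.size)      # rows: red, yellow, blue wells (dummy data)

def infer_color(absorbance: np.ndarray) -> str:
    peak_nm = wavelengths[np.argmax(absorbance)]   # wavelength of maximum absorbance
    # A solution's apparent color is roughly complementary to what it absorbs.
    if peak_nm < 480:
        return "yellow/orange (absorbs violet-blue)"
    if peak_nm < 560:
        return "red/purple (absorbs green)"
    return "blue/green (absorbs orange-red)"

for row in spectra:
    print(infer_color(row))
```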
Integrated chemical experiment design: The system planned and executed Suzuki–Miyaura and Sonogashira cross-couplings using an OT-2 with a heater–shaker module (released after GPT-4’s training cutoff). The source plate contained stock solutions (e.g., phenyl acetylene, phenylboronic acid, aryl halides, catalysts, bases, solvent); the target plate sat on the heater–shaker. Coscientist searched the internet for conditions and stoichiometries, selected appropriate coupling partners (for example, it never selected phenylboronic acid for the Sonogashira reaction), computed reagent volumes, and generated Python protocols (see the sketch below). After initially using an incorrect heater–shaker method name, it consulted the documentation via the Docs searcher, corrected the code, and ran the reactions successfully. GC–MS analyses of the reaction mixtures confirmed product formation.
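An OT-2 protocol in the spirit of the one described above might look like the following sketch. Deck slots, labware names, volumes, and timings are illustrative assumptions; the actual generated protocols are provided with the paper.

```python
# Illustrative OT-2 protocol using the heater-shaker module (Python API v2).
# Slots, labware, volumes, and timings are assumptions, not the generated code.
from opentrons import protocol_api

metadata = {"apiLevel": "2.13"}  # heater-shaker support requires API level >= 2.13

def run(protocol: protocol_api.ProtocolContext):
    hs = protocol.load_module("heaterShakerModuleV1", 1)
    target = hs.load_labware("corning_96_wellplate_360ul_flat")       # reaction plate
    source = protocol.load_labware("nest_12_reservoir_15ml", 2)       # stock solutions
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", 3)
    p300 = protocol.load_instrument("p300_single_gen2", "right", tip_racks=[tips])

    hs.close_labware_latch()
    # Dispense aryl halide, coupling partner, catalyst, and base stocks (dummy layout).
    for i, well in enumerate(["A1", "A2", "A3", "A4"]):
        p300.transfer(50, source.wells()[i], target[well])

    # Mix the reaction plate on the heater-shaker, then release it.
    hs.set_and_wait_for_shake_speed(500)
    protocol.delay(minutes=10)
    hs.deactivate_shaker()
    hs.open_labware_latch()
```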
Chemical reasoning and optimization experiments: Two fully enumerated datasets were used: (1) Suzuki reactions in flow (Perera et al.), varying ligands, bases/reagents, and solvents; (2) Buchwald–Hartwig C–N couplings (Doyle dataset), varying ligands, additives, and bases. The task was posed as a game to maximize yield over up to 20 iterations, with the agent required to output actions and reasoning in a strict JSON schema. The normalized advantage and normalized maximum advantage (NMA) metrics quantified performance versus average and best possible yields. Conditions tested included GPT-4 with 10 prior data points versus GPT-4 and GPT-3.5 without prior information; for the Doyle dataset, GPT-4 was tested with compounds represented by names versus SMILES strings. Bayesian optimization was run as a baseline per a specified procedure in the Supplementary Information.
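One natural formalization of these metrics, consistent with the description above, is sketched below; the exact definitions used in the paper are given in its Supplementary Information.

```python
# Sketch of the advantage metrics described above, assuming the natural
# formalization: advantage relative to the dataset's average yield, normalized
# by the gap between the best and average yields. Data values are dummies.
import numpy as np

def normalized_advantage(selected_yields, all_yields):
    """Per-iteration advantage of the agent's picks over the average yield."""
    avg, best = np.mean(all_yields), np.max(all_yields)
    return (np.asarray(selected_yields) - avg) / (best - avg)

def normalized_maximum_advantage(selected_yields, all_yields):
    """Advantage of the best yield found so far at each iteration (NMA)."""
    avg, best = np.mean(all_yields), np.max(all_yields)
    running_best = np.maximum.accumulate(np.asarray(selected_yields))
    return (running_best - avg) / (best - avg)

# Example: a fully enumerated reaction space and five agent picks.
space = np.array([5.0, 12.0, 33.0, 47.0, 60.0, 71.0, 88.0])
picks = np.array([12.0, 47.0, 33.0, 71.0, 88.0])
print(normalized_advantage(picks, space))
print(normalized_maximum_advantage(picks, space))
```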
Safety and disclosures: The authors conducted a brief dual-use safety study (Supplementary Information) and disclose that GPT-4 assisted with grammar/typos in the preprint writing.
Key Findings
- Web-search-grounded GPT-4 (search-gpt-4) achieved the highest synthesis-planning performance, reaching maximum scores across all trials for acetaminophen, aspirin, nitroaniline, and phenolphthalein. It was the only model to achieve the minimum acceptable score (≥3) for ibuprofen. Performance was lower for ethyl acetate and benzoic acid, likely due to their ubiquitous, variable procedures. Non-browsing GPT-4 variants and Claude-1.3 outperformed GPT-3.5 and Falcon-40B-Instruct; all non-browsing models produced incorrect ibuprofen syntheses.
- Grounding via web search reduced hallucinations; the GPT-3.5-powered Web Searcher lagged mainly because it failed to follow output-format instructions.
- Documentation-augmented retrieval with ada embeddings enabled correct use of the Opentrons heater–shaker API and valid ECL SLL code generation (e.g., ExperimentHPLC), which executed successfully in the cloud lab. Prompt-to-function, prompt-to-SLL, and prompt-to-samples workflows correctly mapped intents to functions and retrieved relevant samples from a 1,110-item catalogue.
- Hardware control tasks on the OT-2 (drawing shapes/patterns in a 96-well plate) were executed accurately without internet access. In an integrated UV–Vis task, the system prepared samples, analyzed spectra, and correctly identified well colors after a guiding prompt to reason about absorbance maxima.
- Integrated cross-coupling experiments: Coscientist autonomously designed and executed Suzuki–Miyaura and Sonogashira reactions, selected chemically appropriate partners and bases (e.g., preference patterns for DBU with PEPPSI-IPr, aryl halide selection differences between reactions), corrected API usage via documentation, and successfully ran protocols. GC–MS confirmed target product formation: Suzuki showed a chromatogram signal at 9.53 min with mass spectra matching biphenyl (including fragment at 76 Da); Sonogashira showed a signal at 12.92 min with matching molecular ion and a fragmentation pattern close to the reference.
- Optimization experiments: Normalized advantage increased over iterations, indicating learning from prior outcomes. GPT-4 with 10 prior data points made better initial guesses than GPT-4 without prior data; both converged to similar NMA. GPT-3.5 produced fewer valid iterations due to JSON-formatting failures. Against a Bayesian optimization baseline, GPT-4 approaches achieved higher NMA and normalized advantage (noting differences in exploration/exploitation; NMA is the preferred comparison). For the Buchwald–Hartwig dataset, GPT-4 using names vs SMILES achieved similar performance; the model could reason about electronic effects even with SMILES-only inputs.
- Source usage analysis showed frequent reliance on Wikipedia, with ACS and RSC journals among the top visited sources when planning reactions.
- Overall, Coscientist demonstrated versatile, explainable, semi-autonomous experimental design and execution across multiple hardware modules and data sources.
Discussion
The findings show that LLMs, when augmented with tool use (web search, documentation retrieval, code execution, and robotic APIs), can participate meaningfully in the scientific process: designing, planning, and executing experiments. Grounding via web search and documentation reduces hallucinations and enables correct interface with real-world hardware. The degree of autonomy achieved spans from planning and generating executable protocols to self-correction via documentation lookup; minimal human intervention was limited to physical transfers (e.g., manual plate moves) rather than decision-making. The system’s capacity to justify reagent choices and reflect on reactivity/selectivity provides a window into its decision process, supporting explainability. In optimization tasks, increasing normalized advantage and competitive NMA versus Bayesian optimization indicates that the agent can iteratively use prior results to improve choices, even with sparse sampling. Collectively, the results address the initial research questions by demonstrating capabilities across the scientific workflow, quantifying autonomous performance, and illustrating how rationale and source tracking can make decisions interpretable.
Conclusion
This work introduces Coscientist, a GPT-4-driven agent that integrates information retrieval, documentation understanding, code execution, and laboratory automation to autonomously design and perform complex chemical experiments. It achieves strong synthesis-planning performance when grounded by web search, correctly interprets and applies API documentation to generate valid protocols (including ECL SLL), controls laboratory hardware to execute multi-module workflows, and confirms reaction outcomes analytically. In data-driven optimization, it leverages prior results to improve yields efficiently and compares favorably to a Bayesian optimization baseline under the reported metrics. Future directions include integrating proprietary reaction databases (e.g., Reaxys, SciFinder) to enhance multistep synthesis planning, adopting advanced prompting strategies (ReAct, Chain-of-Thought, Tree-of-Thoughts), expanding automation to reduce remaining manual steps, improving quality control in cloud labs, refining safety and dual-use governance, and broadening support for diverse hardware ecosystems.
Limitations
- Subjective labeling: The synthesis benchmark scoring (1–5) is inherently subjective, which can affect comparability.
- Failure modes and variability: Non-browsing models frequently hallucinated steps (e.g., ibuprofen), and browsing GPT-3.5 often failed due to format adherence issues. The web-grounded model underperformed on ubiquitous compounds (ethyl acetate, benzoic acid), likely due to variability in procedures online.
- Partial automation: The integrated experiment workflow required manual plate movements; full physical automation was not achieved.
- Guidance dependence: Some tasks required guiding prompts (e.g., reasoning about UV–Vis absorbance–color relationships) to reach correct conclusions.
- Documentation reliance and corrections: The system initially used an incorrect heater–shaker method name and needed documentation retrieval to self-correct, indicating sensitivity to API details.
- Experimental artifacts: In cloud lab HPLC runs, an air bubble was injected with the analyte, highlighting the need for automated quality control.
- Data and training uncertainty: It is unclear whether GPT-4 training data contained information from the optimization datasets; if so, prior knowledge might influence initial guesses. GPT-3.5 frequently failed to produce valid JSON outputs, limiting its evaluable iterations.
- Baseline comparison caveats: Differences in exploration/exploitation balance complicate direct comparisons to Bayesian optimization; NMA is preferred but conclusions are context-dependent.
- Source bias: Web searches frequently returned Wikipedia; while useful, this may bias information sources relative to primary literature.