A dynamic knowledge graph approach to distributed self-driving laboratories

Chemistry

J. Bai, S. Mosbach, et al.

Jiaru Bai, Sebastian Mosbach, Connor J. Taylor, and their team present an architecture for distributed self-driving laboratories built on a dynamic knowledge graph. Autonomous agents orchestrate design–make–test–analyse cycles across laboratories, culminating in a closed-loop optimization of an aldol condensation reaction across two continents in three days.
Introduction

Self-driving laboratories (SDLs) integrate automation, computation, and analytics to accelerate discovery, but current implementations are often centralized and siloed. Addressing global challenges requires decentralizing SDLs to enable cross-organization resource sharing while ensuring interoperability and robust data provenance. The study proposes a dynamic knowledge graph architecture, within The World Avatar project, to orchestrate heterogeneous resources across labs, standardize research communication via ontologies, and capture FAIR-compliant provenance. The research goal is to demonstrate that a goal-driven, agent-based dynamic knowledge graph can coordinate distributed SDLs to collaboratively optimize a chemical reaction in real time, overcoming challenges of resource orchestration, data sharing, and provenance recording.

Literature Review

Prior efforts have targeted individual aspects of collaborative automation: middleware platforms for resource orchestration (e.g., ChemOS, ESCALATE, HELAO) abstract hardware and coordinate workflows; data standards (XDL for synthesis, AnIML for analysis) facilitate data exchange; and provenance tools (e.g., FAIR data pipelines, experiment knowledge graphs) capture lineage. However, these solutions often remain isolated, with customized data interfaces that hinder interoperability. Semantic web technologies and knowledge graphs have been advocated as a way to unify data and resources across domains. The World Avatar project extends this by enabling executable knowledge through agents, dynamic updates, and cross-domain ontology integration (e.g., OntoCAPE for processes, SAREF for IoT). This work builds on these threads by integrating resource abstraction, workflow orchestration, and provenance in a single dynamic knowledge graph framework to enable distributed SDLs.

Methodology

Architecture: A three-layer dynamic knowledge graph approach integrates real-world labs, a cyber-layer knowledge graph, and active software agents. The framework captures data, software, hardware, and workflow flows for design–make–test–analyse (DMTA) cycles. Agents are defined with inputs/outputs (OntoAgent) and implemented using a derivation agent template to manage iterative, asynchronous workflows via a derived information framework.
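The derived information framework can be illustrated with a minimal sketch: an agent recomputes a derivation's outputs whenever its inputs are marked out of date. The class and attribute names below (Derivation, DerivationAgent, YieldAgent) are illustrative stand-ins, not The World Avatar's actual derivation agent API.

```python
from dataclasses import dataclass, field

@dataclass
class Derivation:
    """A derived-information unit: outputs computed from upstream inputs."""
    inputs: dict                       # entity IRI -> current value
    outputs: dict = field(default_factory=dict)
    up_to_date: bool = False

class DerivationAgent:
    """Generic agent template: recompute a derivation when its inputs change."""
    def process(self, inputs: dict) -> dict:
        raise NotImplementedError

    def update(self, derivation: Derivation) -> None:
        if not derivation.up_to_date:
            derivation.outputs = self.process(derivation.inputs)
            derivation.up_to_date = True

class YieldAgent(DerivationAgent):
    """Toy agent: derives yield from moles of product and limiting reagent."""
    def process(self, inputs):
        return {"yield_pct": 100.0 * inputs["mol_product"] / inputs["mol_limiting"]}

d = Derivation(inputs={"mol_product": 0.93, "mol_limiting": 1.0})
YieldAgent().update(d)
print(d.outputs["yield_pct"])  # 93.0
```

In the real framework the inputs and outputs live in the knowledge graph, and marking an input stale automatically propagates recomputation through the chain of dependent derivations.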

Ontologies and data model: Cross-domain ontologies encode goals (OntoGoal), reactions and experiments (OntoReaction), species (OntoSpecies), design of experiments (OntoDoE), lab digital twins and equipment (OntoLab, OntoVapourtec, OntoHPLC), and process concepts (OntoCAPE, SAREF). ReactionExperiment instances capture ReactionConditions, PerformanceIndicators, input/output chemicals, HPLC reports and chromatograms, and link to equipment settings via ParameterSetting and EquipmentSettings. Species identity and impurities are contextualized with OntoSpecies and OntoLab chemical amount constructs; future work will extend concentration and impurity modeling.
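As a rough illustration of this data model, the snippet below encodes a few ReactionExperiment relations as plain (subject, predicate, object) tuples. The prefixed names are placeholders, not the published ontology IRIs, and a real deployment stores these as RDF triples queried via SPARQL rather than a Python list.

```python
# Triples (subject, predicate, object) sketching how a ReactionExperiment
# instance links conditions, performance indicators, and equipment settings.
# All IRIs here are illustrative placeholders.
triples = [
    ("lab:exp_001",     "rdf:type",                              "OntoReaction:ReactionExperiment"),
    ("lab:exp_001",     "OntoReaction:hasReactionCondition",     "lab:resTime_001"),
    ("lab:resTime_001", "om:hasNumericalValue",                  5.0),
    ("lab:exp_001",     "OntoReaction:hasPerformanceIndicator",  "lab:yield_001"),
    ("lab:yield_001",   "om:hasNumericalValue",                  0.93),
    ("lab:exp_001",     "OntoLab:isAssignedTo",                  "lab:vapourtecR4_cam"),
    ("lab:vapourtecR4_cam", "OntoVapourtec:hasParameterSetting", "lab:setting_001"),
]

def objects(s, p):
    """Tiny pattern match standing in for a SPARQL query."""
    return [o for (s_, p_, o) in triples if s_ == s and p_ == p]

print(objects("lab:exp_001", "OntoReaction:hasPerformanceIndicator"))  # ['lab:yield_001']
```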

Goal-driven workflow: A scientist submits a high-level request (e.g., optimize cost and yield under constraints). The Reaction Optimisation Goal (ROG) Agent instantiates a GoalSet and decomposes objectives. For each participating lab, a Goal Iteration Derivation is created and executed by the Reaction Optimisation Goal Iteration (ROGI) Agent, which orchestrates agents for DoE, scheduling, hardware control, and post-processing. The DoE Agent (e.g., TSEMO-based) proposes new experiments using prior data and stock availability. The Schedule Agent selects and configures available hardware (digital twins) and triggers equipment agents (Vapourtec, HPLC). The Post-Processing Agent analyzes HPLC chromatograms to compute objectives, which feed back to the ROG Agent for evaluation of goal progress and resource usage. Iterations continue until goals are met or resources are exhausted.
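The iteration loop can be sketched as follows, with toy stand-ins for the DoE, equipment, and post-processing agents. The campaign used TSEMO rather than random sampling, and real flow experiments rather than the simulated yield model used here purely for illustration.

```python
import random

def doe_propose(history):
    """Stand-in for the DoE Agent (the paper uses TSEMO); random suggestions here."""
    return {"acetone_eq": random.uniform(1, 40), "temp_C": random.uniform(20, 70)}

def run_experiment(conditions):
    """Stand-in for Schedule/Vapourtec/HPLC agents: a fake temperature-driven yield."""
    return {"yield_pct": min(95.0, 1.3 * conditions["temp_C"])}

def goal_met(history, target_yield=90.0):
    return any(r["yield_pct"] >= target_yield for _, r in history)

def optimisation_campaign(max_iterations=50, seed=0):
    """ROGI-style loop: propose, execute, post-process, then check the goal."""
    random.seed(seed)
    history = []
    for _ in range(max_iterations):
        cond = doe_propose(history)       # DoE Agent
        result = run_experiment(cond)     # Schedule + equipment agents
        history.append((cond, result))    # Post-Processing feeds the graph
        if goal_met(history):             # ROG Agent evaluates progress
            break
    return history

history = optimisation_campaign()
print(len(history), max(r["yield_pct"] for _, r in history))
```

In the actual system each step is a derivation in the knowledge graph, so results from one lab immediately become prior data for the other lab's next DoE proposal.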

Deployment: The triplestore and file server are internet-accessible; agents run in Docker containers across sites. Hardware-controlling agents are deployed within labs for security, pushing data to and pulling instructions from the knowledge graph. Agents register themselves and operate autonomously with asynchronous communication, designed to tolerate network disruptions and resume upon reconnection.
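A minimal sketch of this fault tolerance, assuming a simple poll-and-retry policy with exponential backoff (the source describes agents resuming after disruptions but does not specify the concrete retry scheme):

```python
import time

class TransientNetworkError(Exception):
    pass

def poll_knowledge_graph(attempt):
    """Stand-in for an HTTP/SPARQL poll; simulates two dropped connections."""
    if attempt < 2:
        raise TransientNetworkError("simulated dropout")
    return {"pending_derivations": ["derivation/rogi/001"]}

def poll_with_backoff(max_retries=5, base_delay=0.01):
    """Retry with exponential backoff so an agent resumes where it left off
    once the knowledge graph becomes reachable again."""
    for attempt in range(max_retries):
        try:
            return poll_knowledge_graph(attempt)
        except TransientNetworkError:
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("knowledge graph unreachable")

tasks = poll_with_backoff()
print(tasks["pending_derivations"])  # ['derivation/rogi/001']
```

Because all state lives in the graph rather than in the agents, a reconnecting agent simply picks up whatever derivations are still marked pending.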

Experimental demonstration: Two flow chemistry platforms in Cambridge and Singapore executed an aldol condensation between benzaldehyde and acetone, catalyzed by NaOH, targeting multi-objective optimization of yield and run material cost. Variables: molar equivalents of acetone and NaOH (relative to benzaldehyde), residence time, and temperature. No prior data were provided; initial conditions were random, followed by Bayesian optimization with TSEMO. Control conditions were validated across both labs prior to optimization.
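A toy evaluation of the cost objective for a single run is sketched below. The prices and the assumption that only material costs count are invented for illustration; the campaign used vendor-specific pricing, and yields were measured by HPLC.

```python
# Illustrative per-mole prices; real values are vendor-dependent.
PRICE_PER_MOL = {"benzaldehyde": 1.00, "acetone": 0.05, "naoh": 0.02}

def run_cost(acetone_eq, naoh_eq, benzaldehyde_mol=1.0):
    """Material cost of one run, scaling linearly with the molar
    equivalents of acetone and NaOH charged relative to benzaldehyde."""
    return benzaldehyde_mol * (PRICE_PER_MOL["benzaldehyde"]
                               + acetone_eq * PRICE_PER_MOL["acetone"]
                               + naoh_eq * PRICE_PER_MOL["naoh"])

print(run_cost(acetone_eq=10, naoh_eq=0.1))  # ≈ 1.0 + 0.5 + 0.002
```

This linear dependence of cost on equivalents is consistent with the variable effects reported in the findings below.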

Hardware details: Cambridge: two Vapourtec R2 pumps, Vapourtec R4 reactor, Gilson GX-271 liquid handler, VICI 4-way switching valve, Shimadzu CBM-20A HPLC with Eclipse XDB-C18; mobile phase water/acetonitrile 80:20 v/v at 2 mL/min, 254 nm detection, 17 min analysis. Singapore: two Vapourtec R2 pumps, Vapourtec R4 reactor, VICI 6-port valve, Agilent 1260 Infinity II with Eclipse XDB-C18, gradient HPLC (0.2 to 1.0 mL/min; acetonitrile:water ramp 5:95 to 95:5 then back; 8 min total), VWD with wavelength switch (248 nm then 228 nm). Reagents as specified with internal standards (biphenyl in Cambridge, naphthalene in Singapore).

Key Findings
  • The dynamic knowledge graph with agents successfully coordinated two geographically separated SDLs (Cambridge and Singapore) to perform a collaborative, real-time, multi-objective closed-loop optimization of an aldol condensation.
  • Data sharing between SDLs occurred in real time, enabling both sites to inform subsequent DoE proposals collaboratively.
  • A total of 65 experiments were conducted during self-optimization, producing a Pareto front for cost vs. yield with a highest observed yield of 93%.
  • Although cost figures are vendor-dependent and not directly comparable to prior literature, the best environment factor (E-factor) and space-time yield achieved (neither a direct optimization target) were 26.17 and 258.175 g L^-1 h^-1 (scaled to a 5 mL benzaldehyde injection), outperforming previously reported values.
  • Variable effects: cost increased linearly with molar equivalents of starting materials; temperature correlated positively with yield; residence time had weak correlation; acetone equivalents above ~30 reduced yield due to increased side-product formation.
  • System robustness: an HPLC fault in Singapore produced an anomalous >3500% apparent yield; the agents flagged it as abnormal, removed it from subsequent DoE, and notified maintainers. The distributed asynchronous design allowed Cambridge to continue optimizing and advancing the Pareto front.
  • The campaign stopped when hypervolume improvement plateaued and equipment needed repurposing; complete provenance was captured in the knowledge graph and used to generate interactive progress visualizations.
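The Pareto bookkeeping behind these findings can be sketched as a non-dominated filter over (cost, yield) pairs, minimising cost while maximising yield; the experiment values below are invented for illustration.

```python
def pareto_front(points):
    """Return the non-dominated (cost, yield) pairs: a point is dominated
    if another point has cost no higher and yield no lower."""
    front = []
    for cost, yld in points:
        dominated = any(c <= cost and y >= yld and (c, y) != (cost, yld)
                        for c, y in points)
        if not dominated:
            front.append((cost, yld))
    return sorted(front)

# Toy (cost, yield %) results; (6.0, 90.0) is dominated by (5.0, 93.0).
experiments = [(3.0, 40.0), (5.0, 93.0), (2.0, 35.0), (6.0, 90.0), (4.0, 70.0)]
print(pareto_front(experiments))  # [(2.0, 35.0), (3.0, 40.0), (4.0, 70.0), (5.0, 93.0)]
```

The campaign's stopping criterion, a plateau in hypervolume improvement, measures how little the region dominated by this front grows between iterations.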

Discussion

The study demonstrates that a dynamic knowledge graph with ontology-driven agents can unify resource orchestration, data sharing, and FAIR provenance across distributed SDLs. By virtualizing hardware as digital twins and encoding workflows as derivations, the system decomposes abstract goals into actionable tasks, coordinates heterogeneous equipment, and records machine-readable provenance suitable for analysis and reproducibility. The collaborative optimization across two labs advanced the Pareto front more rapidly than isolated operation, and the architecture exhibited resilience to hardware failure by isolating faults and maintaining progress elsewhere. Compared to RPC-based or centralized orchestration approaches, the knowledge graph enables real-time workflow assembly, extensibility under the open-world assumption, and a unified data layer that reduces peer-to-peer data transfer overhead. The findings support the feasibility and value of dynamic knowledge graphs as an integration hub for globally connected SDLs and provide a foundation for informed machine learning and cross-lab reasoning over DMTA cycles.

Conclusion

This work presents the first end-to-end demonstration of a goal-driven, agent-based dynamic knowledge graph coordinating distributed self-driving laboratories. New ontologies (OntoReaction, OntoDoE, OntoLab, and equipment-specific ontologies) and the derived information framework enable real-time orchestration of experiments, data analysis, and provenance capture across geographically separated labs. The case study on aldol condensation produced a strong Pareto front, high yield, and improved secondary metrics, while showcasing robustness to hardware faults. The approach is general and extensible to other DMTA domains. Future directions include: improving network resilience via local graph deployments and seamless reconnection; enhancing QC with human-in-the-loop strategies; federating SDLs with authentication/authorization and capability registries; scaling data performance via ontology-based data access; and expanding ontologies for finer-grained experimental procedures (e.g., ontologizing XDL) and impurity/concentration modeling.

Limitations
  • Provenance granularity: current focus is on inputs/outputs within DMTA steps; detailed operation timing (e.g., robotic motions) is not captured due to API limitations, limiting fine-grained uncertainty analysis.
  • Network and deployment: distributed operation requires robustness to internet disruptions; while agents can resume after cut-offs, further work is needed for on-demand, localized graph deployment to ensure uninterrupted operation.
  • Data quality and validation: adding new setups requires control conditions; two control points may be insufficient for complex, high-dimensional reactions. Cross-workflow validation strategies are needed.
  • Quality control monitoring: automated detection of abnormal data (e.g., instrument faults) needs strengthening; a human-in-the-loop approach is advisable.
  • Scalability and performance: large-scale data (e.g., ORD-scale) may challenge triple-store performance; hybrid approaches (ontology-based data access over relational databases) are needed.
  • Security and federation: a federated model with local data/digital twins needs authentication/authorization and standardized capability exposure.
  • Chemical representation: impurity and concentration modeling is simplified; more comprehensive ontological treatment is deferred to future work.