ChatMOF: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models

Chemistry

Y. Kang and J. Kim

ChatMOF is an AI system that uses large language models to predict and generate metal-organic frameworks directly from natural-language queries, achieving high accuracy on search, prediction, and generation tasks. Research conducted by Yeonghun Kang and Jihan Kim.

Introduction
The study investigates how large language models (LLMs) can be integrated with databases and machine learning to advance materials science, specifically for metal-organic frameworks (MOFs). While LLMs based on transformer architectures have shown strong capabilities in zero-shot and few-shot reasoning and have been used across chemistry, medicine, and biology, their application to materials science is limited by the complexity of materials representations and scarcity of domain-specific training data. The research question is whether an autonomous LLM-based agent can interpret natural-language queries to retrieve data, predict properties, and inversely design MOFs with target properties. The work introduces ChatMOF, an AI system that coordinates tools to predict MOF properties from text queries and to generate MOFs with specified properties, aiming to lower barriers for non-experts and accelerate materials discovery.
Literature Review
Prior work on LLMs includes foundational transformer models and autonomous agent frameworks (e.g., ReAct, MRKL, Auto-GPT) and their use in scientific domains. In chemistry and materials, efforts have often focused on text mining and response generation rather than direct material generation. Databases such as CoREMOF and QMOF provide geometric and quantum/electronic properties, while MOFkey and DigiMOF enable extraction of building blocks, topology, interpenetration, and synthesis conditions. Inverse design of porous materials has used GANs, diffusion models, variational autoencoders, genetic algorithms, and reinforcement learning, but atom-by-atom MOF design remains challenging due to structural complexity. Recent approaches combine LLMs with external tools through prompt engineering and tool use, highlighting the potential for LLMs to orchestrate databases and ML predictors for materials tasks.
Methodology
ChatMOF is an autonomous LLM-agent system comprising three components: (1) Agent, (2) Toolkit, and (3) Evaluator. Following the ReAct and MRKL paradigms, the agent processes user queries through data analysis, action selection, input management, and result observation, iterating plans until a final answer is produced. Toolkit categories include: (a) a table-searcher for lookups in curated databases (e.g., CoREMOF with Zeo++-derived geometric descriptors; QMOF for electronic properties; MOFkey for building blocks, topology, and interpenetration; DigiMOF for text-mined synthesis conditions), (b) a predictor using pretrained and fine-tuned ML models, (c) a generator for inverse design, and (d) utilities such as calculators, file I/O, visualizers, internet search, unit converters, and a Python REPL. The system uses LangChain to integrate the LLM with its tools and adopts the Atomic Simulation Environment (ASE) for structure manipulation and analysis.

Prediction: ChatMOF selects an appropriate fine-tuned MOFTransformer model based on the query. MOFTransformer encodes local (atoms/bonds) and global (surface area, topology) features, is pretrained on one million hypothetical MOFs, and is fine-tuned for target properties. The agent chooses the correct fine-tuned model, identifies the materials to evaluate, executes predictions, and post-processes outputs (e.g., converting log-scale predictions to linear scale) before table searching to return final results.

Generation (inverse design): ChatMOF employs a genetic algorithm (GA) in which MOFs are represented as text-based genes composed of a topology and building blocks (e.g., HKUST-1 as tbo+N17+N10). The agent defines the objective (maximize/minimize/target), selects a loss function, chooses parent genes from databases, and performs selection, crossover, and mutation to generate children genes. Children are converted into structure files (using PORMAKE), evaluated via the predictor, and iterated over a fixed number of cycles to approach target properties.
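The text-gene encoding lends itself to simple string operations. Below is a minimal sketch of GA crossover and mutation over such genes, assuming the `tbo+N17+N10`-style notation described above; the helper names (`parse_gene`, `crossover`, `mutate`) and the building-block pool are illustrative, not ChatMOF's actual implementation.

```python
import random

# A MOF "gene" is a text string of topology + building blocks,
# e.g. "tbo+N17+N10" (HKUST-1 in the paper's notation).

def parse_gene(gene: str):
    topology, *blocks = gene.split("+")
    return topology, blocks

def crossover(parent_a: str, parent_b: str, rng: random.Random) -> str:
    """Combine building blocks from two parents that share a topology."""
    topo_a, blocks_a = parse_gene(parent_a)
    topo_b, blocks_b = parse_gene(parent_b)
    assert topo_a == topo_b, "crossover is defined within one topology"
    child_blocks = [rng.choice(pair) for pair in zip(blocks_a, blocks_b)]
    return "+".join([topo_a] + child_blocks)

def mutate(gene: str, block_pool, rate: float, rng: random.Random) -> str:
    """Randomly swap building blocks for alternatives from a pool."""
    topology, blocks = parse_gene(gene)
    blocks = [rng.choice(block_pool) if rng.random() < rate else b
              for b in blocks]
    return "+".join([topology] + blocks)

rng = random.Random(0)
child = crossover("tbo+N17+N10", "tbo+N535+N234", rng)
child = mutate(child, ["N17", "N10", "N535", "N234", "E186"],
               rate=0.1, rng=rng)
print(child)  # a recombined candidate gene to pass to the predictor
```

In a full run, each child gene would be converted into a structure file and scored by the ML predictor before the next round of selection.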
The GA is configured over nine topologies (pcu, dia, acs, rtl, cds, srs, ths, bcu, fsc), with batches of 100 parents and 100 children per topology across three cycles. The evaluator consolidates outputs to produce final answers.

Prompt engineering and agent configuration: Prompts for the agent and tools follow a question/thought/input/observation/final-thought/final-answer pattern, with tool-specific formats (e.g., the predictor expects property/material; the generator expects objective/search look-up table/GA steps). Prompts include exemplars and lists of predictable properties to guide responses. ChatMOF experiments used LLMs (GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-16k, Llama-2-7B/13B-chat) without fine-tuning, with temperature set to 0.1 and up to three code-correction attempts.

Data and software details: CoREMOF geometric properties were computed with Zeo++ at high accuracy (-ha) using a 3.31 Å hard-sphere probe (N2). Predictor models derive from MOFTransformer fine-tuned on literature datasets. Example demonstrations used ChatMOF v0.2.0; accuracy measurements used v0.0.0. Utilities include unit-conversion workflows that may require additional properties (e.g., density) fetched from tables.

Simulation for validation: Generated structures were optimized with UFF (Materials Studio Forcite) without partial charges. H2 adsorption was computed with RASPA using a united-atom H2 model with pseudo-Feynman–Hibbs corrections at 77 K; UFF with Lorentz–Berthelot mixing rules and a 12.8 Å cutoff for non-H2 interactions; GCMC at 100 bar and 77 K with 5,000 equilibration and 10,000 production cycles. Accessible surface area after optimization was computed with Zeo++.
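The fixed-cycle selection loop described above can be sketched compactly. In this sketch, `predict_property` is a deterministic dummy standing in for the fine-tuned MOFTransformer predictor, and `make_children` stands in for the LLM-driven crossover/mutation step; both names are illustrative, not part of ChatMOF's actual API.

```python
# Sketch of a fixed-cycle GA toward a target property value.

def predict_property(gene: str) -> float:
    # Dummy scorer: a real run would assemble the structure with PORMAKE
    # and evaluate it with the ML predictor.
    return float(sum(map(ord, gene)) % 1000)

def run_generation(parents, target, make_children, n_cycles=3, n_keep=100):
    """Iterate selection for a fixed number of cycles toward `target`."""
    for _ in range(n_cycles):
        pool = parents + make_children(parents)
        # "Target" objective: minimize distance to the desired value.
        pool.sort(key=lambda g: abs(predict_property(g) - target))
        parents = pool[:n_keep]
    return parents

# Toy usage: two starting genes and a trivial placeholder child generator.
best = run_generation(
    parents=["dia+N719+E186", "pcu+N17+N10"],
    target=500.0,
    make_children=lambda ps: [p + "+X" for p in ps],  # placeholder operator
    n_cycles=3,
    n_keep=2,
)
print(best[0])
```

The reported configuration corresponds to `n_keep=100` children and parents per topology over three cycles, repeated for each of the nine topologies.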
Key Findings
- ChatMOF, using GPT-4, achieved high accuracies across tasks: 96.9% (search), 95.7% (prediction), and 87.5% (generation), excluding token-limit cases (100 queries each for search and prediction; 10 for generation).
- GPT-4 outperformed GPT-3.5-turbo (95% search, 91% prediction, 77.8% generation), and GPT-3.5-turbo-16k did not reduce token-limit errors, indicating that reasoning/coding strategy matters more than maximum token count.
- Table-searcher and predictor workflows: an example retrieval for MOF LITDAV showed a density of 1.01002 g/cm³, below the dataset mean of ~1.3732 g/cm³ (range ~0.0569 to 7.4557 g/cm³). A predictor example identified BAZGAM_clean as having the highest predicted H2 diffusivity at 77 K and 1 bar, 0.003017684 cm²/s (converted from log scale).
- Multi-tool pipeline example: for the XEGKUR CO2 Henry coefficient at 298 K, ChatMOF predicted 0.0265775 mol/kg·Pa, fetched a density of 1.03463 g/cm³, converted it to 0.00103463 kg/cm³, and reported 2.75×10⁻⁵ mol/cm³·Pa.
- Inverse design: ChatMOF generated MOFs meeting user objectives. For maximizing accessible surface area (ASA), it produced rtl+N535+N234 with a predicted ASA of 6411.28 m²/g; after optimization, Zeo++ gave 7647.62 m²/g, ranking near the top of CoREMOF. For a target H2 uptake of ~500 cm³/cm³ (100 bar, 77 K), dia+N719+E186 achieved a predicted 499.998 cm³/cm³, validated via RASPA at 495.823 cm³/cm³.
- Error analysis: token-limit errors often arose from verbose code outputs (e.g., printing entire tables) and comparative queries; logic errors stemmed from flawed strategies (e.g., wrong model selection or post-processing). GPT-4 demonstrated better planning (e.g., distribution-based comparisons) than GPT-3.5-turbo.
- GA execution differences: GPT-3.5-turbo generated children highly overlapping with parents (near duplicates), while GPT-4 achieved ~30% overlap, though child counts and topology consistency varied due to token and generation limits.
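The multi-tool Henry-coefficient conversion reported above is plain unit arithmetic and is easy to verify:

```python
# Reproducing the XEGKUR unit conversion from the findings:
# (mol/kg·Pa) × (kg/cm³) = mol/cm³·Pa
henry_mol_per_kg_pa = 0.0265775      # predicted CO2 Henry coefficient at 298 K
density_g_per_cm3 = 1.03463          # density from the table lookup
density_kg_per_cm3 = density_g_per_cm3 / 1000.0   # 0.00103463 kg/cm³
henry_mol_per_cm3_pa = henry_mol_per_kg_pa * density_kg_per_cm3
print(f"{henry_mol_per_cm3_pa:.2e}")  # 2.75e-05, matching the reported value
```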
Discussion
The results demonstrate that an LLM-orchestrated system can reliably parse natural-language queries, select appropriate databases and models, and deliver accurate MOF property retrievals, predictions, and inverse-designed structures. By combining reasoning capabilities of GPT-4 with curated MOF databases, MOFTransformer predictors, and GA-based generators, ChatMOF addresses key bottlenecks in materials informatics: bridging unstructured language to structured computational workflows and enabling inverse design from high-level specifications. High accuracies on search and prediction tasks validate the planning and tool-use framework, while successful generation of MOFs that meet target properties shows practical feasibility of LLM-guided inverse design. These advances can lower expertise barriers, accelerate hypothesis testing, and streamline materials discovery workflows in MOF research.
Conclusion
ChatMOF introduces a modular AI framework that integrates an LLM agent with databases, machine-learning predictors, and a genetic algorithm generator to predict and design MOFs directly from natural-language queries. It achieves high accuracy on search and prediction tasks and demonstrates successful inverse design of structures meeting user-specified property targets, validated by simulation. The system highlights the potential of LLMs as coordinators for complex, multi-step materials workflows. Future work should focus on improving LLM planning and coding strategies to reduce token/logic errors, expanding training data and fine-tuned model coverage, increasing gene diversity and the number of generations/children, broadening topology sets, and integrating more robust, code-based GA components to enhance scalability, diversity, and reliability of generated structures.
Limitations
- Token input/output limits restrict the number of parents and children considered (~100), reducing gene diversity compared with conventional GA studies that handle orders of magnitude more structures per generation.
- Limited topologies (nine) and cycles due to resource/time constraints; generated children can exhibit inconsistent topology and variable counts.
- Performance depends on LLM reasoning and coding quality: verbose or suboptimal code can trigger token-limit errors, while flawed planning leads to logic errors (e.g., wrong model selection or misinterpretation of predicted values).
- Reliance on the availability and scope of precomputed databases and fine-tuned MOFTransformer models; properties not covered may require additional model development.
- GPT-3.5 variants showed high parent–child overlap in the GA (near-duplicate children), indicating limitations for executing the GA with some LLMs.