A generative artificial intelligence framework based on a molecular diffusion model for the design of metal-organic frameworks for carbon capture

H. Park, X. Yan, et al.

GHP-MOFassemble, a generative AI framework developed by Hyun Park and colleagues, designs metal-organic frameworks (MOFs) for CO2 capture. The study reports six AI-generated MOFs whose simulated CO2 capacities at 0.1 bar exceed those of 96.9% of the structures in the hMOF dataset.

Introduction

The study addresses the challenge of discovering metal-organic frameworks (MOFs) with high CO2 capture performance within an enormous chemical design space of building blocks and topologies. MOFs are promising for gas adsorption and separation, but stability concerns (e.g., moisture sensitivity, recyclability) and the sheer number of possible structures make exhaustive experimental screening impractical. The research aims to accelerate rational MOF design with a generative AI framework that creates synthesizable linkers, assembles them with fixed inorganic nodes and a fixed topology, and then screens candidates using machine learning and physics-based simulations. The goal is to identify novel, stable MOFs with high CO2 capacity at 0.1 bar and 300 K, and to demonstrate a scalable, high-throughput workflow.

Literature Review

Related work falls into three groups: (1) database search methods that filter large MOF databases (e.g., CORE DB, CSD subsets) for candidates with specific adsorption or separation properties; (2) ML-assisted screening, which trains regressors on smaller datasets to predict adsorption properties (CO2/H2 separation, CH4 adsorption) from geometric and chemical descriptors such as pore size, void fraction, electronegativity, and atomic-property-weighted radial distribution functions, and which has also used graph neural networks such as the Atomistic Line Graph Neural Network; and (3) generative modeling approaches (VAEs, GANs, normalizing flows, autoregressive models, diffusion models) for de novo candidate generation, e.g., the Supramolecular VAE for isoreticular optimization of MOFs for CO2 capacity and selectivity. Diffusion models (e.g., DiffLinker, EDM, DiGress) have excelled in molecular design, especially in drug discovery. This work transfers diffusion-based molecular generation to MOF linker design while fixing node type and topology for isoreticular exploration.

Methodology

Framework: GHP-MOFassemble comprises Decompose, Generate, and Screen & Predict components, targeting pcu topology with three nodes (Cu paddlewheel, Zn paddlewheel, Zn tetramer).

Dataset analysis: From the hMOF dataset (137,652 hypothetical MOFs), the three most frequent node–topology pairs (Cu PW–pcu, Zn PW–pcu, Zn TM–pcu) account for 102,117 MOFs. After filtering for valid MOFids and SMILES, 78,238 MOFs remain. High-performing MOFs are defined as those with CO2 capacity > 2 mmol g−1 at 0.1 bar (top 5%). Higher catenation level (cat0–cat3) correlates with higher CO2 capacity.

Decompose (fragmentation): For each node–topology subset (e.g., Zn tetramer–pcu), select high performers (>2 mmol g−1 at 0.1 bar), extract unique linker SMILES (limit to at most three linkers per MOF), and fragment linkers using MMPA (as in DeLinker) with minimum connection atoms = 3, min fragment size = 5, min path length = 2, fragments at least two atoms apart. This yields 540 unique molecular fragment conformers across the three node types (Cu PW: 180; Zn PW: 162; Zn TM: 198).

Generate (DiffLinker-based linker generation and assembly):

Use pre-trained DiffLinker (an EGNN-based, E(3)-equivariant conditional diffusion model trained on GEOM) to connect input fragment pairs with 3D heavy atoms, sampling the number of additional atoms from 5 to 10. For each fragment, sample 20 times per atom-count setting, producing 64,800 heavy-atom linkers (540 fragments × 6 atom counts × 20 samples). Convert to SMILES and add hydrogens via OpenBabel, yielding 56,257 linkers after removing those with erroneous hydrogen assignments.
Identify dummy atoms for assembly: for carboxylate-type linkers, replace the two carboxyl carbon atoms with dummies and remove redundant O/H atoms; for heterocyclic N-donor linkers (Cu/Zn paddlewheel), place dummies at N–metal bond distance along vectors from adjacent carbons to terminal nitrogens. Obtain 16,162 linkers with dummy atoms.
Element filter: remove linkers containing S, Br, I to match hMOF element set (C, N, O, F, P, Cl permitted), resulting in 12,305 linkers. Evaluate synthesizability and diversity with SAscore, SCscore, validity, uniqueness, and internal diversity (MOSES). Trends show increased complexity and diversity with higher sampled-atom counts.
MOF assembly: Randomly select triples of generated linkers (duplicates allowed) and assemble each triple with one of the three nodes into a pcu MOF at one of four catenation levels (cat0–cat3) using site translation (Pymatgen). Sampling 10,000 linker triples for each of the 12 node–catenation pairs (3 nodes × 4 catenation levels) generates 120,000 assembled MOFs.
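The counts in the Generate step follow from simple enumeration, which the sketch below reproduces. It is an illustration only and is not the authors' code. The linker IDs are hypothetical placeholders, the function names are invented here, and the sketch draws from a single shared linker pool, whereas the real pipeline may restrict which linkers are paired with which node.

```python
import itertools
import random

# Counts quoted in the text above.
N_FRAGMENTS = 540
ATOM_COUNTS = range(5, 11)          # 5 to 10 additional atoms, inclusive
SAMPLES_PER_SETTING = 20

NODES = ["Cu paddlewheel", "Zn paddlewheel", "Zn tetramer"]
CATENATION_LEVELS = ["cat0", "cat1", "cat2", "cat3"]
TRIPLES_PER_PAIR = 10_000


def n_generated_linkers():
    """Size of the sampling grid: fragments x atom-count settings x samples."""
    return N_FRAGMENTS * len(ATOM_COUNTS) * SAMPLES_PER_SETTING


def sample_assembly_plan(linker_ids, per_pair=TRIPLES_PER_PAIR, seed=0):
    """Draw linker triples (duplicates allowed) for every node-catenation pair.

    Assumption: one shared linker pool for all nodes (a simplification).
    """
    rng = random.Random(seed)
    plan = []
    for node, cat in itertools.product(NODES, CATENATION_LEVELS):
        for _ in range(per_pair):
            # random.choices samples with replacement, so a linker may repeat.
            triple = tuple(rng.choices(linker_ids, k=3))
            plan.append((node, cat, triple))
    return plan


if __name__ == "__main__":
    print(n_generated_linkers())                          # 64800
    linker_ids = [f"linker_{i}" for i in range(12_305)]   # hypothetical IDs
    print(len(sample_assembly_plan(linker_ids)))          # 120000
```

Each entry of the plan would then be handed to an assembly routine (the study uses site translation in Pymatgen), which is outside the scope of this sketch.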

Screen & Predict:

Inter-atomic distance check: Compute pairwise distance matrix from CIF; discard structures with any interatomic distance below element-pair minimum from OChemDb. 78,796/120,000 pass.
Pre-simulation check: Use cif2lammps to assign UFF4MOF parameters; discard chemically invalid bond topology (unsupported elements, invalid coordinations). 18,770/78,796 pass and have LAMMPS inputs.
CGCNN ensemble screening: Train three modified CGCNN regressors (adjacency list formulation) on hMOF (80/10/10 split) to predict CO2 capacity at 0.1 bar (5000 epochs, batch size 160, Adam optimizer with learning rate 1e-4 and weight decay 2e-5). Test performance of the individual models: R2 ≈ 0.932–0.937, MAE ≈ 0.098–0.100 mmol g−1, RMSE ≈ 0.170 mmol g−1. The ensemble reaches a test MAE ≈ 0.093 mmol g−1; as a classifier it achieves a balanced accuracy of 90.7% and an overall accuracy of 98.4%. Predict capacities for the 18,770 MOFs and select high performers whose ensemble mean plus standard deviation is ≥ 2 mmol g−1 and whose ensemble standard deviation is ≤ 0.2 mmol g−1. This identifies 364 predicted high performers, predominantly cat2/cat3.
MD stability screening: Equilibrate each candidate in LAMMPS (UFF4MOF) in a triclinic NPT ensemble at 300 K and 1 atm, using a 2×2×2 supercell and 400,000 steps with a 0.5 fs timestep (200 ps in total). Discard the structure if any lattice parameter (a, b, c, α, β, γ) changes by more than 5%. 102 MOFs remain.
GCMC validation: Assign partial charges via PACMOF (DDEC-trained ML), compute helium void fraction (RASPA), and run GCMC at 0.1 bar, 300 K with UFF4MOF and electrostatics to obtain CO2 excess adsorption. From 102, six MOFs exceed 2 mmol g−1. Structural visualizations and building block compositions provided.
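Three of the screening rules above are simple predicates, and the following sketch states them in code. It rests on several assumptions. The distance check ignores periodic images, which a real CIF-based check has to handle. The `min_dist` table and the `default_min` fallback are hypothetical stand-ins for the OChemDb element-pair minima. The ensemble spread is computed as a population standard deviation, which the source does not specify. The function names are invented for illustration.

```python
import math
import statistics


def passes_distance_check(symbols, coords, min_dist, default_min=0.5):
    """True if no atom pair is closer than its element-pair minimum (in angstrom).

    Non-periodic simplification; `min_dist` maps sorted element pairs to a
    threshold and `default_min` is a hypothetical fallback value.
    """
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            pair = tuple(sorted((symbols[i], symbols[j])))
            if math.dist(coords[i], coords[j]) < min_dist.get(pair, default_min):
                return False
    return True


def is_predicted_high_performer(predictions, cutoff=2.0, max_std=0.2):
    """Ensemble rule from the text: mean + std >= cutoff and std <= max_std."""
    mean = statistics.fmean(predictions)
    std = statistics.pstdev(predictions)  # assumption: population std
    return mean + std >= cutoff and std <= max_std


def lattice_is_stable(initial, final, tolerance=0.05):
    """True if every lattice parameter (a, b, c, alpha, beta, gamma) changed
    by at most `tolerance` (relative) between the start and end of the MD run."""
    return all(abs(f - i) / abs(i) <= tolerance for i, f in zip(initial, final))


if __name__ == "__main__":
    # Placeholder threshold, not an OChemDb value.
    thresholds = {("C", "O"): 1.0}
    print(passes_distance_check(["C", "O"], [(0, 0, 0), (0, 0, 1.2)], thresholds))
    print(is_predicted_high_performer([1.9, 2.0, 2.1]))
    print(lattice_is_stable((10, 10, 10, 90, 90, 90), (10.4, 10, 10, 90, 90, 90)))
```

Because the ensemble rule adds the spread to the mean, a candidate whose mean prediction is slightly below 2 mmol g−1 can still be selected, provided the three models agree to within 0.2 mmol g−1.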

Computational performance: On leadership-class computing systems, 120,000 MOFs were assembled in 33 min (28 cores); geometry screening took 40 min (128 cores); the pre-simulation check took 205 min (128 cores); AI inference took 50 min (one NVIDIA A40 GPU). MD: ~11 min/MOF (6–14 MPI ranks). GCMC: ~6 h/MOF (1 CPU). The entire discovery cycle, from assembly to GCMC on top candidates, completes within ~12 hours using distributed resources.

Key Findings

Generative design and screening at scale: 64,800 linkers generated (heavy atoms) from 540 fragments via DiffLinker; after hydrogen addition and validation, 56,257 linkers; 16,162 linkers with dummy atoms; after element filter (remove S, Br, I), 12,305 linkers used for assembly. 120,000 MOFs assembled (three nodes × four catenation levels) with 78,796 passing distance checks and 18,770 passing pre-simulation checks.
AI screening: CGCNN ensemble identified 364 high-performing MOFs (ensemble mean + std ≥ 2 mmol g−1 at 0.1 bar). High performers were predominantly cat2 and cat3.
Stability and adsorption validation: MD stability filter retained 102 MOFs (<5% change in all lattice parameters). GCMC at 0.1 bar, 300 K identified six MOFs with CO2 capacity > 2 mmol g−1, outperforming 96.9% of hMOF structures.
Quantitative capacities (mmol g−1, mean ± SD), GCMC vs CGCNN:
• MOF-1: 3.686 ± 0.017 (GCMC) vs 2.04 ± 0.16 (CGCNN)
• MOF-2: 2.532 ± 0.014 (GCMC) vs 2.10 ± 0.15 (CGCNN)
• MOF-3: 2.518 ± 0.0061 (GCMC) vs 2.06 ± 0.12 (CGCNN)
• MOF-4: 2.423 ± 0.017 (GCMC) vs 1.91 ± 0.12 (CGCNN)
• MOF-5: 2.169 ± 0.0047 (GCMC) vs 2.37 ± 0.13 (CGCNN)
• MOF-6: 2.005 ± 0.016 (GCMC) vs 1.89 ± 0.13 (CGCNN)
Catenation effect: Both hMOF analysis and AI-generated set show higher proportions of high performers in cat2 and cat3 MOFs; generated distributions exhibit more high-performing cat2/cat3 and fewer high-performing cat0/cat1 relative to hMOF.
Linker chemistry: Generated high-performing linkers tend to include more hydroxyl groups compared to hMOF (which has higher carboxyl prevalence); primary amine and nitrile frequencies are similar. Ring substructures are frequent in top candidates, potentially contributing to rigidity and favorable CO2–ring interactions.
Novelty: The maximum Tanimoto similarity of generated high-performing linkers to hMOF linkers peaks at 0.3–0.4, indicating substantial novelty; a tail toward higher similarity shows that some generated linkers are structurally close to known ones.
Model performance: Ensemble CGCNN achieved R2 ≈ 0.94, MAE ≈ 0.09 mmol g−1 on test; ensemble uncertainty low for ~96% of samples (std < 0.2 mmol g−1); classifier balanced accuracy 90.7%.
Throughput: From assembly to high-performer selection via AI: ~5 h 7 min; including MD and GCMC on finalists, full cycle completes within ~12 h on modern HPC resources.
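The candidate counts reported above form a screening funnel. The short sketch below tabulates how much of the pool each stage retains, using only the counts quoted in this summary; the stage labels and the function name are chosen here for illustration.

```python
# Candidate counts after each stage, as quoted in the text.
FUNNEL = [
    ("assembled", 120_000),
    ("interatomic distance check", 78_796),
    ("pre-simulation check", 18_770),
    ("CGCNN predicted high performers", 364),
    ("MD stable", 102),
    ("GCMC capacity > 2 mmol/g", 6),
]


def retention(funnel):
    """Return (stage, count, fraction of previous stage, fraction of start)."""
    rows = []
    start = funnel[0][1]
    previous = start
    for stage, count in funnel:
        rows.append((stage, count, count / previous, count / start))
        previous = count
    return rows


if __name__ == "__main__":
    for stage, count, step_frac, total_frac in retention(FUNNEL):
        print(f"{stage:34s} {count:>7d}  {step_frac:8.2%}  {total_frac:9.4%}")
```

Overall, 6 of 120,000 assembled structures (0.005%) pass every filter, which illustrates why the inexpensive geometric and ML filters are placed ahead of MD and GCMC.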

Discussion

The framework addresses the core challenge of exploring vast MOF chemical spaces by combining generative AI for linker creation with rigorous multi-stage screening and physics-based validation. By fixing node type and topology (isoreticular design) while varying linkers, the diffusion model efficiently samples a broad, chemically valid, and diverse linker space, yielding novel MOFs with strong predicted and simulated CO2 uptake. The predominance of cat2/cat3 among high performers aligns with known benefits of interpenetration (enhanced confinement and MOF–CO2 interactions), reinforcing design guidelines that emphasize catenation control. Functional group analysis suggests that hydroxyl-bearing and ring-containing linkers are enriched among AI-generated high performers, complementing traditional carboxylate-rich designs and revealing alternative chemistries that retain synthesizability (as measured by SAscore/SCscore). The CGCNN ensemble offers reliable pre-screening with quantified epistemic uncertainty, drastically reducing the number of expensive simulations required. Overall, the combined AI–HPC pipeline demonstrates the feasibility of identifying stable, high-capacity, and chemically novel MOFs with turnaround times compatible with high-throughput discovery.

Conclusion

This work introduces GHP-MOFassemble, a scalable generative AI and screening framework that designs MOFs by generating synthesizable linkers and assembling them with selected nodes and topology, followed by AI prediction and MD/GCMC validation. Applied to pcu MOFs with Cu/Zn paddlewheel and Zn tetramer nodes, the framework generated 120,000 candidates and, through successive filters, identified six novel MOFs with GCMC-validated CO2 capacities above 2 mmol g−1 at 0.1 bar and 300 K, placing them in the top ~3% of hMOF performance. The study confirms the importance of catenation and highlights functional group trends and ring motifs in high-performing linkers. The entire workflow can be completed within ~12 hours on modern supercomputers, demonstrating practical throughput for materials discovery. Future directions include: expanding node/topology diversity beyond pcu; integrating DFT for higher-fidelity energetics; adapting to broader element chemistries; incorporating moisture and multicomponent separation conditions; using online/active learning to iteratively refine the generative model; and deploying the pipeline across larger datasets to systematically accelerate MOF discovery for carbon capture.

Limitations

Scope limited to pcu topology and three node types (Cu paddlewheel, Zn paddlewheel, Zn tetramer), constraining structural diversity; results may not generalize to other topologies/nodes.
DiffLinker was trained on GEOM; elements S, Br, I were filtered out to match hMOF, reducing chemical space and potentially excluding viable chemistries.
Parsing of hMOF linkers incurred valency and assignment issues (e.g., carboxylate handling), which may bias fragment and linker sets.
CGCNN training relies on the hypothetical hMOF dataset and single-condition (0.1 bar, 300 K) targets; performance under other pressures/temperatures or gas mixtures is not directly addressed.
MD uses UFF4MOF and a 5% lattice-change criterion; force-field limitations and short equilibration times may affect stability assessment.
GCMC assumes rigid frameworks and uses PACMOF-assigned charges; framework flexibility and ab initio charges could change adsorption predictions.
Moisture effects and multi-component selectivities were not treated; real-world operating conditions may alter performance and stability.
Catenation constructed via site translation with fixed interlattice spacing; full exploration of interpenetration geometries was not performed.
