
Accelerating the prediction of CO2 capture at low partial pressures in metal-organic frameworks using new machine learning descriptors
I. B. Orhan, T. C. Le, et al.
Ibrahim B. Orhan, Tu C. Le, Ravichandar Babarao, and Aaron W. Thornton introduce new machine learning descriptors for screening metal-organic frameworks for CO2 capture at low partial pressures. The descriptors cut descriptor computation time by orders of magnitude while retaining useful accuracy for direct air capture applications.
Introduction
The study addresses the challenge of rapidly identifying metal-organic frameworks (MOFs) with high CO2 capture capacity at low partial pressures, a key requirement for direct air capture (DAC) and related applications. Rising atmospheric CO2 levels and the need for economical carbon capture and storage (CCS) motivate faster, scalable screening methods. While MOFs are promising due to tunable structures and adaptability, evaluating the vast and growing MOF chemical space via conventional simulations or experiments is infeasible. Machine learning (ML) trained on molecular simulation outputs offers a route to accelerate screening, but requires informative descriptors that capture key adsorption physics under low-pressure conditions where electrostatics are crucial and humidity complicates selectivity. This work proposes new ML descriptors—Effective Point Charge (EPoCh)—to encode electrostatic effects from framework partial charges, and systematically evaluates their utility along with geometrical, chemical, and energy (Henry coefficient) features to predict low-pressure CO2 adsorption and enable rapid down-selection of candidates for DAC-relevant pressures (40 Pa, 1 kPa, 4 kPa).
Literature Review
Prior ML efforts have proven effective for gas adsorption/selectivity in porous materials. Aghaji et al. used geometrical descriptors with classification models to identify MOFs for methane purification with high CO2 uptake/selectivity, achieving AUC ~0.95. Anderson et al. combined DFT-optimized binding configurations and GCMC simulations to train multiple ML models on 400 computationally constructed MOFs, reaching R2 up to 0.905 and using genetic algorithms for design insights. Broader literature shows ML’s growing role across domains and in MOF screening for gas separations. However, applicability to DAC-like low pressures, where electrostatics dominate, and the need to balance accuracy with computational cost, remains a gap. This study builds on these works by introducing electrostatics-focused EPoCh descriptors and benchmarking them against traditional geometrical/chemical features and the Henry coefficient for low-pressure CO2 uptake prediction.
Methodology
Dataset curation: Two datasets were combined: CORE MOF (3378 structures) and Anion-pillared MOF (936 structures). Partial atomic charges were obtained via DFT-based DDEC methods. No feature-based pre-screening was applied. Descriptors were grouped as follows:
- Atom type (A): per-unit-volume counts of H, C, N, F, Cl, Br, V, Cu, Zn, Zr.
- Geometric (B): surface areas, volumes, pore diameters, density, and unit cell volume, computed with Zeo++.
- Chemical (C): total degree of unsaturation, metallic percentage, O:metal ratio, electronegative-to-total ratio, weighted electronegativity per atom, N:O ratio.
- Effective Point Charge (D): charge-based uptake estimates at 40 Pa, 1 kPa, and 4 kPa, reported as totals, per-atom averages, and per-unit-volume values.
- Energy (E): Henry coefficient KH for CO2; KH for H2O was used for the hydrophobicity analysis.
The dataset is available in Supplementary Data 1.
Molecular simulations (targets and energy descriptors): CO2 uptakes at 40 Pa, 1 kPa, and 4 kPa were computed via GCMC in RASPA with a 12.5 Å cutoff, 20,000 cycles, and T = 298 K. Van der Waals (VDW) parameters for the framework were taken from UFF, and Coulomb interactions used the DDEC charges. CO2 move probabilities were translation 0.5, rotation 0.5, reinsertion 0.5, and swap 1. Simulations that exceeded 24 h or ended with errors were discarded. Henry coefficients were computed via Widom insertion (probability 1) with ideal gas Rosenbluth weights; the CO2 KH values served as ML descriptors, and the H2O KH values were used to assess hydrophobicity (H2O modeled with TIP4P/2005). Zeo++ was run with a probe radius of 1.5 Å, 50,000 sample points for pore volumes, and 2000 for surface areas.
Effective Point Charge (EPoCh) descriptors: To isolate electrostatic contributions, hypothetical single atoms with no VDW interactions and no mass were simulated in RASPA with assigned charges from −5e to +5e at pressures below 0.1 bar. The sampled adsorption data were fitted to a polynomial surface: f(Q, p) = a_1 Q + a_2 Q^2 + a_3 Q^3 + a_4 Q^4 + a_5 Q^5 + a_6 Q^6 + a_7 Q^7 + a_8 p + a_9 p^2 + a_{10} p^3 + a_{11}, where Q is the partial charge and p is the partial pressure; the fitted coefficients are provided in the SI. For each atom i of a MOF with charge Q_i, E_i = max(0, f(Q_i, p)) is computed (mol cm−3), and the E_i values are aggregated as totals, per-atom averages, and per-unit-volume values to form the EPoCh feature set at the specified pressures.
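The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the coefficient values in `PLACEHOLDER_COEFFS` are hypothetical stand-ins (the fitted values are in the paper's SI and are not reproduced here), and the function names and the `cell_volume` argument are assumptions introduced for the example.

```python
from typing import Dict, Sequence

# HYPOTHETICAL placeholder coefficients a_1..a_11 (index 0 holds a_1).
# The real fitted values are in the paper's SI and are NOT used here.
PLACEHOLDER_COEFFS = [0.0, 1e-3, 0.0, 0.0, 0.0, 0.0, 0.0, 1e-6, 0.0, 0.0, 0.0]


def epoch_surface(q: float, p: float, a: Sequence[float]) -> float:
    """Evaluate f(Q, p) = sum_{k=1..7} a_k Q^k + a_8 p + a_9 p^2 + a_10 p^3 + a_11."""
    charge_part = sum(a[k - 1] * q**k for k in range(1, 8))
    pressure_part = a[7] * p + a[8] * p**2 + a[9] * p**3
    return charge_part + pressure_part + a[10]


def epoch_features(
    charges: Sequence[float],
    p: float,
    cell_volume: float,
    a: Sequence[float] = PLACEHOLDER_COEFFS,
) -> Dict[str, float]:
    """Aggregate E_i = max(0, f(Q_i, p)) over all atoms of one framework."""
    per_atom = [max(0.0, epoch_surface(q, p, a)) for q in charges]
    total = sum(per_atom)
    return {
        "total": total,
        "per_atom": total / len(per_atom),
        "per_volume": total / cell_volume,
    }


# Toy framework: four atoms with made-up partial charges, evaluated at 40 Pa.
features = epoch_features([0.5, -0.5, 1.0, -1.0], p=40.0, cell_volume=1000.0)
print(features)
```

Because each atom needs only one polynomial evaluation, the cost grows linearly with the number of atoms, which is consistent with the very short EPoCh timings reported in the study.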
Machine learning: Random Forest (RF) models (Scikit-Learn) were trained separately for each pressure, using an 80/20 train/test split. Performance was assessed via R2 (r2_score) and RMSE. Feature group combinations evaluated: A+B+C (benchmark), A+B+C+D (adding EPoCh), A+B+C+E (adding Henry coefficient), and All (A+B+C+D+E). A pseudo-classification threshold of 1 mmol g−1 uptake was used to compute precision and recall for screening purposes. Computational time to obtain each descriptor group was estimated using timings from the Anion-pillared dataset on HPC (GCMC, Widom, Zeo++) and a desktop (EPoCh, chemical, atom-type) to evaluate speed-accuracy trade-offs.
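The evaluation metrics named above can be sketched without any dependencies. The study used Scikit-Learn (for example `r2_score`); the pure-Python functions below are stand-ins written for illustration, and the uptake values are made up. The sketch assumes that "positive" means an uptake strictly above the 1 mmol g−1 threshold.

```python
import math
from typing import Sequence, Tuple

THRESHOLD_MMOL_G = 1.0  # pseudo-classification threshold from the study


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between simulated and predicted uptakes."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def precision_recall(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    threshold: float = THRESHOLD_MMOL_G,
) -> Tuple[float, float]:
    """Treat regression outputs as classes: 'high uptake' if above threshold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t > threshold and p > threshold)
    fp = sum(1 for t, p in pairs if t <= threshold and p > threshold)
    fn = sum(1 for t, p in pairs if t > threshold and p <= threshold)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Made-up simulated (GCMC) and predicted (ML) uptakes in mmol/g.
simulated = [0.2, 1.5, 2.0, 0.8, 1.2]
predicted = [0.3, 1.4, 0.9, 1.1, 1.3]
print(r2(simulated, predicted), rmse(simulated, predicted))
print(precision_recall(simulated, predicted))
```

For screening, recall is the metric to watch: a missed high-uptake MOF is lost from the shortlist, whereas a false positive only costs one extra follow-up simulation.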
Humidity/hydrophobicity analysis: H2O KH thresholds from literature were used to identify hydrophobic MOFs (e.g., KH(H2O) < 1.0×10^−5; stricter 2.6×10^−7). Comparative analysis of KH(CO2) vs KH(H2O) identified MOFs with stronger affinity for CO2 than H2O.
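The two filters described above amount to simple comparisons on the Henry coefficients. The sketch below is an illustration only: the `MofRecord` type, the function names, and the three records are hypothetical, and both KH values are assumed to be expressed in the same units as the thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List

HYDROPHOBIC_KH_H2O = 1.0e-5         # threshold quoted in the text
STRICT_HYDROPHOBIC_KH_H2O = 2.6e-7  # stricter threshold quoted in the text


@dataclass
class MofRecord:
    name: str
    kh_co2: float  # Henry coefficient for CO2
    kh_h2o: float  # Henry coefficient for H2O (same units, by assumption)


def is_hydrophobic(mof: MofRecord, limit: float = HYDROPHOBIC_KH_H2O) -> bool:
    """A MOF counts as hydrophobic when KH(H2O) falls below the limit."""
    return mof.kh_h2o < limit


def prefers_co2(mof: MofRecord) -> bool:
    """True when the framework has a stronger affinity for CO2 than for H2O."""
    return mof.kh_co2 > mof.kh_h2o


def screen(mofs: List[MofRecord]) -> Dict[str, List[str]]:
    """Return the names passing each of the three criteria."""
    return {
        "hydrophobic": [m.name for m in mofs if is_hydrophobic(m)],
        "strictly_hydrophobic": [
            m.name for m in mofs if is_hydrophobic(m, STRICT_HYDROPHOBIC_KH_H2O)
        ],
        "prefers_co2": [m.name for m in mofs if prefers_co2(m)],
    }


# Hypothetical records with made-up Henry coefficients.
candidates = [
    MofRecord("MOF-A", kh_co2=2e-5, kh_h2o=5e-6),
    MofRecord("MOF-B", kh_co2=1e-3, kh_h2o=2e-2),
    MofRecord("MOF-C", kh_co2=1e-6, kh_h2o=1e-7),
]
print(screen(candidates))
```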
Key Findings
- Dataset and correlations: 12,637 GCMC simulation results across the 3 pressures (approx. 4243 at 40 Pa, 4186 at 1 kPa, 4208 at 4 kPa; the counts differ because simulations that hit the 24 h cutoff were discarded). EPoCh descriptors show among the strongest positive correlations with CO2 uptake, followed by several chemical descriptors.
- Henry’s law validity: The distribution of KH(CO2) spans an enormous range (max ~10^30 times minimum). Henry’s law aligns with GCMC only at very low KH and very low pressures. For MOFs with KH ≤ 0.001, R2 between Henry’s-law-calculated uptake and GCMC is 0.982 (40 Pa), 0.924 (1000 Pa), but drops to 0.206 at 4000 Pa. In the full dataset, Henry’s law shows essentially no predictive power (R2 ~ −9.8×10^15). Thus, Henry’s law alone cannot identify high-uptake MOFs at the studied pressures.
- ML performance (R2): At 40 Pa, the A+B+C baseline gives R2=0.541; adding EPoCh (D) improves this to 0.715, and adding the Henry coefficient (E) instead gives 0.916. Without the Henry coefficient (A+B+C+D), test R2 values were 0.715 (40 Pa), 0.742 (1 kPa), and 0.698 (4 kPa). With all descriptors, test R2 exceeded 0.9 at every pressure: 0.917 (40 Pa), 0.936 (1 kPa), and 0.933 (4 kPa).
- Pseudo-classification (threshold 1 mmol g−1): Using all descriptors, recall = 0.969 (40 Pa), 0.975 (1 kPa), 0.983 (4 kPa); precision = 0.849, 0.914, 0.952, respectively. Without Henry coefficient, recall = 0.719, 0.838, 0.883; precision = 0.807, 0.778, 0.791.
- Error ranges without Henry: The largest discrepancies between simulated and predicted uptakes on the test set were 3.48 mmol g−1 at 40 Pa, 3.8 mmol g−1 at 1 kPa, and 3.96 mmol g−1 at 4 kPa. The model tended to slightly overpredict the count of MOFs above 1 mmol g−1.
- Computational cost: Estimated time to process 10,000 MOFs: GCMC ~1.09×10^8 s (~3.45 years); Henry coefficient ~3.27×10^7 s (~1 year); geometric descriptors ~5.4×10^4 s (~15 h); EPoCh ~20 s; atom type ~3.7 s. Time-weighted metrics show EPoCh-based models outperform Henry-based models by >450× in adjusted R2 and ~30,000% in time-weighted RMSE.
- Feature importance: When included, Henry coefficient dominates RF importance (~0.83 of total), with all other descriptors summing to ~0.17. Excluding Henry, EPoCh descriptors become the most influential features (per MDI analysis).
- Generalization across pressures (single model with pressure as a feature, excluding Henry): Interpolation to 400 Pa gave R2=0.54; extrapolation to 10,000 Pa gave R2=0.72. Precision remained high (0.99) but recall dropped (0.42 at 400 Pa; 0.74 at 10,000 Pa), suggesting pressure-specific models are preferable without Henry.
- Hydrophobicity: With KH(H2O) < 1.0×10^−5, 249 MOFs were classified as hydrophobic; they showed lower EPoCh averages and KH(CO2) (median 2.43×10^−5; max 9.56×10^−4) and lower CO2 uptakes: medians ~0.04 (40 Pa), 0.62 (1 kPa), 1.77 mmol g−1 (4 kPa). With stricter KH(H2O) < 2.6×10^−7, only 71 MOFs qualified; maximum uptake at 4 kPa was 0.411 mmol g−1 versus 7.93 mmol g−1 in the full dataset. Selecting MOFs with KH(CO2) > KH(H2O) yielded >600 candidates; 206 datapoints exceeded 1 mmol g−1 (only 7 at 40 Pa, including four SIFSIX MOFs). SIFSIX-3-Cu achieved 2.49 mmol g−1 at 40 Pa, aligning with reported experimental uptake near 0.1 bar; at 1 kPa and 4 kPa, maximum uptakes were 4.841 mmol g−1 (LOGBEO) and 5.332 mmol g−1 (SIHLUQ).
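The computational-cost figures in the findings above can be checked with a short calculation. The sketch below reuses the quoted totals for 10,000 MOFs; the linear scaling with library size and the helper names are assumptions made for illustration.

```python
SECONDS_PER_HOUR = 3600.0
SECONDS_PER_YEAR = 365.25 * 24 * SECONDS_PER_HOUR

# Estimated seconds to process 10,000 MOFs, as quoted in the findings.
COST_PER_10K_S = {
    "GCMC": 1.09e8,
    "Henry": 3.27e7,
    "geometric": 5.4e4,
    "EPoCh": 20.0,
    "atom type": 3.7,
}


def scaled_cost(group: str, n_mofs: int) -> float:
    """Seconds for n_mofs structures, ASSUMING cost scales linearly with count."""
    return COST_PER_10K_S[group] * n_mofs / 10_000


def speedup(slow: str, fast: str) -> float:
    """How many times faster the `fast` descriptor group is than the `slow` one."""
    return COST_PER_10K_S[slow] / COST_PER_10K_S[fast]


for group, seconds in COST_PER_10K_S.items():
    years = seconds / SECONDS_PER_YEAR
    hours = seconds / SECONDS_PER_HOUR
    print(f"{group:>10}: {seconds:10.3g} s = {hours:10.3g} h = {years:8.3g} years")

print(f"EPoCh vs Henry speedup: {speedup('Henry', 'EPoCh'):.3g}x")
```

On these numbers, computing EPoCh features is roughly a million times cheaper than computing Henry coefficients by Widom insertion, which is what makes pre-screening of large hypothetical libraries practical.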
Discussion
The study demonstrates that ML models enriched with electrostatics-focused EPoCh descriptors can accurately and efficiently predict low-pressure CO2 uptake in MOFs, addressing the need for rapid screening in vast materials spaces. While the Henry coefficient is the single most informative descriptor, it is computationally expensive and often fails as a direct predictive model for uptake outside the linear Henry regime at the pressures of interest. EPoCh descriptors capture the impact of framework partial charges on adsorption without requiring costly Widom insertion simulations, delivering strong predictive performance—especially for pseudo-classification of top performers—at a tiny fraction of the computational time. This enables practical pre-screening of large hypothetical MOF libraries and prioritization for detailed simulation or synthesis.
The analysis also highlights the importance of moisture: many MOFs with favorable CO2 electrostatic interactions may preferentially adsorb H2O under ambient conditions, reducing their practical utility for DAC. Incorporating hydrophobicity criteria via KH(H2O) or developing ML models for H2O uptake would further refine candidate selection. Finally, models should be tailored to specific pressures unless the Henry coefficient is available, as pressure-generalized models without energy descriptors suffer in recall at untrained pressures.
Conclusion
Expanding beyond conventional geometric and atom-type descriptors, the inclusion of energy (Henry coefficient) and especially the newly introduced EPoCh descriptors markedly improves ML prediction of low-pressure CO2 uptake in MOFs. EPoCh, derived from partial charges via an efficient surrogate of electrostatic adsorption, emerges as the second-most informative descriptor group after the Henry coefficient, yet is orders of magnitude faster to compute. Using all descriptors, RF models achieve R2 > 0.9 across 40 Pa, 1 kPa, and 4 kPa, and deliver high recall and precision for identifying MOFs exceeding 1 mmol g−1 uptake. Time-weighted analyses underscore EPoCh’s superior speed-accuracy trade-off compared with Henry-based features. Considering humidity constraints is crucial, and integrating H2O uptake predictions would enable multi-objective screening. Future work could extend EPoCh-like descriptors to other gases, incorporate functional group information, model mass transfer and regeneration properties, and develop pressure-generalized models with improved recall, thereby further accelerating discovery of MOFs for DAC and other low partial pressure applications.
Limitations
- Henry’s law applicability is limited beyond the infinite dilution regime; it cannot directly predict uptake at the studied pressures for high-affinity MOFs.
- The EPoCh descriptors capture only electrostatic interactions and neglect van der Waals contributions, relying on accurate partial atomic charges (DDEC); inaccuracies in charges can affect predictions.
- No direct ML model for H2O uptake was built due to lack of data; moisture impacts were assessed indirectly via KH(H2O), limiting comprehensive multi-gas screening.
- Mass transfer kinetics and adsorbent regeneration conditions were not addressed, though critical for process performance.
- Separate models were trained per pressure; without Henry coefficients, pressure-generalized models showed lower recall at untrained pressures.
- Energy grids and full isotherms were not computed to save time, potentially omitting information that could enhance model fidelity.
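The first limitation above, that Henry's law overpredicts uptake outside the infinite-dilution regime, can be illustrated with a toy comparison against a saturating isotherm. The Langmuir model, the saturation capacity `Q_SAT`, and the units (KH in mol kg−1 Pa−1, uptake in mol kg−1, equal to mmol g−1) are assumptions chosen for this sketch; the study itself compared Henry's-law uptakes with GCMC results, not with a Langmuir fit.

```python
Q_SAT = 8.0  # HYPOTHETICAL saturation capacity in mol/kg, for illustration only


def henry_uptake(kh: float, p: float) -> float:
    """Linear Henry's-law uptake q = KH * p (valid only at infinite dilution)."""
    return kh * p


def langmuir_uptake(kh: float, p: float, q_sat: float = Q_SAT) -> float:
    """Langmuir isotherm whose low-pressure slope equals the Henry coefficient."""
    b = kh / q_sat  # affinity constant chosen so that dq/dp at p = 0 is KH
    return q_sat * b * p / (1.0 + b * p)


def relative_overprediction(kh: float, p: float) -> float:
    """Fraction by which Henry's law exceeds the saturating (Langmuir) uptake."""
    saturating = langmuir_uptake(kh, p)
    return (henry_uptake(kh, p) - saturating) / saturating


# KH = 0.001 is the upper bound of the low-affinity subset discussed above.
for pressure in (40.0, 1000.0, 4000.0):
    print(
        f"p = {pressure:6.0f} Pa: Henry = {henry_uptake(1e-3, pressure):.3f}, "
        f"Langmuir = {langmuir_uptake(1e-3, pressure):.3f}, "
        f"overprediction = {relative_overprediction(1e-3, pressure):.1%}"
    )
```

In this toy model the two curves nearly coincide at 40 Pa and diverge as pressure rises, which mirrors the qualitative trend reported for the low-KH subset (good agreement at 40 Pa, poor agreement at 4000 Pa). For high-affinity frameworks the linear estimate can exceed any physical capacity.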