Machine learning insights into predicting biogas separation in metal-organic frameworks

I. Cooley, S. Boobier, et al.

In this study, Isabel Cooley, Samuel Boobier, Jonathan D. Hirst, and Elena Besley use machine learning to predict the separation of carbon dioxide and methane in metal-organic frameworks, a key step in upgrading biogas to fuel. Models trained on carefully curated data from Monte Carlo simulations predict gas uptake with high accuracy.

Introduction
Biogas upgrading requires efficient separation of CO₂/CH₄ mixtures to produce biomethane. Porous materials such as metal-organic frameworks (MOFs) offer promising routes for adsorptive and membrane-based separations, but selecting materials that combine high CO₂ uptake with strong CO₂-over-CH₄ selectivity is difficult because the two properties trade off against each other and, for membranes, diffusivity also plays a role. The vast chemical and structural diversity of experimental and hypothetical MOFs makes exhaustive experimental characterization intractable. High-throughput molecular simulations, such as grand canonical Monte Carlo (GCMC), can screen thousands of structures but are computationally costly. This study asks whether machine learning (ML) models, trained on carefully curated, experimentally reported MOFs with simulated adsorption properties, can accurately and efficiently predict biogas separation metrics, in particular CO₂ and CH₄ uptakes and a combined trade-off metric (TSN), in order to preselect promising candidates and provide structure–property insights for biogas upgrading.
Literature Review
MOFs have been widely explored for gas storage and separations, with large databases of both experimental and hypothetical structures enabling high-throughput computational screening. Prior studies show that ML can predict materials properties, including sorption and separations, with non-linear models often outperforming linear ones. Feature selection typically favors structural descriptors (e.g., pore metrics) due to low computational cost; however, energetic/chemical descriptors (e.g., Henry constants, heats of adsorption, binding energies) can improve model fidelity. Many previous ML studies relied on large, loosely curated hypothetical MOF sets, risking inclusion of unviable structures and data leakage via duplicates. Recent work emphasizes curation for better model accuracy and interpretability. For CO₂/CH₄ separations, linear models with a few structural features can perform reasonably, but incorporating chemical descriptors may further enhance predictions. This study contributes by training ML on a rigorously curated set of experimental MOFs and evaluating model transfer to an independent hypothetical MOF set.
Methodology
Dataset curation: Experimental MOF structures were taken from the CSD-derived set used in a prior biogas upgrading screen. Starting from 6768 solvent-stripped MOFs, additional stringent curation removed physically unviable entries and duplicates. Steps included: removal of structures lacking essential elements or with overlapping atoms; standardizing/deleting problematic H/D entries; oxidation-state and electron-count checks using MOSAEC to remove structures with unviable metal oxidation states (3553 flagged, leaving 3086); dimensionality filtering to retain only 2D/3D frameworks (removing 0D/1D, leaving 2401); excluding structures where charge equilibration failed (20 removed, leaving 2381); and deduplication using Weisfeiler–Lehman graph hashes and CSD refcode families, yielding a final curated training set of 1910 MOFs.
Simulations and labels: GCMC simulations (RASPA) computed adsorption for pure CO₂, pure CH₄, and a 50/50 CO₂/CH₄ binary mixture at 298 K and 10 bar. Gases were described with TraPPE models (three-site CO₂ with partial charges; single-site CH₄), framework atoms with UFF Lennard-Jones (LJ) parameters, and framework partial charges were assigned with eQeq. The LJ cutoff was 12.8 Å with the potential shifted to zero at the cutoff, electrostatics used Ewald summation with a precision of 1e-6, and supercells were built so that the minimum perpendicular distance was at least 25.6 Å (twice the cutoff). Each simulation used 10,000 equilibration and 10,000 production cycles, with Monte Carlo moves including translation, rotation, insertion/deletion, and, for the mixture, identity swaps. Gas-phase fugacities were obtained from the Peng–Robinson equation of state.
Derived metrics: single-component (SC) and binary-mixture (BM) loadings (mol kg⁻¹); CO₂/CH₄ selectivity S = (q_CO2 y_CH4)/(q_CH4 y_CO2), where q is the adsorbed loading and y the gas-phase mole fraction; and TSN = N_CO2 × log(S), where N_CO2 is the CO₂ uptake. Because the TSN distribution was skewed, log10(TSN) was also used as a regression target. For classification, a TSN threshold of 5 mol kg⁻¹ separated HIGH from LOW.
Descriptors: From 21 candidate descriptors (structural and energetic) computed with Zeo++ (pore limiting diameter (PLD), largest cavity diameter (LCD), density, volumetric surface area (VSA), void fraction (VF), etc.) and RASPA (Widom insertion heats of adsorption at infinite dilution), nine were selected after correlation analysis and interpretability considerations: PLD, LCD, density, VSA, VF, Qf(CH₄), Qf(CO₂), Qf(H₂S), Qf(H₂O). Correlated features (e.g., K₀ vs Q₀, PV/GSA vs VF) were removed; log10 scaling was applied to skewed variables (PLD, LCD, PV, and some infinite-dilution properties in preliminary analyses).
Machine learning: Models were implemented in scikit-learn. Regression targets: SC CO₂ loading, BM CO₂ loading, SC CH₄ loading, BM CH₄ loading, TSN, log10(TSN). Classification target: TSN HIGH/LOW (cutoff 5 mol kg⁻¹). Algorithms: Random Forest (RF, 500 trees, other parameters at their defaults), Support Vector Machine (RBF kernel, with nested cross-validation (CV) for C, γ, ε), Multiple Linear Regression (regression baseline), and k-Nearest Neighbours (classification; k tuned over 1–99). Evaluation used 10-fold CV on the curated dataset: R² and MAE for regression; for classification, precision, recall, and F1 for each class, accuracy, Brier score, ROC-AUC, and probability calibration behavior. Feature importance for RF used feature_importances_ averaged across folds.
External test set: 1000 MOFs randomly selected from the Northwestern hypothetical MOF database were curated via the same workflow, including geometric and charge criteria and de-duplication/similarity checks to avoid leakage, yielding 330 hypothetical MOFs. The same nine descriptors and GCMC labels were computed to test model transferability. Models were retrained on the full curated experimental set and evaluated on this external test set.
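The derived metrics described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, N_CO2 is assumed to be the binary-mixture CO₂ loading, the logarithm base is assumed to be 10 (the text only writes "log"), and a TSN exactly at the threshold is assumed to count as HIGH. The loadings are made-up numbers.

```python
import math


def selectivity(q_co2, q_ch4, y_co2=0.5, y_ch4=0.5):
    """Adsorption selectivity S = (q_CO2 * y_CH4) / (q_CH4 * y_CO2)."""
    if q_ch4 <= 0 or y_co2 <= 0:
        raise ValueError("q_ch4 and y_co2 must be positive")
    return (q_co2 * y_ch4) / (q_ch4 * y_co2)


def tsn(q_co2, q_ch4, y_co2=0.5, y_ch4=0.5, log_base=10.0):
    """TSN = N_CO2 * log(S).

    Assumptions: N_CO2 is the mixture CO2 loading and the log is base 10.
    """
    s = selectivity(q_co2, q_ch4, y_co2, y_ch4)
    return q_co2 * math.log(s) / math.log(log_base)


def tsn_class(tsn_value, threshold=5.0):
    """HIGH/LOW label; treating the threshold itself as HIGH is an assumption."""
    return "HIGH" if tsn_value >= threshold else "LOW"


# Made-up mixture loadings (q_CO2, q_CH4) in mol/kg, for illustration only.
examples = {"mof_a": (8.0, 0.8), "mof_b": (3.0, 1.5)}
for name, (q_co2, q_ch4) in examples.items():
    value = tsn(q_co2, q_ch4)
    print(name, round(selectivity(q_co2, q_ch4), 2), round(value, 2), tsn_class(value))
```

For an equimolar mixture the mole fractions cancel, so the selectivity reduces to the ratio of the two mixture loadings.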
Key Findings
- Data curation and simulation landscape: From 6768 initial solvent-stripped structures, stringent curation produced a high-quality set of 1910 2D/3D experimental MOFs. GCMC revealed trade-offs between CO₂ uptake and selectivity; very high selectivity often coincided with low CH₄ loading and not necessarily high CO₂ uptake.
- Top GCMC performers: The highest TSN MOFs commonly exhibited Zn metal centers, sql or related topologies, narrow pores (small PLD close to CH₄ kinetic diameter ~3.8 Å), and often 2D layered structures. Void fractions (VF) among top performers typically ranged 0.5–0.7.
- Cross-validated ML performance (curated training set):
  • RF regression achieved strong accuracy for uptake targets: SC CO₂ loading R²=0.954, MAE=0.657; BM CO₂ loading R²=0.906, MAE=0.645; SC CH₄ loading R²=0.883, MAE=0.437; BM CH₄ loading R²=0.927, MAE=0.135. TSN regression was less accurate with skewed errors (TSN R²=0.772, MAE=1.224; log10 TSN R²=0.835, MAE=0.097), with notable underprediction for high-TSN cases.
  • RF TSN classification (HIGH vs LOW at 5 mol kg⁻¹) showed excellent performance: accuracy 0.881, AUC 0.957, Brier 0.083; HIGH class precision 0.866, recall 0.904; LOW class precision 0.898, recall 0.856. Misclassifications were mostly within ±2 mol kg⁻¹ of the threshold. The model correctly classified all six top-performing (high TSN) MOFs identified by GCMC, with high confidence for five of six.
- Descriptor importance and structure–property insights:
  • Structural descriptors dominate uptake predictions: VF was the most important for CO₂ uptake; density was most important for CH₄ uptake. For TSN, a mixture of structural and energetic features mattered, notably Qf(H₂O) (a proxy for polarity/hydrophilicity) contributing more to TSN than Qf(CO₂) or Qf(CH₄) individually.
  • CO₂ uptake vs VF displayed a peak near VF ~0.8; MOFs with very high VF were often overpredicted in training and more so in the external set.
- External hypothetical MOF test set (330 MOFs):
  • RF SC CO₂ regression: MAE 2.275 vs test-set SD 3.796, R²=0.332; mid/high-loading MOFs were underpredicted, attributed to differing descriptor distributions (more very high VF structures) and feature–target relationships from training.
  • RF TSN classification: accuracy 0.712, AUC 0.850; excellent identification of LOW MOFs (recall 0.930), but lower recall for HIGH MOFs (0.372), partly due to class imbalance and distribution shift. The model confidently assigned correct labels at probabilities <0.2 or >0.8.
- Screening utility: Best-trained models filtered out over 90% of low-performing unseen MOFs, enabling substantial reduction of costly GCMC evaluations. Predicted top external MOFs often showed interpenetrated pcu frameworks, with TSN-favored structures having VF below ~0.8 and PLD in the ~4.4–6.7 Å range.
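To make the classification figures above easier to read, the following sketch shows how per-class precision, recall, and F1, overall accuracy, and the Brier score are computed from predicted probabilities. It is written in plain Python with hypothetical function names, uses a 0.5 probability cutoff as an assumption, and runs on made-up labels and probabilities that are not results from the study.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def brier_score(y_true, p_high):
    """Mean squared difference between P(HIGH) and the 0/1 outcome."""
    return sum((p - (1.0 if t == "HIGH" else 0.0)) ** 2
               for t, p in zip(y_true, p_high)) / len(y_true)


# Made-up labels and predicted P(HIGH); NOT results from the study.
y_true = ["HIGH", "HIGH", "HIGH", "LOW", "LOW", "LOW", "LOW", "LOW"]
p_high = [0.9, 0.4, 0.3, 0.1, 0.2, 0.1, 0.3, 0.6]
y_pred = ["HIGH" if p >= 0.5 else "LOW" for p in p_high]  # 0.5 cutoff assumed

print("HIGH:", class_metrics(y_true, y_pred, "HIGH"))
print("LOW:", class_metrics(y_true, y_pred, "LOW"))
print("accuracy:", accuracy(y_true, y_pred))
print("Brier:", brier_score(y_true, p_high))
```

In a screening setting, recall on the LOW class is the fraction of poor candidates that would be filtered out before GCMC, while recall on the HIGH class is the fraction of good candidates that survive the filter.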
Discussion
Training ML on a rigorously curated set of experimental MOFs enables accurate prediction of CO₂/CH₄ uptakes with simple, interpretable descriptors and supports effective classification of materials by a practical trade-off metric (TSN). The results address the core challenge of preselecting promising MOFs before running expensive simulations: RF models can accurately capture uptake behaviors, while TSN classification reliably flags likely low performers and many high performers. Feature-importance analysis clarifies governing factors: available pore volume (VF) strongly controls CO₂ uptake under 10 bar at 298 K, with an optimal VF region (~0.8) beyond which wall interactions diminish and uptake falls; for CH₄, density captures accessible packing constraints; and for TSN, polarity captured by Qf(H₂O) is a better selectivity proxy than single-gas heats of adsorption, reflecting the advantage of features that encode chemical preference for CO₂ over nonpolar CH₄. Transfer to an external, hypothetical MOF set exposed domain shift: the external set contained proportionally more very high VF structures and different feature–target relationships, leading to uptake underpredictions and reduced HIGH-class recall for TSN. This highlights the importance of aligning training and deployment domains, or enriching training data to better span the target feature space. Despite this, the models remain highly effective at eliminating low-TSN candidates, focusing simulation or experimental efforts on a smaller subset likely to contain top performers. Overall, the findings validate ML as a practical accelerator for MOF screening and provide quantitative structure–property insights (e.g., VF optima, pore-size/selectivity interactions, role of polarity) to guide materials design.
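One simple way to spot the kind of domain shift discussed above is to check, descriptor by descriptor, how many deployment samples fall outside the range seen in training. The sketch below is a crude check under stated assumptions, not the analysis used in the study: the function names are hypothetical, the min/max range criterion is a choice made here for illustration, and the descriptor values are made up.

```python
def fraction_outside_training_range(train_values, test_values):
    """Fraction of test values below the training minimum or above the maximum."""
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in test_values if v < lo or v > hi)
    return outside / len(test_values)


def shift_report(train, test):
    """Per-descriptor fraction of test samples outside the training range."""
    return {name: fraction_outside_training_range(train[name], test[name])
            for name in train}


# Made-up descriptor values (VF is dimensionless, PLD in angstrom).
train = {
    "VF": [0.30, 0.45, 0.55, 0.62, 0.70, 0.85],
    "PLD": [3.5, 4.2, 5.0, 6.1, 7.4, 9.0],
}
test = {
    "VF": [0.50, 0.70, 0.88, 0.90, 0.95],
    "PLD": [4.0, 5.5, 6.7, 8.0, 8.8],
}

report = shift_report(train, test)
for name, fraction in report.items():
    print(f"{name}: {fraction:.0%} of test samples outside the training range")
```

A range check only catches extrapolation. A test set can also sit inside the training range while being concentrated in sparsely sampled regions, so comparing histograms or quantiles of each descriptor gives a fuller picture.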
Conclusion
This work combines rigorous curation of experimental MOFs, validated GCMC simulations at relevant biogas conditions (10 bar, 298 K, 50/50 CO₂/CH₄), and interpretable ML to predict uptakes and classify MOFs by a trade-off metric (TSN). Random forest models delivered high accuracy for CO₂/CH₄ uptake and strong TSN classification on curated data, enabling rapid prescreening and substantial cost reduction compared with GCMC-only workflows. The study elucidates how structural (especially void fraction) and energetic (Qf(H₂O)) descriptors govern performance and identifies structural motifs among top candidates (e.g., narrow pores, sql or pcu topologies, 2D layering or interpenetration). Prospective tests on a curated hypothetical MOF set demonstrate good low-performer filtering and reveal domain-shift limitations that motivate future expansion and diversification of training data. Future work should broaden coverage of feature/target space (including more hypothetical structures and metals), incorporate framework flexibility where relevant, and explore additional descriptors or multi-fidelity strategies to further improve generalization. Models, code, and data are released publicly to facilitate reuse and extension.
Limitations
- Distribution shift: External hypothetical MOFs occupied different regions of descriptor space (e.g., more very high VF), degrading regression R² and HIGH-class recall in TSN classification.
- Label skew and error propagation: TSN depends on selectivity, which involves division by small CH₄ mixture loadings; small errors in CH₄ uptake propagate to large selectivity/TSN errors and skew TSN toward low values, complicating regression for high-TSN cases.
- Rigid-framework approximation: Potential MOF flexibility (e.g., layer sliding in 2D or interpenetration effects) was not modeled, which can impact adsorption behavior for some materials.
- Force-field and simulation choices: Use of UFF for frameworks, TraPPE for gases, 12.8 Å LJ cutoff (no tail corrections), eQeq charges, and chosen MC settings may introduce systematic errors relative to experiment.
- Single-condition focus: Results are for 10 bar and 298 K with 50/50 CO₂/CH₄; performance and descriptor–property relations may differ under other pressures/compositions or process conditions (PSA/TSA/membranes).
- Feature scope: Diffusivity-related descriptors were excluded to focus on thermodynamic targets; membrane separation performance was not directly modeled.
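The error-propagation limitation can be illustrated numerically. The sketch below applies the same absolute error in CH₄ mixture loading to two cases that differ only in how much CH₄ is adsorbed. It assumes a 50/50 mixture (so the mole fractions cancel in the selectivity) and a base-10 logarithm in TSN, which the text does not specify. The function names are hypothetical and the loadings are made up.

```python
import math


def tsn(q_co2, q_ch4):
    """TSN = q_CO2 * log10(q_CO2 / q_CH4) for a 50/50 mixture (base 10 assumed)."""
    return q_co2 * math.log10(q_co2 / q_ch4)


def tsn_shift(q_co2, q_ch4, delta_ch4):
    """Change in TSN when the CH4 loading is perturbed by delta_ch4."""
    return tsn(q_co2, q_ch4 + delta_ch4) - tsn(q_co2, q_ch4)


# Made-up loadings in mol/kg: identical CO2 uptake and identical CH4 error.
delta = 0.05
shift_small_ch4 = tsn_shift(5.0, 0.05, delta)  # very little CH4 adsorbed
shift_large_ch4 = tsn_shift(5.0, 1.00, delta)  # substantial CH4 adsorbed

print(f"TSN shift at q_CH4 = 0.05: {shift_small_ch4:.3f}")
print(f"TSN shift at q_CH4 = 1.00: {shift_large_ch4:.3f}")
```

Under these assumptions the shift equals −q_CO2 × log10(1 + δ/q_CH4), so it depends on the error relative to the CH₄ loading: the same absolute error moves TSN far more when q_CH4 is small, which is the regime where the highest-TSN materials tend to sit.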