
Chemistry
Representation of molecular structures with persistent homology for machine learning applications in chemistry
J. Townsend, C. P. Micucci, et al.
Discover how cutting-edge machine learning techniques are transforming the screening of functionalized molecules and materials! This innovative research, conducted by Jacob Townsend, Cassie Putman Micucci, John H. Hymel, Vasileios Maroulas, and Konstantinos D. Vogiatzis, showcases a unique molecular representation based on persistent homology, leading to the identification of promising CO₂-selective molecules from a vast database of organic compounds.
Introduction
The increasing concentration of greenhouse gases has been identified as a primary driver of many facets of environmental degradation, including higher global temperatures, rising sea levels, increased ocean acidity, and more frequent extreme weather events. CO₂ is the most prominent greenhouse gas; its atmospheric concentration has exceeded 400 ppm, more than a 40% increase over pre-industrial levels, and could contribute to a rise in global temperatures of more than 2 °C by the year 2100. Lowering CO₂ emissions is therefore essential to meet ambitions of limiting warming to 1.5 °C, and advances in carbon capture and storage technology are needed to reach these goals. At the industrial level, liquid amine-based solvents are used to separate and capture CO₂ via chemisorption, but the solvent-regeneration step is energy intensive. Membrane-based technologies offer an alternative, cost-effective route to CO₂ separation by exploiting weaker noncovalent interactions. Materials explored include polymeric membranes and crystalline porous materials such as metal–organic frameworks (MOFs) and zeolites.
Understanding how atomistic structure affects gas selectivity, particularly for CO₂/N₂ separation, is crucial. Functional groups that selectively interact with CO₂ can improve membrane performance. However, the number of potential CO₂-philic groups is vast, making exhaustive ab initio studies impractical. High-throughput computational screening and machine learning (ML) can accelerate discovery by learning structure–property relationships and predicting properties at reduced cost. The efficiency of ML depends critically on the molecular representation used to encode structures into machine-readable features. Established representations include the Coulomb matrix, Bag-of-Bonds, FCHL, and SOAP, each with trade-offs in accuracy and computational cost. Here, a new molecular representation based on persistent homology is proposed to encode 3D structural information into 2D persistence images that, when chemically informed, can enhance predictive performance and provide constant-size feature vectors desirable for generalization. The representation is applied to identify CO₂-philic functional groups by predicting CO₂ and N₂ interaction energies.
Literature Review
Several molecular representations have been developed for chemical ML tasks. The Coulomb matrix (CM) encodes atom–atom interactions via diagonal atomic self-energies and off-diagonal Coulombic terms. Bag-of-Bonds (BoB) improves upon CM by grouping pairwise interactions by element pairs and sorting them. FCHL uses Gaussian distribution functions within a kernel ridge regression framework for quantum ML. SOAP represents local atomic environments via atomic density overlaps but at increased computational cost compared to pairwise descriptors. Recent advances in molecular representations have improved ML performance for predicting quantum properties, structure–function relations, and reaction barriers, highlighting the importance of representation choice for model accuracy and generalization.
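For context, the Coulomb matrix mentioned above is simple to write down: diagonal entries approximate free-atom energies as 0.5 Z²·⁴, and off-diagonal entries are Coulombic terms Zᵢ Zⱼ / rᵢⱼ. A minimal stdlib sketch (the row-norm sorting used to make the matrix permutation-invariant is omitted):

```python
import math

def coulomb_matrix(charges, coords):
    """Build the Coulomb matrix for a molecule.

    charges: list of nuclear charges Z_i
    coords:  list of (x, y, z) positions in angstroms
    """
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: polynomial fit of the free-atom energy
                M[i][i] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j
                M[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return M
```

Bag-of-Bonds then reorganizes these pairwise terms into element-pair "bags" and sorts within each bag, trading the fixed matrix layout for better permutation invariance.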
Methodology
The method uses persistent homology to construct persistence diagrams (PDs) from molecular geometries and then vectorizes them into chemically driven persistence images (PIs).
1) Persistence diagram construction: For a given molecule, spheres are centered at each atom with radius initially zero. As the radius increases, spheres intersect, creating and merging homological features. Zero-dimensional features (connected components) encode interatomic distances; one-dimensional features (holes) encode rings and functional groups. Each feature has a birth (x-axis) and persistence (lifetime, y-axis) recorded in the PD. Example: in anisole, connected components corresponding to C–H (~1.1 Å), C–O, and C–C (~1.4 Å) are identified; holes form for the phenyl ring and methoxy-related structure as spheres overlap.
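For zero-dimensional features, the sphere-growing filtration in step 1 has a compact algorithmic form: every atom starts as its own connected component, and a component dies when the growing spheres first merge it into an older one. This is single-linkage clustering, i.e. Kruskal's minimum-spanning-tree algorithm over pairwise distances. A stdlib sketch under a Vietoris–Rips-style convention (whether death occurs at the distance or half the distance depends on the filtration convention, an assumption here):

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Zero-dimensional persistence pairs (birth, death) for a point cloud:
    every component is born at 0 and dies when it merges with another,
    at the length of the corresponding minimum-spanning-tree edge."""
    parent = list(range(len(points)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    pairs = []
    for d, i, j in edges:             # Kruskal: process edges by length
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, d))    # one component dies at distance d
    return pairs  # n-1 finite pairs; the final component persists forever
```

In the anisole example, the short-lived pairs would cluster near the C–H (~1.1 Å) and C–C/C–O (~1.4 Å) bond lengths.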
2) Vectorization to persistence images: A PI is formed by placing Gaussian kernels at each PD point; pixel intensities correspond to multiplicities (e.g., counts of equivalent bonds). This yields a stable, tractable, two-dimensional image encoding molecular geometry.
3) Chemical embedding: Purely topological PIs cannot distinguish molecules with similar geometries but different element types (e.g., HBr vs F₂ with similar bond lengths). To incorporate chemistry, the variance of the Gaussian kernels is modulated by atom-type information—specifically, by the pairwise electronegativity difference for connected components. Larger electronegativity differences yield larger kernel variance, differentiating polar from nonpolar bonds and producing unique PIs for chemically distinct molecules with similar geometries.
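Steps 2 and 3 together amount to rasterizing the diagram with Gaussians whose width depends on atom types. A toy sketch, assuming Pauling electronegativities and illustrative scaling constants (`base_sigma` and `en_scale` are assumptions, not the paper's parameters):

```python
import math

# Pauling electronegativities (illustrative subset)
EN = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98, "Br": 2.96}

def persistence_image(pd_points, elem_pairs, res=20, extent=3.0,
                      base_sigma=0.1, en_scale=0.1):
    """Rasterize a persistence diagram into a chemically driven
    persistence image: each (birth, persistence) point contributes a
    Gaussian whose variance grows with the electronegativity difference
    of the atom pair that created the feature."""
    step = extent / res
    img = [[0.0] * res for _ in range(res)]
    for (b, p), (e1, e2) in zip(pd_points, elem_pairs):
        sigma = base_sigma + en_scale * abs(EN[e1] - EN[e2])
        for iy in range(res):
            for ix in range(res):
                x = (ix + 0.5) * step   # birth axis
                y = (iy + 0.5) * step   # persistence axis
                img[iy][ix] += math.exp(-((x - b) ** 2 + (y - p) ** 2)
                                        / (2 * sigma ** 2))
    return img
```

With this embedding, the HBr/F₂ example from step 3 yields distinct images despite nearly identical bond lengths, because the polar H–Br pair gets a broader kernel than the nonpolar F–F pair.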
4) Similar-size representation: Empirically, PD/PI extents remain of the same order across system sizes (e.g., anisole 3 Å × 3 Å, tert-butylcalixarene ~4 Å × 4 Å, SARS-CoV-2 main protease complex ~6 Å × 6 Å), enabling a constant-size vectorization irrespective of atom count, which is beneficial for generalization and computational efficiency.
5) Data and geometries: To avoid DFT geometry optimization bottlenecks in high-throughput settings, initial 3D geometries are generated using OpenBabel (gen3d). Target properties are interaction energies with CO₂ and N₂ computed by DFT for training/validation.
6) Model training and comparison: For an initial set of 100 diverse organic molecules, multiple representations (PI, CM, BoB, FCHL, SOAP) are generated and paired with standard ML algorithms (random forest, Gaussian process regression, kernel ridge regression). Hyperparameters are optimized (see supplementary details). Two models per representation are trained (one each for CO₂ and N₂ interaction energies). Performance is evaluated via 10-fold cross-validated RMSE.
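The 10-fold cross-validated RMSE protocol used for these comparisons can be sketched generically; `fit` and `predict` are placeholders for any of the representation/algorithm pairings above:

```python
import math
import random

def kfold_rmse(X, y, fit, predict, k=10, seed=0):
    """k-fold cross-validated RMSE: shuffle, split into k folds, train
    on k-1 folds, score the held-out fold, and average over folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    rmses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_err = [(predict(model, X[i]) - y[i]) ** 2 for i in fold]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(rmses) / len(rmses)
```

The spread of the per-fold RMSE values is what the study reports as the variance (confidence) of each representation's performance.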
7) Active learning: To address training-set bias and cover GDB-9 chemical space, an iterative active-learning loop is performed per representation. In each iteration, the top 40 molecules predicted to have the strongest CO₂ interactions are selected from GDB-9, evaluated with DFT, and added to the training set. Three iterations expand the dataset from 100 to 220 molecules per representation.
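The active-learning loop in step 7 reduces to a greedy select–label–retrain cycle. A schematic sketch, with `dft_energy` standing in for the DFT oracle (all function names here are illustrative):

```python
def active_learning(train_X, train_y, pool, fit, predict, dft_energy,
                    iterations=3, batch=40):
    """Greedy active learning: retrain, rank the candidate pool by
    predicted CO2 interaction energy (most negative first), label the
    top `batch` candidates with DFT, and add them to the training set."""
    for _ in range(iterations):
        model = fit(train_X, train_y)
        # Most negative predicted interaction energy = strongest binder
        ranked = sorted(pool, key=lambda mol: predict(model, mol))
        picked, pool = ranked[:batch], ranked[batch:]
        train_X += picked
        train_y += [dft_energy(mol) for mol in picked]
    return train_X, train_y
```

Three iterations of 40 molecules each account for the growth of the training set from 100 to 220 molecules per representation.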
8) Large-scale screening: Using the final models trained on 220 molecules, the entire GDB-9 database (133,885 molecules with up to nine heavy atoms) is screened for both CO₂ and N₂ interaction energies. Predicted top candidates are validated with DFT. Computational efficiency is benchmarked (e.g., SOAP vs PI runtime on CPU).
Key Findings
- Predictive accuracy on initial 100-molecule set (10-fold CV RMSE for CO₂ interaction energies): CM 0.63 kcal mol⁻¹; BoB 0.52 kcal mol⁻¹; FCHL 0.50 kcal mol⁻¹; PI with kernel ridge regression (Laplacian kernel) 0.44 kcal mol⁻¹; SOAP with kernel ridge regression (linear kernel) 0.41 kcal mol⁻¹, but PI showed tighter RMSE variance (higher confidence). Similar performance trends were observed for N₂.
- Active learning effectiveness (PI representation): Number of promising structures (CO₂ interaction energy < −6.0 kcal mol⁻¹) increased from 10 (first iteration) to 43 (second) and 75 out of 120 (third) among the newly added molecules, indicating systematic enrichment of strong binders.
- GDB-9 screening outcomes: All methods predict most molecules have CO₂ interaction energies between −3.0 and −5.0 kcal mol⁻¹ and N₂ near −2.0 kcal mol⁻¹. Only the PI-based model identified 4,5-diamino-1H-imidazol-2-ol among the strongest CO₂ binders (DFT CO₂ interaction energy −7.41 kcal mol⁻¹). Overall, 44 molecules were identified by PI screening with CO₂ interaction energies stronger than −6.5 kcal mol⁻¹; DFT validations confirmed these findings.
- Selectivity insights: Cooperative effects between N-containing heterocycles and ortho-positioned amino or hydroxyl groups enhance CO₂ affinity via induced-dipole interactions and hydrogen bonding.
- Computational efficiency: Screening the full GDB-9 database required 88,287 s with SOAP versus 2,219 s with PI on an Intel i5-4278U CPU (~40× faster for PI), while maintaining strong predictive performance.
- Data quality transfer: Training CM, BoB, or SOAP models on molecules identified via PI active learning yielded more concise prediction distributions and correctly highlighted top CO₂-philic candidates, indicating that PI-driven active learning provides higher-quality training data for multiple representations.
Discussion
The study addresses the challenge of efficiently and accurately predicting gas–molecule interaction energies by introducing a chemically informed, topology-based molecular representation. By embedding electronegativity differences into persistence images, the method preserves geometric-topological information while capturing elemental chemistry, enabling discrimination between molecules with similar geometries but different bonding character. This representation, coupled with active learning, effectively discovers rare CO₂-philic molecules in a vast chemical space (GDB-9), aligning with the goal to identify structures with strong CO₂ and weak N₂ interactions for membrane-based separations. The improved generalization from constant-size features and robustness to conformational variability contribute to lower prediction uncertainty. Empirically, PI-based models achieved low RMSE with tighter variance, identified top-performing candidates (e.g., 4,5-diamino-1H-imidazol-2-ol, −7.41 kcal mol⁻¹), and demonstrated superior computational efficiency compared to SOAP, facilitating practical large-scale screens. The findings suggest PI representations can streamline discovery workflows and serve as strong general-purpose descriptors for broader chemical ML tasks.
Conclusion
A novel, chemically driven persistence image representation was introduced and validated for predicting DFT-quality CO₂/N₂ interaction energies. The approach combines persistent homology with electronegativity-informed kernel variance to yield concise, constant-size, and computationally efficient molecular fingerprints. In head-to-head comparisons, PI achieved competitive or superior accuracy with tighter uncertainty than alternative representations and enabled rapid screening of GDB-9, uncovering 44 strong CO₂-binding molecules (interaction energies stronger than −6.5 kcal mol⁻¹) and specific binding motifs (e.g., N-heterocycles with ortho amino/hydroxyl groups). Active learning with PI systematically enriched the training set with rare strong binders, improving model reliability across iterations. Given its robustness and scalability, the PI representation is poised for application to larger supermolecular systems and other domains, including catalysis and lanthanide/actinide separations. Future work includes incorporating additional PI features (e.g., intensity normalization) and leveraging data from smaller molecules to predict properties of larger systems.
Limitations
- The initial training set (100 molecules) was biased toward N-containing heterocycles with small functionalizations, necessitating active learning to adequately cover the GDB-9 chemical space.
- Purely topological PIs cannot distinguish element identities; the method relies on a domain-specific augmentation (electronegativity-difference-based variance) to encode chemistry. The choice and parameterization of such chemical embeddings may influence performance.
- While the workflow avoids DFT geometry optimizations by using OpenBabel-generated structures for input, high-level DFT calculations are still required for labeling during model training and validation, which can be a bottleneck when expanding datasets beyond the sizes explored here.
- Detailed computational settings for DFT and ML hyperparameters are provided in the Supplementary Information; model performance may vary with these choices and between property domains beyond CO₂/N₂ interactions.