Property-guided generation of complex polymer topologies using variational autoencoders


S. Jiang, A. B. Dieng, et al.

This study by Shengli Jiang, Adji Bousso Dieng, and Michael A. Webb applies variational autoencoders to generate complex polymer topologies tailored to specific properties. Trained on a dataset of architecturally diverse polymers, their model, TopoGNN, predicts key polymer characteristics and opens new opportunities for designing engineered polymers with machine learning.
Introduction

Polymer topology (chain architecture) strongly influences material properties, yet establishing general topology–property relationships remains difficult due to synthetic effort, characterization cost, and computational limitations. Prior experimental and simulation studies have linked topology to applications in rheology, coatings, energy storage, and biomedicine, but often focus on a narrow set of systems or a single topology class, hindering broad design rules. Generative machine learning, particularly variational autoencoders (VAEs), has shown promise in molecular and polymer design, but existing polymer-focused efforts largely target linear architectures or specific chemical spaces. The central question of this study is whether a generative VAE trained on simulation-derived data can learn an informative latent representation of diverse polymer topologies and be guided to generate new, valid topologies that achieve specified target properties (e.g., radius of gyration), enabling controlled comparisons of other properties such as rheology. The purpose is to build and assess a multi-task VAE framework that reconstructs polymer graphs, predicts polymer size, and classifies topology, and to demonstrate its utility by generating topologies matched in size but distinct in architecture to isolate topological effects on rheological behavior.

Literature Review

The study situates itself within prior work on polymer topology and ML-driven design. Traditional synthetic advances enable complex architectures (stars, combs, dendrimers, rings), while experiments and simulations have explored impacts on rheology and other properties but with limited breadth. Generative models, including VAEs, have been used for small molecules and are emerging in polymer science (e.g., guiding peptide assembly, Open Macromolecular Genome). However, most prior polymer ML work emphasizes linear chains or constrained chemical spaces. This gap motivates models that can represent, classify, and generate architecturally diverse polymers and couple topology with target properties.

Methodology

Dataset: 1342 coarse-grained polymer graphs spanning six architectures: linear, cyclic, branch (αω-branched), comb, star, and dendrimer. Each polymer has 90–100 beads. The data are split 64/16/20 (train/validation/test) with stratified sampling across classes. The dataset is limited to architectures with at most one cycle (the macrocycle in cyclic polymers) and includes no networks.
Topological descriptors: 11 graph-derived features per polymer: number of nodes, number of edges, average degree, average neighbor degree, graph density, diameter, radius, algebraic connectivity, degree centrality, betweenness centrality, and degree assortativity.
Property computation: Coarse-grained MD (Kremer–Grest model; implicit athermal solvent; Langevin dynamics; velocity-Verlet integration, timestep 0.001; friction 0.1). Single-chain simulations (2×10^7 steps; last half used for sampling) were used to compute the ensemble-averaged squared radius of gyration Rg^2 from the gyration tensor. Many-chain simulations (100 chains; periodic box; concentrations 0.1–0.8) were used for rheology: the stress relaxation modulus G(t) was obtained via the Green–Kubo relation from the stress autocorrelation and fit to a generalized Maxwell model to obtain the viscosity η=ΣG_pτ_p; the storage and loss moduli G′(ω) and G″(ω) were obtained from the Fourier transform of G(t).
Polymer representation and preprocessing: Polymers are represented as graphs G=(V,E) with adjacency matrix A∈R^{100×100}. To standardize the node count, graphs are padded to 100 nodes with “ghost” nodes (zero degree). Because all beads are equivalent, the node features X∈R^{100×100} are the adjacency vectors; there are no edge features. Each polymer is associated with an 11-D descriptor vector t, a scalar label y_r=Rg^2 for regression, and a one-hot topology label y_t for classification.
Models: Three encoder strategies share a common decoder. (i) TopoGNN: a multi-input encoder combining a Graph Isomorphism Network (GIN) encoder (2 GINConv layers producing a 32-D h_g) and a dense NN encoder for the descriptors (32-D h_t); the concatenated h∈R^{64} is passed through dense layers to produce the latent Gaussian parameters μ and log σ^2 (latent z∈R^8). (ii) GNN: graph-only encoder. (iii) Topo: descriptor-only encoder.
Decoder: A convolutional NN reconstructs the adjacency matrix from z. Two auxiliary heads operate on z: a regressor for ŷ_r (Rg^2) and a classifier for ŷ_t (topology). Symmetry of the reconstructed adjacency is enforced post hoc by A_sym = max(A, A^T).
Training objective: Composite VAE loss L = L_rec (BCE on adjacency) + λ_KL L_KL + λ_Reg L_Reg (MAE on Rg^2) + λ_Cls L_Cls (cross-entropy on topology).
Training details: Implemented in TensorFlow; 1000 epochs; Adam optimizer. Hyperparameters were explored across batch sizes {32,64,128}, learning rates {1e-4,1e-3,1e-2}, and weights λ_Reg, λ_Cls ∈ {0.01,0.1,1,10,100}. In total, 2025 combinations were evaluated, with selection via a composite evaluation score (distance to the origin after min–max normalization of the metrics 1−BACC, KL, 1−R^2, and 1−F1). The latent space was visualized with UMAP (neighborhood size 200, minimum distance 1, Euclidean metric).
Generation workflows: (a) Random generation: sample z from the prior, decode to A_gen, apply graph cleansing (remove isolated nodes, break small rings), then re-encode and validate candidates by requiring a small ΔRg^2 (<2σ^2), an unchanged topology class, and a small latent shift (MSE<1). (b) Targeted generation: identify “parent” polymers near the target Rg^2 (±2) with the desired topology; sample around the parent z with Gaussian noise (mean 0, variance 0.1); screen candidates by predicted topology and Rg^2 tolerance; cleanse and revalidate; and remove duplicates via graph isomorphism checks.
Diversity quantification: Vendi Score on Laplacian spectra (zero-padded to length 100; similarity by dot product).
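
As an illustration of the 11 graph-derived descriptors listed under "Topological descriptors", the following is a minimal sketch using NetworkX and NumPy. It is not the authors' code. The function name `topological_descriptors` is hypothetical, and averaging node-level quantities (neighbor degree, degree centrality, betweenness centrality) over all nodes is an assumption, since the summary does not say how they are reduced to scalars.

```python
import networkx as nx
import numpy as np


def topological_descriptors(adjacency):
    """Return 11 graph descriptors for one polymer adjacency matrix.

    Assumption: node-level quantities (neighbor degree, centralities) are
    averaged over nodes; the aggregation is not stated in the summary.
    """
    a = np.asarray(adjacency, dtype=float)
    # Drop zero-degree "ghost" padding nodes before measuring the graph.
    real = a.sum(axis=1) > 0
    a = a[np.ix_(real, real)]
    g = nx.from_numpy_array(a)

    # Algebraic connectivity: second-smallest eigenvalue of L = D - A.
    laplacian = np.diag(a.sum(axis=1)) - a
    fiedler = float(np.linalg.eigvalsh(laplacian)[1])

    def node_mean(per_node):
        return float(np.mean(list(per_node.values())))

    n, m = g.number_of_nodes(), g.number_of_edges()
    return {
        "num_nodes": n,
        "num_edges": m,
        "avg_degree": 2.0 * m / n,
        "avg_neighbor_degree": node_mean(nx.average_neighbor_degree(g)),
        "density": nx.density(g),
        "diameter": nx.diameter(g),
        "radius": nx.radius(g),
        "algebraic_connectivity": fiedler,
        "degree_centrality": node_mean(nx.degree_centrality(g)),
        "betweenness_centrality": node_mean(nx.betweenness_centrality(g)),
        "degree_assortativity": nx.degree_assortativity_coefficient(g),
    }


# Toy example: a 5-bead linear chain padded to 8 nodes with ghost nodes.
chain = np.zeros((8, 8))
for i in range(4):
    chain[i, i + 1] = chain[i + 1, i] = 1
desc = topological_descriptors(chain)
```

The sketch assumes a connected polymer graph once ghost nodes are removed, because diameter and radius are undefined otherwise. Degree assortativity is also undefined (NaN) for graphs such as a perfect star, where every edge joins the same pair of degrees.
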
Rheology analysis: For selected size-matched topologies (Rg^2≈30±2), compute viscosity vs concentration and frequency-dependent moduli to isolate topology effects independent of dilute-solution size.
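
To make the generalized Maxwell relations concrete, here is a small NumPy sketch of the quantities named above: G(t) as a sum of exponential modes, the viscosity η=ΣG_pτ_p, and the storage and loss moduli. The summary states that the moduli come from the Fourier transform of G(t); for a Maxwell fit, that transform has the standard closed form used below. The function names and the two-mode parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def maxwell_relaxation(t, g_modes, tau_modes):
    """G(t) = sum_p G_p exp(-t / tau_p) for a generalized Maxwell model."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    g = np.asarray(g_modes, dtype=float)
    tau = np.asarray(tau_modes, dtype=float)
    return np.sum(g * np.exp(-t[:, None] / tau), axis=1)


def maxwell_viscosity(g_modes, tau_modes):
    """Zero-shear viscosity eta = sum_p G_p tau_p (time integral of G(t))."""
    return float(np.dot(g_modes, tau_modes))


def maxwell_moduli(omega, g_modes, tau_modes):
    """Closed-form storage G'(w) and loss G''(w) moduli of the Maxwell fit."""
    omega = np.atleast_1d(np.asarray(omega, dtype=float))
    g = np.asarray(g_modes, dtype=float)
    tau = np.asarray(tau_modes, dtype=float)
    wt = omega[:, None] * tau  # omega * tau_p for every frequency and mode
    storage = np.sum(g * wt**2 / (1.0 + wt**2), axis=1)
    loss = np.sum(g * wt / (1.0 + wt**2), axis=1)
    return storage, loss


# Hypothetical two-mode fit (values invented for illustration only).
g_p = [2.0, 0.5]
tau_p = [0.5, 10.0]
eta = maxwell_viscosity(g_p, tau_p)
omega = np.logspace(-3, 2, 200)
g_storage, g_loss = maxwell_moduli(omega, g_p, tau_p)
```

For a single mode, G′ and G″ cross at ω=1/τ; additional modes can produce the multiple crossover frequencies discussed for branched architectures.
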

Key Findings
  • Multi-task VAE performance: On validation, TopoGNN achieved Balanced Accuracy 0.9439, Regression R^2 0.9915, F1 0.9953, KL 18.7244, giving the smallest composite distance (0.3829). GNN: BACC 0.9448, R^2 0.9634, F1 0.9768, KL 15.6018, distance 0.8348. Topo: BACC 0.9281, R^2 0.9949, F1 0.9953, KL 16.0418, distance 0.3992.
  • Test-set trends: Reconstruction balanced accuracy highest for GNN (0.9397), followed by TopoGNN (0.9369) and Topo (0.9164). For Rg^2 regression, TopoGNN best (mean R^2 0.9920), surpassing Topo (0.9854) and GNN (0.9639). Mean F1 highest for GNN (0.9783), with TopoGNN (0.9689) and Topo (0.9678) comparable.
  • Latent space structure: Auxiliary tasks (Rg^2 regression and topology classification) produce an interpretable latent space where increasing Z2 monotonically increases Rg^2, while changing Z1 transitions among topology clusters with minor Rg^2 variation. Removing auxiliary tasks disrupts this organization.
  • Diversity: Generated topologies are more diverse than the handcrafted dataset. Vendi Score: original dataset (1342) = 2.0968; TopoGNN-generated (1342) = 5.0684 (exceeding GNN 4.9580 and Topo 4.3305).
  • Property-guided generation: Targeting Rg^2 ranges 7.5±2, 30±2, 50±2 yields distinct sets of valid polymers that match targets upon MD validation. Low Rg^2 predominantly yields dendrimers and stars; mid-range yields branch, comb, cyclic, and star; high range mostly branch and comb. Some architectures (e.g., dendrimers) are constrained to lower sizes by bead-count limits; few polymers meet the highest target due to sparse training data near Rg^2≈50.
  • Saliency/correlations: Graph diameter, betweenness centrality, and algebraic connectivity most strongly influence Rg^2 predictions, consistent with direct correlations.
  • Rheology at matched size (Rg^2≈30±2): Cyclic polymers show lower viscosity at higher concentrations due to reduced entanglement (no free ends); αω-branched polymers show higher viscosities due to long side chains; star and comb exhibit similar but somewhat lower viscosities than αω-branched, reflecting side-chain placement and density effects. Frequency-dependent moduli show multiple crossover frequencies for star/branch/comb at higher concentrations, while cyclic systems tend to maintain a single crossover, indicating less nuanced viscoelastic behavior.
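
The Vendi Score used in the diversity comparison above can be sketched as follows. The score is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n×n similarity matrix with unit diagonal. This sketch assumes that each zero-padded Laplacian spectrum is scaled to unit norm before taking dot products, so that K has a unit diagonal; the summary only says "similarity by dot product", so this normalization, the descending sort, and the function names are all assumptions.

```python
import numpy as np


def laplacian_spectrum(adjacency, length=100):
    """Laplacian eigenvalues, sorted descending and zero-padded to `length`."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    eigenvalues = np.sort(np.linalg.eigvalsh(laplacian))[::-1]
    padded = np.zeros(length)
    padded[: len(eigenvalues)] = eigenvalues
    return padded


def vendi_score(features):
    """exp(entropy) of the eigenvalues of K/n, with K a cosine-similarity kernel."""
    x = np.asarray(features, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # assumed normalization
    kernel = x @ x.T
    eig = np.linalg.eigvalsh(kernel / len(x))
    eig = eig[eig > 1e-12]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eig * np.log(eig))))


def path_graph(n):
    a = np.zeros((n, n))
    for i in range(n - 1):
        a[i, i + 1] = a[i + 1, i] = 1
    return a


def ring_graph(n):
    a = path_graph(n)
    a[0, n - 1] = a[n - 1, 0] = 1
    return a


def star_graph(n):
    a = np.zeros((n, n))
    a[0, 1:] = a[1:, 0] = 1
    return a


# Three copies of one graph versus three different 4-node topologies.
same = [laplacian_spectrum(path_graph(4)) for _ in range(3)]
mixed = [laplacian_spectrum(g) for g in (path_graph(4), ring_graph(4), star_graph(4))]
same_score = vendi_score(same)
mixed_score = vendi_score(mixed)
```

The score ranges from 1 (all items identical) to n (all items mutually dissimilar), so it can be read as an effective number of distinct topologies in the set.
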
Discussion

The study demonstrates that combining graph-explicit features with graph-derived topological descriptors in a multi-task VAE yields a physically meaningful latent representation across diverse polymer architectures. The auxiliary tasks (Rg^2 regression and topology classification) guide the latent structure so that size and architecture vary along distinct directions, enabling property-guided exploration. This organization allows controlled generation of polymers that meet a target size while varying topology, thereby decoupling dilute-solution size from topology to isolate topological effects on rheology. The generated sets, validated by MD, confirm that the model can propose novel, diverse, and valid topologies beyond the original dataset, including irregular dendrimers and nuanced branching patterns. Rheological analyses at comparable Rg^2 reveal expected and nuanced architecture-dependent behaviors (e.g., lower viscosities for cyclic, elevated for αω-branched), underscoring the utility of the framework for designing polymers to tune viscoelastic response. Overall, the findings address the research question by showing VAEs can effectively reconstruct, classify, and generate complex polymer topologies with target properties and can serve as engines for hypothesis-driven studies across topology classes.

Conclusion

This work introduces a multi-task VAE (TopoGNN) trained on coarse-grained MD data to encode, reconstruct, classify, and generate complex polymer topologies conditioned on target properties (Rg^2). TopoGNN outperforms graph-only or descriptor-only models overall, learns an interpretable latent space, and generates valid, diverse topologies that expand beyond the handcrafted dataset. By producing size-matched yet architecturally distinct polymers, the approach enables controlled studies of topology-dependent properties such as viscosity and viscoelastic moduli. Future directions include: extending targets beyond Rg^2 to additional properties; broadening molecular-weight ranges and exploring transferability (e.g., via string-based representations); incorporating compositional/chemical heterogeneity; improving CG model fidelity and dynamical consistency; integrating hydrodynamic interactions where relevant; and representing and generating ensembles that reflect synthetic realities. These advances will further bridge ML-driven topology design with experimentally synthesizable polymer systems.

Limitations
  • Dataset scope: Polymers have 90–100 beads; linear and cyclic classes include only 11 distinct examples each; architectures limited to at most one cycle; no polymer networks.
  • Model physics: Coarse-grained Kremer–Grest model with implicit athermal solvent neglects hydrodynamic interactions; CG parameter transferability and dynamical consistency can be limited across conditions.
  • Chemical/compositional simplicity: All beads are equivalent (chemically homogeneous); edge features omitted; compositional complexity not addressed.
  • Generation constraints: Molecular-weight restriction limits achievable Rg^2 for some architectures (e.g., dendrimers mainly at low sizes); relatively few candidates meet the highest Rg^2 target due to data sparsity; adjacency symmetry not enforced during training (only post hoc).
  • Synthetic realism: Models feature precisely defined architectures, whereas real synthesis yields ensembles; future work must address representation and prediction for realistic structural distributions.
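
For reference, the ghost-node padding and the post hoc symmetrization A_sym = max(A, A^T) mentioned in this summary can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the decoded matrix is assumed to be already binarized.

```python
import numpy as np


def pad_adjacency(adjacency, size=100):
    """Pad an n x n adjacency matrix with zero-degree ghost nodes up to size."""
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    padded = np.zeros((size, size))
    padded[:n, :n] = a
    return padded


def symmetrize(decoded):
    """Apply A_sym = max(A, A^T) element-wise to a decoded adjacency matrix."""
    a = np.asarray(decoded, dtype=float)
    return np.maximum(a, a.T)


def drop_isolated_nodes(adjacency):
    """Remove zero-degree nodes (ghost padding) from a symmetric adjacency."""
    a = np.asarray(adjacency, dtype=float)
    keep = a.sum(axis=1) > 0
    return a[np.ix_(keep, keep)]


# Toy decoded output: a 3-bead chain where one bond appears in one direction only.
decoded = np.zeros((5, 5))
decoded[0, 1] = decoded[1, 0] = 1  # bond predicted in both directions
decoded[1, 2] = 1                  # bond predicted in one direction only
a_sym = symmetrize(decoded)
cleaned = drop_isolated_nodes(a_sym)
```

Taking the element-wise maximum keeps a bond if the decoder predicts it in either direction, which is why symmetry needs no constraint during training in this scheme.
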