Engineering and Technology
Exploring high thermal conductivity polymers via interpretable machine learning with physical descriptors
X. Huang, S. Ma, et al.
Polymers are ubiquitous due to chemical inertness, flexibility, and low weight, yet most are thermal insulators (0.1–0.5 W m⁻¹ K⁻¹), limiting heat management in increasingly power-dense organic electronics. Enhancing polymer thermal conductivity (TC) is critical, with morphology and topology (e.g., orientation and crystallinity) known to reduce phonon scattering and boost TC. Individual chains can exhibit very high or even divergent TC, suggesting chain-level design can drive macroscopic performance. Given the vast chemical space and the inefficiency of trial-and-error exploration, the study asks how interpretable machine learning with physically meaningful descriptors can predict and discover high-TC polymer chains, and how chain-level properties relate to amorphous-state heat transport.
Prior work has raised polymer TC through processing (micro-mechanical stretching, electrospinning, templating) and, in simulations, through strain, chain linking, and dihedral modulation, indicating that chain order and a larger radius of gyration favor high TC. Informatics has been applied to optical, electrical, and thermal properties, including the discovery of high-TC structures among crystalline polymers, amorphous polymers, and copolymers. Traditional molecular representations use graph descriptors (e.g., Morgan, MACCS, Mol2vec) or fragment statistics, which are effective for small molecules but less interpretable for polymers, where repeating-unit connectivity and physical independence of features are crucial. Physical descriptors from cheminformatics (e.g., Mordred) and force-field-inspired parameters (bond/angle/dihedral types and constants) have shown promise (e.g., for specific heat prediction). Feature reduction methods include filter-based correlation metrics (Pearson, Spearman, distance correlation, MIC) and wrapper methods such as recursive or sequential feature selection, each with trade-offs. There remains a need for compact, interpretable, physics-grounded descriptors and robust selection strategies tailored to polymer TC prediction.
The framework integrates data curation, molecular dynamics (MD) simulation, physical descriptor engineering with hierarchical down-selection, and ML modeling with interpretability and symbolic regression.
Data: 1,735 benchmark polymer monomers collected from literature were used to train models; candidates for virtual screening came from PoLyInfo (12,043) and PI1M (>670,000). UMAP of Morgan fingerprints confirmed benchmark coverage of candidate chemical space.
Polymer chains: Monomer SMILES strings were used to construct 1D chains of uniform 50 nm effective length (degree of polymerization determined by monomer length), with GAFF2 parameters assigned via PYSIMM.
Thermal conductivity calculation: Chain TC was computed using NEMD in LAMMPS with an enhanced heat-exchange algorithm, applying source/sink regions at chain ends to impose a constant heat flux. Systems were equilibrated (NVT, then NVE) before imposing flux; temperature profiles from the final 2–3 ns yielded TC via Fourier’s law. Cross-sectional area was estimated as van der Waals volume divided by monomer length. Amorphous polymer TC (ATC) and volumetric heat capacity Cv were computed with RadonPy pipelines.
Descriptor engineering and down-selection: 320 initial physical descriptors (286 Mordred-based and 34 MD-inspired from GAFF2 force-field parameters and MD-related quantities) were computed. Stage 1 removed low-variance descriptors (threshold 0.10), leaving 264. Stage 2 applied weighted voting across four correlation metrics (Pearson, threshold 0.05; Spearman, 0.05; distance correlation, 0.153; MIC, 0.132; each with weight 0.25) to select 53 descriptors with cumulative weight 1 (VAM). Stage 3 used Random Sequential Feature Selection (RSFS) with Random Forest (RF): descriptor orders were randomized and RF models trained over 100 cycles; descriptors surpassing a frequency threshold were retained. Balancing model MSE and feature count, a threshold of 0.34 yielded 20 optimized descriptors.
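The Stage 1 and Stage 2 filters described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several things are assumptions made for illustration: only Pearson and Spearman are implemented (distance correlation and MIC would be added to the metric list in the same way), the two metrics are given weight 0.5 each so that the cumulative weight still sums to 1, thresholds are applied to the absolute correlation value, and the function name `vote_select` and the toy descriptor names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def variance(x):
    """Population variance."""
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / len(x)

def vote_select(features, target, metrics, var_threshold=0.10, min_weight=1.0):
    """Keep descriptors that survive the variance filter and collect enough votes.

    features: dict mapping descriptor name -> list of values
    metrics:  list of (function, threshold, weight) tuples
    """
    selected = []
    for name, column in features.items():
        if variance(column) < var_threshold:      # Stage 1: low-variance filter
            continue
        score = sum(weight for func, threshold, weight in metrics
                    if abs(func(column, target)) >= threshold)  # Stage 2: voting
        if score >= min_weight - 1e-9:
            selected.append(name)
    return selected

# Toy data (hypothetical): target stands in for log2(TC).
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "area_like": [6.0, 5.1, 3.9, 3.2, 1.8, 1.1],        # strongly anti-correlated
    "constant_like": [1.0, 1.0, 1.01, 1.0, 1.0, 1.0],   # fails the variance filter
    "noise_like": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],       # zero correlation with target
}
# Thresholds 0.05 come from the text; weights 0.5 are an assumption (two metrics only).
metrics = [(pearson, 0.05, 0.5), (spearman, 0.05, 0.5)]

selected = vote_select(features, target, metrics)
print(selected)  # ['area_like']
```

Requiring the full cumulative weight means a descriptor must clear every metric's threshold, which is one reading of "cumulative weight 1"; lowering `min_weight` would turn the rule into a majority vote instead.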
Machine learning models: RF, XGBoost, and MLP were trained on log₂(TC) targets; hyperparameters were optimized via Bayesian optimization (100 iterations). Performance across down-selection stages and vs. PCA (>95% variance, ~19 components) was evaluated. Graph descriptors (Mol2vec, MACCS, Morgan, cMorgan) and hybrids with the optimized physical set were also benchmarked; DMPNN (Chemprop) was trained on these representations for comparison.
Interpretability: SHAP analysis on RF quantified feature contributions and directions.
Virtual screening: RF, XGBoost, and MLP ensembles predicted log₂(TC) across PoLyInfo and PI1M; candidates exceeding model-specific thresholds (RF ≥ 3.51, XGBoost ≥ 3.50, MLP ≥ 4.33) and appearing in at least two models were selected for MD validation.
Symbolic regression: For 107 MD-verified high-TC polymers (TC > 20 W m⁻¹ K⁻¹), genetic programming-based SR (gplearn) created compact formulas relating optimized (and derived) descriptors to log₂(TC). Pearson filtering produced 22 descriptors; formulas with R² > 0.6 and complexity ≤ 30 were retained. Pareto-front formulas were selected via Latin hypercube sampling, yielding explicit equations emphasizing cross-sectional area and dihedral stiffness.
Phonon-SED analysis: Eight representative polymers underwent SED calculations to extract acoustic-branch group velocities; volumetric heat capacities were taken from amorphous simulations, enabling mean free paths via l = k/(v_g C_v).
Linkage to amorphous state: For 58 selected high-TC chains, corresponding amorphous ATC was computed; radius of gyration and energy flux decomposition (bond, angle, dihedral, convection, nonbonded, improper) analyses elucidated intra- vs inter-chain contributions.
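The consensus rule used in the virtual screening step (model-specific log₂(TC) thresholds, with a candidate kept when at least two models agree) can be sketched as follows. The thresholds are those quoted in the text; the candidate names, the predicted values, and the helper names `passes_consensus` and `log2_to_tc` are hypothetical and serve only to illustrate the rule.

```python
# log2(TC) cut-offs per model, as quoted in the text.
THRESHOLDS = {"RF": 3.51, "XGBoost": 3.50, "MLP": 4.33}

def passes_consensus(pred_log2_tc, thresholds=THRESHOLDS, min_votes=2):
    """Return True if at least `min_votes` models predict log2(TC) at or above their cut-off."""
    votes = sum(1 for model, cut in thresholds.items()
                if pred_log2_tc[model] >= cut)
    return votes >= min_votes

def log2_to_tc(log2_tc):
    """Convert a log2(TC) value back to TC in W m^-1 K^-1."""
    return 2.0 ** log2_tc

# Hypothetical predictions for three made-up candidates.
predictions = {
    "cand_A": {"RF": 3.60, "XGBoost": 3.55, "MLP": 4.00},  # RF and XGBoost agree
    "cand_B": {"RF": 3.40, "XGBoost": 3.52, "MLP": 4.10},  # only XGBoost passes
    "cand_C": {"RF": 3.00, "XGBoost": 3.60, "MLP": 4.50},  # XGBoost and MLP agree
}

shortlist = [name for name, pred in predictions.items() if passes_consensus(pred)]
print(shortlist)  # ['cand_A', 'cand_C']

# The RF cut-off of 3.51 in log2 units corresponds to roughly 11.4 W m^-1 K^-1.
print(round(log2_to_tc(THRESHOLDS["RF"]), 1))
```

Candidates on such a shortlist would still need MD validation, as in the workflow described above; the vote only narrows the pool.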
• Descriptor reduction: From 320 physical descriptors to 20 optimized features via variance filtering, multi-metric correlation voting, and RSFS, preserving interpretability.
• Model performance: Using the 20 optimized descriptors, RF achieved training/test R² ≈ 0.87/0.84; XGBoost ≈ 0.95/0.87; MLP ≈ 0.81/0.88. These outperform models using graph descriptors (Mol2vec, MACCS, Morgan, cMorgan). PCA-derived components (>95% variance) underperformed the RSFS-selected physical features.
• Interpretability: SHAP identified cross-sectional area as the most important descriptor, negatively correlated with TC; Kd_average (average dihedral force constant) positively correlated with TC. MW_ratio (main-chain to monomer mass ratio) positively associated with TC (side-chain suppression).
• Screening results: 107 polymers with TC > 20.00 W m⁻¹ K⁻¹ were identified and validated by MD (24 from PoLyInfo, 84 from PI1M, 107 total after deduplication). Synthetic Accessibility (SA) scores ranged 1–10; 28 had SA ≤ 3.00. Example MD TCs: polyethylene (PE) 38.98 W m⁻¹ K⁻¹; poly(p-phenylene) 55.94; poly[(E)-1-fluoroethene-1,2-diyl] 52.21; polyacetylene-like []C=C[] 147.68; []N=N[] 1028.85 (noting MD limitations for Cv/mean free path for []N=N[]).
• Physical insights: High-TC candidates predominantly π-conjugated, featuring rigid backbones with overlapping p-orbitals, yielding large acoustic group velocities (>5900 m/s for six conjugated polymers). Volumetric heat capacities among eight exemplars ranged ~2.70–3.74 J cm⁻³ K⁻¹; group velocities were reduced for heavier atoms (e.g., fluorine in PTFE vs PE). Simple linear and conjugated chains showed long phonon mean free paths and high k.
• Symbolic regression: Thousands of candidate formulas were generated; Pareto-optimal equations (complexity 20–30) achieved R² > 0.70 and consistently included cross-sectional area, Kd_average, and Nd_average, enabling fast, model-agnostic TC estimation for promising candidates.
• Chain–amorphous linkage: For 58 amorphous polymers derived from high-TC chains, about half had ATC > 0.40 W m⁻¹ K⁻¹ (vs ~2.3% in a large reference set). ATC correlated positively with radius of gyration, and energy flux decomposition showed intra-chain terms (bond, angle, dihedral) dominate ATC, with dihedral contributions especially significant in π-conjugated systems. Heavier atoms (e.g., F) suppressed phonon transport and ATC.
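The mean free paths mentioned in the physical-insights finding follow from the relation l = k/(v_g C_v) given in the methods. A minimal sketch of that arithmetic, including the unit conversion for C_v, is shown below. The PE thermal conductivity is the MD value quoted above; the group velocity and heat capacity are placeholder inputs chosen for illustration (the heat capacity lies within the reported 2.70–3.74 J cm⁻³ K⁻¹ range), not values reported for PE, and the function name is hypothetical.

```python
def mean_free_path_nm(k_w_per_m_k, v_g_m_per_s, c_v_j_per_cm3_k):
    """Estimate the phonon mean free path l = k / (v_g * C_v), returned in nm.

    k_w_per_m_k:      thermal conductivity in W m^-1 K^-1
    v_g_m_per_s:      acoustic group velocity in m s^-1
    c_v_j_per_cm3_k:  volumetric heat capacity in J cm^-3 K^-1
    """
    c_v_si = c_v_j_per_cm3_k * 1.0e6           # J cm^-3 K^-1 -> J m^-3 K^-1
    l_m = k_w_per_m_k / (v_g_m_per_s * c_v_si)  # metres
    return l_m * 1.0e9                          # metres -> nanometres

k_pe = 38.98         # W m^-1 K^-1, MD value for PE quoted in the text
v_g_assumed = 6000   # m s^-1, placeholder (assumption, not a reported PE value)
c_v_assumed = 3.0    # J cm^-3 K^-1, placeholder within the reported range

l_pe = mean_free_path_nm(k_pe, v_g_assumed, c_v_assumed)
print(f"{l_pe:.2f} nm")  # 2.17 nm
```

Because l scales inversely with both v_g and C_v, the estimate is only as reliable as those two inputs, which is consistent with the caveat in the limitations that the C_v and mean free path values are approximate.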
The study addresses the challenge of efficiently discovering high-TC polymers by coupling physics-grounded descriptors with interpretable ML and validation via MD. By emphasizing physically independent and interpretable features, the models achieve high accuracy while providing mechanistic insights (e.g., the critical roles of minimizing chain cross-section and maximizing dihedral stiffness to enhance backbone heat transport). The approach demonstrates that carefully engineered physical descriptors can outperform standard graph-based fingerprints for polymer TC prediction and generalize to diverse ML architectures. The screening uncovers many π-conjugated candidates, aligning with the hypothesis that rigid, conjugated backbones support large group velocities and reduced rotational disorder. By computing ATC and structural metrics for amorphous polymers derived from selected high-TC chains, the work links chain-level properties to bulk-like behavior, showing that strong intra-chain interactions and larger rigid segments (high radius of gyration) elevate amorphous TC. Symbolic regression yields compact formulas capturing these trends, enabling quick, model-independent screening. Collectively, the results validate the pipeline as an effective route for understanding and discovering high-TC polymers and for relating chain-scale design to amorphous-state performance.
An interpretable ML framework was developed to discover high thermal conductivity polymers using compact, physics-based descriptors and high-throughput MD validation. A hierarchical feature selection reduced 320 initial descriptors to 20 highly informative ones, enabling RF, XGBoost, and MLP models to achieve test R² ≥ 0.84 and to outperform graph descriptors. SHAP analysis revealed that smaller cross-sectional area and higher dihedral stiffness promote TC. The ensemble screening identified 107 polymers with TC > 20 W m⁻¹ K⁻¹ (28 with SA ≤ 3.00), largely π-conjugated, which were validated by MD. Symbolic regression produced explicit, accurate formulas for predicting log₂TC of promising candidates without resorting to ML inference. Phonon-SED and amorphous simulations established that high group velocities, strong intra-chain interactions, and larger rigid segments underpin enhanced thermal transport in both chains and amorphous states. Future work could expand datasets for broader generalization, integrate more advanced graph neural models with physics-informed features, incorporate processing parameters (e.g., strain, alignment) into predictions, and experimentally validate top candidates via scalable fabrication routes (e.g., electrospinning, templating).
• Dataset limitations and imbalance: Few high-TC examples relative to the broader chemical space may constrain model extrapolation and bias performance metrics; DMPNN performance was limited by data volume.
• Representation scope: Although physical descriptors improved interpretability and accuracy, they may not capture all subtle structural/topological effects; some monomer connection subtleties are not fully detailed in the text.
• MD and modeling assumptions: NEMD-derived TC and SED analyses rely on classical force fields (GAFF2) and finite-size/time approximations; Cv and mean free path estimates are approximate; length-dependent divergence and finite-size effects can influence k.
• Experimental accessibility: Pure single-chain measurements are not currently feasible; several top candidates (e.g., highly conjugated/linear species) may be challenging to synthesize or process; SA scores provide only a heuristic estimate of synthetic difficulty.
• Extrapolation: Tree-based models have limited extrapolation beyond the training domain; neural networks extrapolate better but may overpredict; ensemble thresholds mitigate but do not eliminate risk.