Chemistry
TransPolymer: a Transformer-based language model for polymer property predictions
C. Xu, Y. Wang, et al.
The study addresses the challenge of accurately and efficiently predicting polymer properties, a critical need for applications such as electrolytes, optoelectronics, and energy storage. Traditional experimental and simulation methods are costly and slow. Prior machine learning approaches include fingerprint-based models, CNNs, RNNs, and GNNs. However, GNNs require explicit structural/conformational information and struggle with variable degrees of polymerization; sequence models with RNNs often fail to capture long-range dependencies in polymer sequences. Despite the success of Transformer-based language models in NLP and small-molecule chemistry, they have not been applied to polymers, which pose unique challenges: standard SMILES does not capture polymer structure, degree of polymerization, or measurement conditions; input sequences must encode monomer identities and arrangements; and labeled data are scarce. The research question is whether a Transformer-based language model with a chemically-aware tokenizer and self-supervised pretraining can learn effective representations from polymer sequences to deliver state-of-the-art performance on diverse polymer property prediction tasks and provide insights via attention mechanisms.
The paper reviews: (1) Traditional fingerprint approaches (e.g., ECFP) adapted to polymers; (2) Deep learning methods in polymers: CNNs for mechanical properties of polymer-CNT interfaces, GNNs/GCNNs for thermal and mechanical properties, and ensemble GNN methods for the electron affinity (EA) and ionization potential (IP) of conjugated polymers. GNNs require explicit structural/conformational info and have difficulties representing polymers with variable polymerization degrees. (3) Sequence-based models: LSTM/RNNs applied to coarse-grained polymer genomes, copolymers, and morphology predictions; SMILES-based polymer encodings (e.g., BigSMILES) used with LSTMs for Tg prediction; but RNNs struggle with long-range dependencies. (4) Transformer successes in NLP and chemistry: BERT/RoBERTa/GPT/ELMo/XLM pretraining, SMILES-BERT and ChemBERTa for molecular properties, Transformer for reaction prediction, and structure-agnostic applications in materials (e.g., MOFs). The gap identified is the lack of Transformer-based models tailored to polymers with sequence representations encoding polymer-specific descriptors and composition/arrangement information.
Model: TransPolymer combines a chemically-aware tokenizer, a Transformer encoder (RoBERTa-like), and a one-layer MLP regressor head. The encoder has 6 hidden layers with 12 attention heads per layer. The special token at the start of each sequence (RoBERTa's '<s>', the analogue of BERT's [CLS]) provides the pooled representation fed to the regressor.
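For concreteness, a minimal sketch of this architecture with Hugging Face transformers follows. The 6 layers and 12 heads come from the paper; the hidden size (RoBERTa's default 768), the placeholder vocabulary size, and the exact head width are assumptions.

```python
import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

# Encoder per the paper: 6 hidden layers, 12 attention heads per layer.
# hidden_size=768 is the RoBERTa default (assumed); vocab_size is a placeholder
# to be set by the chemically-aware tokenizer.
config = RobertaConfig(
    vocab_size=50_000,
    num_hidden_layers=6,
    num_attention_heads=12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

class TransPolymerRegressor(nn.Module):
    """RoBERTa-like encoder pooled at the leading <s> token, plus an MLP head."""

    def __init__(self, config: RobertaConfig):
        super().__init__()
        self.encoder = RobertaModel(config)
        self.head = nn.Sequential(              # one-layer MLP head with SiLU
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.SiLU(),
            nn.Linear(config.hidden_size, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]    # representation of the <s> token
        return self.head(pooled).squeeze(-1)
```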
Polymer tokenization: Repeating units are represented by SMILES with '*' indicating polymerization points; '.' separates constituents (e.g., in copolymers or mixtures); 'A' can indicate branches. The tokenizer recognizes chemical tokens (elements like 'Si' as single tokens), descriptor values (e.g., temperature, degree of polymerization, polydispersity, chain conformation), and special separators as single tokens. Copolymers include SMILES of each repeating unit, their ratios, and arrangement. Mixtures/materials concatenate component sequences plus material-level descriptors. Missing descriptor values are encoded with descriptor-specific NAN tokens (e.g., 'NAN_Tg'). Descriptor values are discretized and tokenized.
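The paper's exact token inventory is not reproduced here, but a regex-based tokenizer in the spirit described might look like the sketch below; the pattern, the (non-exhaustive) element list, and the example sequence are illustrative assumptions.

```python
import re

# Chemically-aware tokenization (a sketch, not the paper's actual regex):
# multi-character elements such as 'Si', 'Cl', 'Br' stay single tokens, as do
# polymerization points '*', separators '.', and descriptor-specific NAN tokens.
PATTERN = re.compile(
    r"NAN_\w+"                  # missing-descriptor tokens, e.g. NAN_Tg
    r"|Si|Cl|Br|Se|Na|Li"       # multi-character elements (non-exhaustive)
    r"|\[[^\]]+\]"              # bracketed atoms, e.g. [nH], [Si]
    r"|%\d{2}"                  # two-digit ring-closure labels
    r"|\d+\.\d+|\d+"            # discretized descriptor values and ring indices
    r"|[A-Za-z*().=#$/\\@+-]"   # single-character atoms, bonds, branches, '*', '.'
)

def tokenize(sequence: str) -> list[str]:
    """Split a polymer sequence into chemically meaningful tokens."""
    return PATTERN.findall(sequence)

# Example: a polystyrene-like repeating unit plus a missing-Tg placeholder.
print(tokenize("*CC(*)c1ccccc1.NAN_Tg"))
```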
Datasets: Pretraining uses PI1M (~1M polymer-like sequences generated to mirror the PolyInfo space), augmented to ~5M via SMILES augmentation (non-canonical renumberings). Finetuning evaluates ten benchmarks covering: polymer electrolyte conductivity (PE-I, PE-II), electronic properties (Egc: chain bandgap; Egb: bulk bandgap; Eea: electron affinity; Ei: ionization energy), crystallization tendency (Xc), dielectric constant (EPS), refractive index (Nc), and OPV power conversion efficiency (PCE). Inputs vary: some datasets use only polymer SMILES; others also include materials descriptors (e.g., temperatures, ratios). Data splits follow the original works: PE-I uses a year-based split; the others use 5-fold cross-validation.
Data augmentation: SMILES augmentation (RDKit) removes canonicalization, reindexes atoms, reconstructs grammatically correct SMILES preserving isomerism and avoiding Kekulization. Duplicates removed. Applied to pretraining (5x per entry, ~5M total) and to training partitions of downstream datasets (with limits for long-SMILES datasets to prevent distribution shifts and manage compute).
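A minimal sketch of this augmentation step with RDKit, assuming random atom renumbering via `doRandom`; the variant count and retry logic are illustrative.

```python
from rdkit import Chem

def augment_smiles(smiles: str, n_variants: int = 5, max_tries: int = 50) -> list[str]:
    """Generate non-canonical SMILES renumberings (a sketch of the augmentation
    described above). Duplicates are dropped via the set; stereochemistry is kept."""
    mol = Chem.MolFromSmiles(smiles)   # '*' dummy atoms parse fine in RDKit
    if mol is None:
        return []
    variants = {smiles}                # keep the original; set removes duplicates
    for _ in range(max_tries):
        if len(variants) > n_variants:
            break
        variants.add(Chem.MolToSmiles(
            mol,
            canonical=False,           # drop canonical atom ordering
            doRandom=True,             # random atom renumbering
            isomericSmiles=True,       # preserve isomerism/stereochemistry
            kekuleSmiles=False,        # avoid Kekulization, keep aromatic form
        ))
    return sorted(variants)

print(augment_smiles("*CC(*)c1ccccc1"))  # polystyrene-like repeating unit
```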
Pretraining (MLM): 15% tokens selected; of these, 80% masked, 10% random tokens, 10% unchanged. Optimizer: AdamW (lr 5e-5, betas 0.9/0.999, eps 1e-6, weight decay 0), linear scheduler with 0.05 warmup ratio, batch size 200, dropout (hidden/attention) 0.1, 30 epochs. Best validation model retained. Training time ~3 days on two RTX 6000 GPUs. Pretraining/validation split 80/20.
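The 15%/80%/10%/10% corruption scheme is standard BERT-style masked language modeling; a self-contained sketch follows. The special-token handling and optimizer wiring mirror the stated hyperparameters, while the step count and the use of RobertaForMaskedLM over the config from the architecture sketch are assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """Select 15% of tokens; of those, 80% -> mask token, 10% -> random token,
    10% left unchanged. Labels are -100 wherever the loss should be ignored."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    for sid in special_ids:                      # never corrupt <s>, </s>, <pad>
        prob[input_ids == sid] = 0.0
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100                     # loss computed on selected only

    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id            # 80% of selected -> [MASK]

    random_tok = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                  & selected & ~masked)          # 10% of selected -> random token
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
    return input_ids, labels                     # remaining 10% stay unchanged

mlm_model = RobertaForMaskedLM(config)           # config from the sketch above
num_steps = 30 * (5_000_000 // 200)              # 30 epochs, ~5M sequences, batch 200
optimizer = AdamW(mlm_model.parameters(), lr=5e-5, betas=(0.9, 0.999),
                  eps=1e-6, weight_decay=0.0)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.05 * num_steps), num_training_steps=num_steps)
```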
Finetuning: Pretrained encoder + one-layer MLP head (SiLU activation). Optimizer AdamW (betas 0.9/0.999, eps 1e-6, weight decay 0.01). Different learning rates for encoder/head; layer-wise learning rate decay (LLRD) applied in some experiments to use larger lr for higher layers and smaller for lower layers. 20 epochs per dataset; best model selected by test RMSE/R². Evaluation metrics: RMSE and R²; cross-validation metrics averaged across folds. Baselines include RF with ECFP, LSTM sequence models, and original-study baselines (e.g., GPs, descriptor-based models, polymer genome fingerprints).
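A sketch of LLRD parameter grouping over the regressor from the architecture sketch above; the specific learning rates and decay factor are assumptions, since only the qualitative scheme (larger rates for higher layers) is stated.

```python
from torch.optim import AdamW

def llrd_param_groups(model, head_lr=1e-4, encoder_lr=5e-5, decay=0.9):
    """Layer-wise learning-rate decay: the head gets the largest lr, the top
    encoder layer gets encoder_lr, each lower layer is scaled down by `decay`,
    and the embeddings get the smallest lr."""
    groups = [{"params": model.head.parameters(), "lr": head_lr}]
    layers = model.encoder.encoder.layer          # ModuleList of 6 RoBERTa layers
    n = len(layers)
    for i, layer in enumerate(layers):            # i = 0 is the lowest layer
        groups.append({"params": layer.parameters(),
                       "lr": encoder_lr * decay ** (n - 1 - i)})
    groups.append({"params": model.encoder.embeddings.parameters(),
                   "lr": encoder_lr * decay ** n})
    return groups

model = TransPolymerRegressor(config)             # from the architecture sketch
optimizer = AdamW(llrd_param_groups(model),
                  betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)
```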
Analyses: t-SNE visualization of embeddings to show coverage of downstream chemical space by pretraining data; ablation on pretraining data size (5K/50K/500K/1M vs 5M augmented); comparison of finetuning strategies (head-only vs full-model); effect of data augmentation in finetuning; attention map visualization to interpret token-token relations and the influence of descriptors (e.g., Tg) in predictions.
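The t-SNE view can be produced from pooled <s> embeddings as in the sketch below; the batching interface and perplexity value are assumptions.

```python
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def pooled_embeddings(model, batches):
    """Collect the <s> embedding of each sequence; `batches` is assumed to yield
    (input_ids, attention_mask) tensor pairs."""
    feats = [model.encoder(input_ids=ids, attention_mask=mask).last_hidden_state[:, 0]
             for ids, mask in batches]
    return torch.cat(feats).cpu().numpy()

# 2-D map of pretraining vs downstream sequences; color by dataset to check coverage.
coords = TSNE(n_components=2, perplexity=30).fit_transform(pooled_embeddings(model, batches))
```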
Overall performance: TransPolymer (pretrained) achieves state-of-the-art results across all ten benchmarks, outperforming baselines and the unpretrained variant.
PE-I (polymer electrolyte conductivity; year-based split):
- TransPolymer_pretrained: Train RMSE 0.20, Test RMSE 0.67 (log S·cm⁻¹), Train R² 0.98, Test R² 0.69.
- Best baseline RF (ECFP): Test RMSE 1.00, Test R² 0.32. GP(GNN FP) Test R² 0.16; other baselines often overfit.
- Improvement vs best baseline: −30.9% RMSE, +0.37 R².
PE-II (polymer electrolyte conductivity; 5-fold CV):
- TransPolymer_pretrained: Test RMSE 0.61 (log S·cm⁻¹), Test R² 0.73.
- Best baseline (Extra Trees, chemical descriptors): Test RMSE 0.63, R² 0.72; RF(ECFP) underperforms (RMSE 0.94, R² 0.27).
- Improvement vs best baseline: −3.17% RMSE, +0.01 R².
Electronic properties (Kuenneth et al. datasets; 5-fold CV): Test metrics for TransPolymer_pretrained vs baselines:
- Egc (bandgap, chain): RMSE 0.44 eV, R² 0.92 (best GP(PG) R² 0.90). Improvement: −8.33% RMSE, +0.02 R².
- Egb (bandgap, bulk): RMSE 0.52 eV, R² 0.93 (best GP(PG) R² 0.91). Improvement: −5.45% RMSE, +0.02 R².
- Eea (electron affinity): RMSE 0.32 eV, R² 0.91 (best GP(PG) R² 0.90). Improvement: 0% RMSE, +0.01 R².
- Ei (ionization energy): RMSE 0.39 eV, R² 0.84 (best GP(PG) R² 0.77). Improvement: −7.14% RMSE, +0.07 R².
- Xc (crystallization tendency): RMSE 16.57%, R² 0.50. Baselines had R² < 0 (poor). Improvement: −20.1% RMSE, +0.50 R².
- EPS (dielectric constant): RMSE 0.52, R² 0.76 (best GP(PG) R² 0.68). Improvement: −1.89% RMSE, +0.05 R².
- Nc (refractive index): RMSE 0.10, R² 0.82 (best GP(PG) R² 0.79). Improvement: 0% RMSE, +0.03 R².
OPV (p-type polymer solar cells; 5-fold CV):
- TransPolymer_pretrained: Test RMSE 1.92% PCE, Test R² 0.32; matches RF(ECFP) on RMSE (1.92) while improving R² (0.27 → 0.32), and is ahead of ANN(ECFP) (RMSE 2.03, R² 0.20).
- Improvement vs best baseline: 0% RMSE, +0.05 R².
Aggregate improvements (Table 6):
- Versus best baselines: average −7.70% test RMSE and +0.11 R².
- Versus TransPolymer_unpretrained: average −18.5% test RMSE and +0.12 R².
Ablations:
- Pretraining size: Larger pretraining corpora improve downstream RMSE and R²; small pretraining sets (5K–50K) can hurt performance vs training from scratch on some datasets (e.g., PE-I, Nc, OPV), likely due to distribution mismatch; t-SNE shows downstream data coverage improves with larger pretraining sets (1M→5M augmented).
- Finetuning strategy: Freezing encoder and training head-only yields reasonable but inferior performance; full-model finetuning significantly improves all tasks (e.g., PE-I R² from 0.12 to 0.69; Egc R² from 0.81 to 0.92).
- Data augmentation in finetuning: Generally boosts performance; without augmentation, improvements vs best baselines are smaller or negative on data-scarce sets (e.g., PE-II). With augmentation, average improvements align with Table 6.
Attention analyses: Early layers show strong local token attention (near-diagonal), consistent with chemical bonding locality; deeper layers’ attention becomes more uniform/contextualized. In finetuned models, the prediction token attends to chemically meaningful tokens (e.g., special tokens and Tg values) in electrolyte sequences, aligning with known property determinants.
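These maps can be read straight off the encoder via the Hugging Face `output_attentions` flag, as in the sketch below; the layer/head indexing shown in the comment is illustrative.

```python
import torch

@torch.no_grad()
def attention_maps(model, input_ids, attention_mask):
    """Return per-layer attention weights for inspection: a tuple of 6 tensors
    (one per layer), each of shape (batch, heads=12, seq_len, seq_len)."""
    out = model.encoder(input_ids=input_ids,
                        attention_mask=attention_mask,
                        output_attentions=True)
    return out.attentions

# e.g., how strongly the pooled <s> token (position 0) attends to each input
# token in the last layer, averaged over heads:
# attn = attention_maps(model, ids, mask)[-1].mean(dim=1)[0, 0]
```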
The findings demonstrate that a Transformer-based, chemically-aware language model can learn robust, transferable representations from polymer sequences. By integrating polymer-specific tokenization and MLM pretraining on large unlabeled corpora, TransPolymer generalizes across diverse tasks, materials types (homopolymers, copolymers, mixtures), and noisy datasets (e.g., PE-I), addressing the limitations of RNNs (long-range dependencies) and structure-dependent GNNs (need for explicit conformations and degree of polymerization). Significant gains on challenging datasets like Xc and PE-I indicate improved handling of heterogeneous compositions and auxiliary descriptors (temperature, ratios). Ablations confirm that both sufficient pretraining data and full-model finetuning are crucial for performance, while data augmentation mitigates data scarcity typical in polymer informatics. Attention visualizations suggest the model captures chemically meaningful relationships and focuses on influential descriptors (e.g., Tg) during prediction, supporting interpretability to a degree. Overall, the approach answers the research question by delivering consistent state-of-the-art results and providing a generalizable pretraining-finetuning pipeline for polymer property prediction.
The paper introduces TransPolymer, the first Transformer-based language model tailored for polymers, featuring a chemically-aware tokenizer and MLM pretraining on ~5M unlabeled sequences. Across ten benchmarks, TransPolymer outperforms baseline models and its unpretrained counterpart, with notable improvements in RMSE and R², especially for complex, noisy datasets. Ablation studies underscore the importance of large-scale pretraining, full-model finetuning, and data augmentation. Attention analyses indicate the model learns chemically relevant patterns. TransPolymer presents a practical, generalizable tool for polymer property prediction and can be integrated into active-learning discovery workflows for virtual screening and guided synthesis. Future directions include extending to multi-task learning when multi-property labels are available, exploring classification tasks, improving interpretability beyond attention weights, and expanding pretraining corpora/architectures to further enhance out-of-distribution generalization.
- Data availability: Polymer datasets are often small and noisy; performance without data augmentation can degrade on scarce datasets (e.g., PE-II). Transformers are data-hungry, making augmentation and large unlabeled pretraining important.
- Pretraining coverage: Small pretraining sets lead to distribution mismatch and poorer downstream performance; effectiveness relies on broad coverage of chemical space.
- Finetuning necessity: Freezing the encoder and training only the head underperforms; full-model finetuning is typically required to capture task-specific information.
- Interpretability: Attention weights provide limited causal interpretability and may not fully reflect token importance due to value matrices and attention flow considerations.
- Comparative scope: While single-task performance is strong, prior multi-task models can outperform on some datasets when multi-property labels are available, which may not always be practical.
- Structure-agnostic constraints: The approach does not explicitly encode 3D conformations or chain configurations; such information is learned implicitly and may limit accuracy for properties highly sensitive to conformation.