Chemistry
A public database of thermoelectric materials and system-identified material representation for data-driven discovery
G. S. Na and H. Chang
The ESTM dataset catalogs experimentally synthesized thermoelectric materials together with strong predictive models. In this work, Gyoung S. Na and Hyunju Chang show how a novel material descriptor, SIMD, significantly improves prediction accuracy and supports high-throughput screening for superior thermoelectric materials.
~3 min • Beginner • English
Introduction
The study addresses the challenge of discovering high-performance thermoelectric (TE) materials, which convert heat to electricity via the Seebeck and Peltier effects and are critical for energy harvesting, cooling, and power generation. Traditional first-principles approaches such as density functional theory (DFT) struggle to handle doped and alloyed systems due to the computational cost of large unit cells, making property prediction for complex TE materials difficult. Machine learning (ML) methods, including graph neural networks, have shown strong performance but often require crystal structures, which are typically unavailable for doped/alloyed materials. Composition-only models such as Roost and DopNet address this limitation, but Roost is limited to pristine materials and DopNet’s host/dopant split may not capture complex thermodynamics across material groups. The authors note the lack of a public, experimentally curated TE database and of a descriptor enabling extrapolation to new material systems. This work aims to: (1) build a public dataset of experimentally synthesized thermoelectric materials (ESTM) with measured properties; (2) develop a system-identified material descriptor (SIMD) that enhances extrapolation across unseen material groups; and (3) demonstrate improved predictive performance and screening utility for discovering novel high-ZT materials.
Literature Review
Prior efforts in computational materials science include DFT-based analyses of electronic structures and properties, but DFT’s scalability limits its applicability to doped/alloyed TE materials with large or complex cells. ML has achieved strong results in predicting material properties; GNNs using crystal graphs excel when structures are known but are less applicable without structural data. Composition-based models include Roost, which learns from stoichiometry via an elemental graph (effective for pristine crystals), and DopNet, which separately embeds hosts and dopants to model doped/alloyed systems and achieved state-of-the-art thermoelectric property predictions. Additional ML approaches (neural networks and ensembles) have been applied to TE property prediction. However, gaps remain: no widely available, curated public database of experimental TE data and no composition-based descriptor tailored for extrapolation to unknown material groups. This work builds on these advances by providing a public experimental dataset (ESTM) and introducing SIMD to incorporate system-level information for better extrapolation.
Methodology
Data construction (ESTM): The authors curated a public dataset of experimentally synthesized thermoelectric materials (ESTM) via literature search, yielding 5,205 observations uniquely defined by composition and measurement temperature. Each row includes composition, temperature, Seebeck coefficient, electrical conductivity, thermal conductivity, power factor, ZT, and DOI. The dataset covers 880 unique materials and 65 elements (Li–Bi). Reported ZT maxima: 2.16 at ≥700 K and 1.17 near room temperature (~300 K). Quantitative ranges and distributions for features are summarized (e.g., temperature: 10–1275 K; ZT: (0, 2.28), mean 0.35 ± 0.35). Sparse encoding is used to transform composition into numerical vectors (details in Methods). Electrical and thermal conductivity values are log-transformed to reduce variance for ML training.
Interpolation experiments: Six ML methods were evaluated to predict Seebeck coefficient, electrical conductivity, thermal conductivity, and ZT from composition and temperature: Ridge regression, KNN regression, SVR, Gaussian process regression, fully connected neural network (FCNN; 2 hidden layers), and XGBoost (XGB). Three-fold cross-validation was used (training on 2 folds, testing on 1; repeated to cover all folds). Sparse composition encoding is applied; log-transform for conductivities. Performance metrics included MAE and R^2.
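A minimal sketch of this evaluation protocol, assuming the dataset has been loaded with precomputed sparse-encoding columns (the file name and column names here are illustrative, not the released schema):

```python
# Minimal sketch of the 3-fold CV interpolation protocol with XGBoost.
# "estm.csv" and the column names are assumptions, not the released schema.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error, r2_score
from xgboost import XGBRegressor

df = pd.read_csv("estm.csv")
feature_cols = [c for c in df.columns if c.startswith("x_")]  # sparse composition encoding
X = df[feature_cols + ["temperature"]].to_numpy()
y = np.log10(df["electrical_conductivity"].to_numpy())        # log-transform conductivities

maes, r2s = [], []
for tr, te in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = XGBRegressor(n_estimators=500, learning_rate=0.1)
    model.fit(X[tr], y[tr])
    pred = model.predict(X[te])
    maes.append(mean_absolute_error(y[te], pred))
    r2s.append(r2_score(y[te], pred))
print(f"MAE {np.mean(maes):.3f} ± {np.std(maes):.3f}, R2 {np.mean(r2s):.3f} ± {np.std(r2s):.3f}")
```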
System-Identified Material Descriptor (SIMD): SIMD creates system-conditioned features to enhance extrapolation; a code sketch of the three steps follows the list below.
- Step 1: Material cluster generation. A cluster identifier id(s) is defined by rounding element ratios >0.5 in composition s: id(s) = {(e, round(r_e)) | r_e > 0.5}. Alloy/doped variants derived from a common pristine (e.g., SnS_x, Ta_1−xTi_xFeSb) are grouped under a shared identifier (e.g., SnS, TaFeSb).
- Step 2: Material cluster characterization. For each cluster, two latent vectors are computed:
• System vector w (dimension d = M + L), solved via least squares from a linear system relating atomic feature vectors X (M-dim), condition vectors C (L-dim; e.g., temperature), and target y over samples in the cluster. This summarizes local input–target relationships.
• Target statistics vector v = [mean, std, min, max] of target values in the cluster.
The cluster representation is the concatenation of w and v.
- Step 3: System-identified feature generation. Define an anchor space (12-D) for compositions using elemental atomic number, atomic volume, and atomic weight statistics (mean, std, min, max). For a new input s, use KNN in anchor space to select K nearest clusters, compute distance-weighted aggregates of their system vectors and target statistics, then form z_s = x_s ⊕ c_s ⊕ w^(K) ⊕ v^(K). These augmented features replace original inputs for downstream ML.
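To make the three steps concrete, here is a minimal Python sketch. Compositions are assumed to be dicts mapping element symbols to stoichiometric ratios, and the helper names (cluster_id, system_vector, simd_features) are illustrative rather than the authors' released code.

```python
# Minimal sketch of SIMD's three steps; all names here are illustrative.
import numpy as np

def cluster_id(comp):
    """Step 1: identifier from dominant stoichiometry: {(e, round(r_e)) | r_e > 0.5}."""
    return frozenset((e, round(r)) for e, r in comp.items() if r > 0.5)

def system_vector(X, C, y):
    """Step 2a: least-squares solution w of [X | C] w ~= y over one cluster's samples."""
    A = np.hstack([X, C])                       # shape (n, M + L)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # shape (M + L,)
    return w

def target_stats(y):
    """Step 2b: v = [mean, std, min, max] of the cluster's target values."""
    return np.array([y.mean(), y.std(), y.min(), y.max()])

def simd_features(x, c, anchor, cluster_anchors, cluster_w, cluster_v, K=2):
    """Step 3: distance-weighted aggregation over the K nearest clusters in anchor space."""
    d = np.linalg.norm(cluster_anchors - anchor, axis=1)
    nearest = np.argsort(d)[:K]
    wts = 1.0 / (d[nearest] + 1e-8)             # inverse-distance weights
    wts /= wts.sum()
    w_K = wts @ cluster_w[nearest]
    v_K = wts @ cluster_v[nearest]
    return np.concatenate([x, c, w_K, v_K])     # z_s = x_s ⊕ c_s ⊕ w^(K) ⊕ v^(K)
```

For example, cluster_id({"Ta": 0.9, "Ti": 0.1, "Fe": 1, "Sb": 1}) drops the minority Ti and groups the sample under TaFeSb, as described in Step 1. The K parameter here is the same knob examined in the hyperparameter study below.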
Implementation specifics (Methods; both encodings are sketched in code after this list):
- Sparse encoding x: 100-D vector for elements H–Fm; x_i = r_e if i matches the atomic number of element e in s; else 0.
- Anchor vector a: 12-D concatenation of mean/std/min/max of three elemental attributes (atomic number, volume, weight) for the composition.
- Algorithm 1 outlines SIMD’s end-to-end feature generation (cluster creation, system/statistics vectors, feature concatenation).
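A minimal sketch of the two encodings. The attribute table below is truncated to a few elements with illustrative values (the paper tabulates attributes for all elements H–Fm), and whether the anchor statistics are composition-weighted is not specified here, so this sketch uses unweighted element-wise statistics.

```python
# Minimal sketch of the sparse composition encoding and the 12-D anchor vector.
# Both lookup tables are truncated; their values are illustrative only.
import numpy as np

ATOMIC_NUMBER = {"S": 16, "Ti": 22, "Fe": 26, "Sn": 50, "Sb": 51, "Te": 52, "Bi": 83}
# element -> (atomic number, atomic volume, atomic weight); illustrative values
ATTRS = {"Sn": (50.0, 16.29, 118.71), "S": (16.0, 15.53, 32.06)}

def sparse_encode(comp, dim=100):
    """x_i = r_e at index (atomic number of e) - 1, else 0, for elements H-Fm."""
    x = np.zeros(dim)
    for e, r in comp.items():
        x[ATOMIC_NUMBER[e] - 1] = r
    return x

def anchor_vector(comp):
    """12-D concatenation of mean/std/min/max over the three elemental attributes."""
    a = np.array([ATTRS[e] for e in comp])  # (n_elements, 3)
    return np.concatenate([a.mean(0), a.std(0), a.min(0), a.max(0)])
```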
Transfer learning with SIMD for extrapolation: SIMD is applied to transfer learning by merging a large source dataset (Starry; 215,683 entries of TE materials and ZT, containing mixed experimental and theoretical values and potential parsing issues) with the ESTM training data to construct clusters and system-conditioned representations. Four models are trained for extrapolation to unknown material groups: FCNN_t (pretrain on Starry then finetune on ESTM), FCNN_m (train on merged Starry+ESTM), XGB_t (train on merged), and SXGB_t (XGB trained on merged data transformed by SIMD). Extrapolation evaluation splits are by material groups (entire groups withheld from training; e.g., all SnS-derived materials removed from training if SnS is in test).
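The group-withheld split can be reproduced with scikit-learn's GroupKFold; a minimal sketch assuming each sample carries a string cluster label (e.g., "SnS") derived from Step 1:

```python
# Minimal sketch of the extrapolation split: entire material groups are
# withheld from training, so no cluster appears in both train and test.
from sklearn.model_selection import GroupKFold

def extrapolation_splits(X, y, groups, n_splits=5):
    # groups[i] is the sample's cluster label, e.g. "SnS" or "TaFeSb"
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, groups=groups):
        yield tr, te  # e.g., all SnS-derived samples fall on one side only
```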
High-throughput screening tasks: Binary classification of high-ZT materials using predicted ZT thresholds under (a) overall and (b) temperature-constrained settings: near-RT (290–310 K, ZT ≥ 0.5), common TE range (300–600 K, ZT ≥ 0.8), and high temperature (T ≥ 600 K, ZT ≥ 1.5). Metrics include F1-score, confusion matrices, and false positives.
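Each screening task reduces to thresholding predicted ZT inside a temperature window and scoring against the experimental labels; a minimal sketch with illustrative names:

```python
# Minimal sketch of one temperature-constrained screening task.
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

def screen(zt_true, zt_pred, temps, t_range=(290, 310), zt_min=0.5):
    mask = (temps >= t_range[0]) & (temps <= t_range[1])  # temperature window
    y_true = zt_true[mask] >= zt_min                      # experimental high-ZT label
    y_pred = zt_pred[mask] >= zt_min                      # predicted high-ZT label
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return f1_score(y_true, y_pred), fp                   # F1 and false-positive count
```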
Exploration of virtual dopant spaces: Using SXGB_t, generate virtual doped compositions for a given host (e.g., Bi0.5Sb1.5Te3) by adding candidate dopants (elements H–Fm) at concentrations 0.001–0.1 and predict ZT across temperatures (300–800 K). Rank candidates by predicted ZT (e.g., top 10% at 300 K) and compare against literature to validate identified dopants.
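A minimal sketch of the dopant scan, where featurize and predict_zt stand in for the SIMD pipeline and the trained SXGB_t model (both are assumptions passed in as callables):

```python
# Minimal sketch of virtual dopant-space exploration for a fixed host.
import numpy as np

HOST = {"Bi": 0.5, "Sb": 1.5, "Te": 3.0}
DOPANTS = ["Ag", "Ti", "Cu", "Zn"]        # in the paper: all elements H-Fm
CONCS = np.geomspace(0.001, 0.1, num=10)  # dopant concentrations 0.001-0.1

def scan(predict_zt, featurize, t_eval=300):
    results = []
    for el in DOPANTS:
        for x in CONCS:
            comp = dict(HOST, **{el: x})          # virtual doped composition
            results.append((el, x, predict_zt(featurize(comp, t_eval))))
    results.sort(key=lambda r: -r[2])             # rank by predicted ZT at t_eval
    return results[:max(1, len(results) // 10)]   # e.g., keep the top 10%
```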
Hyperparameter study (K): Evaluate R^2 of SXGB_d vs K ∈ {1,…,~2048} and vs data size (number of clusters m ∈ {128, 256, 512}). Findings: performance relatively stable with K; for small m, larger K (4–8) helps; for larger m, K = 1–2 suffices.
Global search extension: Formulate discovery as constrained optimization using trained extrapolator f(x;θ) (e.g., SXGB) and constraint penalty g(x) (e.g., limit number of elements), enabling integration with metaheuristic optimizers (e.g., equilibrium optimizer) for broader material space exploration.
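One way to realize this formulation is to wrap the trained extrapolator in a penalized objective and hand it to an off-the-shelf metaheuristic; the sketch below uses SciPy's differential evolution as a stand-in for the equilibrium optimizer, with an illustrative element-count penalty.

```python
# Minimal sketch of the constrained global search: minimize -f(x) + penalty * g(x).
import numpy as np
from scipy.optimize import differential_evolution

def make_objective(f, penalty_weight=10.0, max_elements=4):
    def objective(x):                              # x: candidate composition vector
        n_elements = np.count_nonzero(x > 1e-3)
        g = max(0, n_elements - max_elements)      # constraint violation g(x)
        return -f(x) + penalty_weight * g          # maximize predicted ZT under penalty
    return objective

# usage sketch: result = differential_evolution(make_objective(predict_zt),
#                                               bounds=[(0.0, 1.0)] * 100, maxiter=50)
```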
Key Findings
- Public ESTM dataset: 5,205 experimental observations across 880 unique thermoelectric materials and 65 elements, with five properties (Seebeck, electrical conductivity, thermal conductivity, power factor, ZT) and temperatures 10–1275 K. Max ZT: 2.16 (≥700 K), 1.17 (~300 K).
- Interpolation performance (3-fold CV): XGB outperformed all baselines with R^2 > 0.9 for all targets and lowest MAEs:
• Seebeck coefficient: MAE 21.10 ± 0.48 µV/K, R^2 0.96 ± 0.01.
• Electrical conductivity (log-transformed): MAE 0.28 ± 0.02, R^2 0.92 ± 0.01.
• Thermal conductivity (log-transformed): MAE 0.09 ± 0.01 in log-scale units, R^2 0.97 ± 0.01.
• ZT: MAE 0.06 ± 0.01, R^2 0.91 ± 0.01.
Linear and kernel baselines (Ridge, SVR, GPR) underperformed (R^2 < 0.5 or negative).
- Extrapolation to unseen material groups:
• Baseline XGB: R^2 = 0.13 ± 0.07 for ZT prediction.
• Transfer learning improved performance: XGB_t ≈ 0.52 ± 0.09; FCNN_t ≈ 0.22 ± 0.07; FCNN_m ≈ 0.48 ± 0.10.
• SIMD-transformed XGB (SXGB_t): R^2 = 0.71 ± 0.05, a large gain over XGB (+0.58) and XGB_t (+0.19).
- High-throughput screening for high-ZT (unseen groups):
• Threshold ZT ≥ 1.5: XGB_t F1 = 0.49; SXGB_t F1 = 0.61; false positives reduced from 21 to 6; precision for predicted positives improved from 50.00% to 78.57%.
• Temperature-constrained screening:
- Near RT (290–310 K, ZT ≥ 0.5): F1 0.36 (XGB_t) vs 0.65 (SXGB_t); false positives reduced from 13 to 4 (~69% reduction).
- 300–600 K (ZT ≥ 0.8): F1 0.50 vs 0.63; false positives reduced from 55 to 27 (~51% reduction).
- ≥600 K (ZT ≥ 1.5): F1 0.55 vs 0.62; false positives reduced from 21 to 6 (~71% reduction). Overall ≈50% or better FP reduction across tasks.
- Virtual dopant exploration: SXGB_t identified Ag and Ti as promising dopants for Bi0.5Sb1.5Te3 at low temperatures; predicted ZT–T trends align with reported experimental improvements for Ag/Ti doping.
- Hyperparameter robustness: Extrapolation accuracy of SXGB_d is relatively insensitive to K with sufficiently large training data; for small datasets, modest K > 1 (e.g., 4–8) improves performance by leveraging multiple similar clusters.
Discussion
The work demonstrates that an experimentally curated dataset (ESTM) combined with a system-level descriptor (SIMD) enables accurate interpolation and, crucially, improved extrapolation of thermoelectric performance from composition and conditions without crystal structures. Interpolation results (R^2 > 0.9) validate the dataset’s quality and the efficacy of tree-based models for property prediction. Extrapolation to unseen material groups is typically challenging; the baseline models fail (R^2 ≈ 0.13–0.15). By clustering materials into systems and summarizing local input–target relationships (system vectors) and property distributions (target statistics), SIMD equips the model with context beyond individual compositions. Transfer learning with a large, albeit noisy, source set (Starry) further enriches the system database. Together, these enhance generalization to new material systems (SXGB_t R^2 ≈ 0.71), capturing the approximate mapping from composition/temperature to ZT in novel groups. In high-throughput screening, reducing false positives is critical to avoid costly experimental dead-ends; SIMD consistently halves false positives and raises F1, suggesting more reliable candidate prioritization. Case analysis (e.g., Mg–Li–Ge–Si system) shows SIMD helps disambiguate chemically similar but system-different materials, mitigating mispredictions caused by naive compositional proximity. Robustness analyses indicate practical guidance for K selection: small datasets benefit from aggregating multiple nearest systems; larger datasets often need only the closest system. The demonstrated dopant-space exploration underscores practical utility: identifying known performance-enhancing dopants and reproducing temperature-dependent ZT trends, supporting data-driven discovery workflows. Overall, the findings substantiate that system-aware descriptors and transfer learning can bridge the extrapolation gap in materials informatics, especially when structural data are unavailable.
Conclusion
This study offers: (1) a public, curated ESTM dataset of 5,205 experimental observations across 880 thermoelectric materials with key properties and conditions; (2) a system-identified material descriptor (SIMD) that clusters related materials, characterizes system-level input–target relationships, and augments composition-based features via KNN over an anchor space; and (3) empirical evidence that SIMD substantially improves extrapolation to unseen material groups (R^2 up to ~0.71 for ZT) and enhances high-throughput screening reliability by markedly reducing false positives under general and temperature-constrained scenarios. The approach further enables practical tasks such as dopant-space exploration, recovering known beneficial dopants and temperature-dependent performance trends. All data and code are publicly released to facilitate reuse. Future directions include expanding and refining experimental datasets, integrating additional physical attributes and constraints, applying SIMD to other material properties, and leveraging metaheuristic global optimization with the trained extrapolators to automate discovery under domain-specific constraints.
Limitations
- Source dataset (Starry) used for transfer learning mixes experimental and theoretical values and may contain parsing errors; while beneficial as auxiliary knowledge, its noise can limit transfer quality.
- Extrapolation performance, though substantially improved (R^2 ~0.71), is not perfect; certain material groups remain challenging, and predictive uncertainty is not explicitly modeled.
- Material clustering uses rounded dominant stoichiometry and ignores processing/measurement conditions when defining clusters; complex systems with subtle compositional/processing effects may not be fully captured.
- Composition encoding and conditioning are relatively simple (sparse elemental ratios plus temperature), without explicit structural or microstructural descriptors, potentially limiting fidelity for structure-sensitive properties.
- Dataset coverage, while broad (65 elements, 880 materials), may be sparse in some chemistries and temperature ranges, which can affect generalizability.
- Conductivity targets are log-transformed for training; interpretation of MAE values requires attention to this transformation.
- Hyperparameter sensitivity to K can matter in small-data regimes; careful validation is needed when data are limited.