Leveraging Language Representation for Materials Exploration and Discovery

Engineering and Technology

J. Qu, Y. R. Xie, et al.

Discover a materials exploration framework that employs natural language embeddings from pretrained language models to navigate the space of crystal structures. This approach enables similarity analysis and multi-task learning, paving the way toward diverse prototype structures in thermoelectrics. The research was conducted by Jiaxing Qu, Yuxuan Richard Xie, Kamil M. Ciesielski, Claire E. Porter, Eric S. Toberer, and Elif Ertekin.

Introduction
The discovery and design of inorganic materials is challenged by the vast chemical and structural search space, complex structure–property relationships, and limited labeled data. Traditional ab initio methods and high-throughput simulations have enabled prediction of material properties but are computationally intensive for large-scale exploration. Machine learning has accelerated materials workflows, yet its broader adoption is constrained by the need for universal, task-agnostic representations that capture composition, structure, and phenomena such as defects, alloying, and disorder. Hand-crafted descriptors and graph-based structural models can be task-specific and may not generalize across diverse property targets (e.g., features tuned for energy prediction can underperform for conductivity). Advances in natural language processing suggest that language models pretrained on materials literature encode rich domain knowledge. Prior work with word embeddings demonstrated latent knowledge capture for search and ranking in materials science, but lacked contextual understanding at sentence/paragraph level. This work investigates whether contextual language representations can provide general, scalable inputs for materials discovery tasks, enabling both similarity-based recall of candidates and accurate, multi-property ranking, with a case study in thermoelectrics.
Literature Review
Early ML for materials used physically informed, hand-crafted descriptors capturing composition and structure. Later, crystal structures were represented as graphs enabling local chemical environment feature extraction via graph neural networks, improving property prediction but often remaining task-specific. Word embeddings (e.g., Mat2Vec) trained on materials literature capture latent chemical relationships and have been used for material search and ranking. However, word-level embeddings lack contextual meaning found in sentences/paragraphs. Transformer-based contextual models (e.g., BERT variants) trained with masked language modeling on materials corpora (MatSciBERT, MatBERT) provide richer embeddings. Previous recommender-like systems have filtered candidates based on predicted confidence ranges for target properties, suggested compositions for pseudo-random systems, or proposed synthesis conditions, but a general pipeline combining representation, candidate recall, and multi-property ranking has been lacking. This study synthesizes these strands by leveraging contextual language representations for both candidate generation via similarity and downstream multi-task property prediction and ranking.
Methodology
Framework: A funnel-based recommender architecture with two stages: (1) candidate generation (recall) and (2) property-based ranking. Materials are converted to text descriptions capturing composition and structure and embedded with pretrained materials-domain language models. Recall uses cosine similarity in embedding space to retrieve candidates similar to a query; ranking employs a multi-task neural model to predict multiple properties and score similarity to the query.

Data and representations: A database of ~116K materials was prepared. Two input modalities were embedded: (a) composition (material formulae such as "PbTe") and (b) structure, using human-readable crystal structure descriptions generated automatically by Robocrystallographer. Six embedding strategies were explored: Mat2Vec (word embeddings) and MatSciBERT/MatBERT for composition; local-environment structure fingerprints and MatSciBERT/MatBERT sentence embeddings for structure. BERT-based embeddings were formed from full passages describing the structure (768-dimensional vectors). Composition embeddings for doped materials were normalized to the most abundant element in the unit cell.

Candidate recall: For a query material, cosine similarity is computed between its embedding(s) and those of all materials, and the top-k most similar materials are recalled. Unsupervised evaluation uses precision@k and normalized discounted cumulative gain (nDCG) against known relevant items; UMAP projections visualize clustering by chemistry and structural similarity.

Ranking and property prediction: A multi-gate mixture-of-experts (MMoE) multi-task network predicts target properties from concatenated inputs (composition embeddings, structure embeddings, and context features such as temperature). Experts are MLPs (layer sizes 128/64/32); task-specific gating networks are MLPs (32/16).
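The recall stage described above reduces to nearest-neighbor search by cosine similarity over precomputed embeddings, evaluated with precision@k. A minimal sketch; the function names and toy setup are illustrative, not from the paper:

```python
import numpy as np

def recall_top_k(query_vec, embeddings, k=100):
    """Return indices of the k materials most similar to the query
    by cosine similarity in embedding space."""
    q = query_vec / np.linalg.norm(query_vec)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                        # cosine similarity to the query
    return np.argsort(sims)[::-1][:k]   # indices, most similar first

def precision_at_k(recalled, relevant, k):
    """Fraction of the top-k recalled items that are known-relevant."""
    return len(set(recalled[:k]) & set(relevant)) / k
```

A query embedded close to a known material will recall it first, so precision@k against a curated relevant set gives an unsupervised check of recall quality.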
Training: MAE loss, 500 epochs, learning rate 1e-3, weight decay 1e-3, batch size 64, 5-fold cross-validation. For thermoelectrics (TE), the tasks are log power factor, Seebeck coefficient, log electrical conductivity, log thermal conductivity, and zT; inputs are host/doped composition and host structure embeddings plus temperature context (binned and one-hot encoded).

Ranking score: After property prediction for the recalled candidates, a relevancy score is computed as the reciprocal of the total absolute percent difference (TAPD) between candidate and query properties across K properties; candidates are ranked by normalized relevancy.

Benchmark property prediction: Feature-based models (MLPs) trained directly on fixed embeddings (without fine-tuning) predict AFLOW properties (e.g., band gap, energy per atom, shear modulus, Debye temperature, thermal expansion) on 5,700 materials with cross-validation. Small-data experiments (200 training, 100 test materials) compare embedding-based models against graph convolutional networks.

First-principles validation: AMSET estimates scattering rates and transport properties under the momentum relaxation time approximation (MRTA), accounting for acoustic phonon, polar optical phonon, and ionized impurity scattering; the total scattering time follows Matthiessen's rule. Seebeck coefficient, electrical conductivity, and electronic thermal conductivity are obtained from Boltzmann transport (BoltzTraP) via Onsager coefficients. DFT inputs are computed with GGA-PBE. Lattice thermal conductivity k_L is estimated with a semi-empirical modified Debye–Callaway model using structure-derived parameters (density, average atomic mass, volume per atom, coordination number, atoms per primitive cell); the bulk modulus comes from a Birch–Murnaghan equation of state. These calculations assess the top recommended candidates from clusters of interest.

Experimental validation: Selected candidates (e.g., CuZnGaTe2, CuGaGeSe4, HgGa2Te4) were synthesized from high-purity elements via melting, mechanical milling (Spex 8000D), and hot pressing (up to 500 °C, 40 MPa).
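The relevancy score above can be sketched directly: the reciprocal of the total absolute percent difference (TAPD) across the K predicted properties, normalized over candidates. The `eps` guard against division by zero is an added assumption, not a detail from the paper:

```python
import numpy as np

def relevancy_scores(query_props, candidate_props, eps=1e-12):
    """Score each candidate by 1 / TAPD relative to the query across
    K properties, then normalize scores to sum to 1 for ranking."""
    q = np.asarray(query_props, dtype=float)
    c = np.asarray(candidate_props, dtype=float)
    # Total absolute percent difference across the K properties.
    tapd = np.sum(np.abs((c - q) / (np.abs(q) + eps)), axis=1)
    scores = 1.0 / (tapd + eps)          # closer to query -> higher score
    return scores / scores.sum()         # normalized relevancy
```

A candidate whose predicted properties match the query exactly has TAPD ≈ 0 and dominates the normalized ranking, which matches the funnel's goal of prioritizing functional look-alikes.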
Electrical resistivity and Hall coefficient were measured by the Van der Pauw method; thermal diffusivity was measured with a NETZSCH LFA 467 module, and thermal conductivity was obtained from diffusivity, heat capacity (Dulong–Petit), and experimentally measured density. Measured densities were ≈90% of theoretical or higher.

Exploration analysis: UMAP regions with high predicted zT were identified; clusters (halide/oxide perovskites, spinel-like, delafossite-like) were examined, and the top candidates per cluster were selected for first-principles screening at a 900 K temperature context.
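The MMoE ranking model and the binned, one-hot temperature context described in the methodology can be sketched as a single forward pass: shared experts transform the concatenated input, and a softmax gate per task mixes the expert outputs before a task head. All layer sizes, activations, and initializations here are illustrative assumptions, and training (MAE loss, weight decay, etc.) is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def one_hot_temperature(temps, bins):
    """Bin temperatures and one-hot encode them as context features."""
    idx = np.digitize(temps, bins)
    out = np.zeros((len(temps), len(bins) + 1))
    out[np.arange(len(temps)), idx] = 1.0
    return out

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class TinyMMoE:
    """Minimal multi-gate mixture-of-experts forward pass: shared
    experts, one softmax gate per task, one scalar head per task."""
    def __init__(self, d_in, n_experts=4, d_hidden=32, n_tasks=5):
        self.experts = [rng.normal(scale=0.1, size=(d_in, d_hidden))
                        for _ in range(n_experts)]
        self.gates = [rng.normal(scale=0.1, size=(d_in, n_experts))
                      for _ in range(n_tasks)]
        self.heads = [rng.normal(scale=0.1, size=(d_hidden, 1))
                      for _ in range(n_tasks)]

    def forward(self, x):
        # Every expert sees the same concatenated input features.
        e_out = np.stack([np.tanh(x @ w) for w in self.experts], axis=1)
        preds = []
        for g, h in zip(self.gates, self.heads):
            w = softmax(x @ g)                         # per-task gate weights
            mixed = np.einsum('be,bed->bd', w, e_out)  # task-specific mixture
            preds.append((mixed @ h).ravel())          # one value per sample
        return preds  # one prediction vector per property task
```

With five tasks this returns five prediction vectors (power factor, Seebeck, electrical and thermal conductivity, zT in the paper's setup), each driven by its own gating over the shared experts.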
Key Findings
- Language-based embeddings capture chemically and structurally meaningful information: UMAP projections show clear separation by anion groups for structure embeddings; composition embeddings preserve chemical differences with more heterogeneous clusters.
- Unsupervised recall effectiveness: Among the top-100 recalled candidates using language embeddings, many show property similarity to the query (e.g., PbTe). In statistical tests, 94 of 100 top-ΔZ candidate sets have ΔZ distributions significantly different from random baselines, indicating successful enrichment for relevant materials.
- Benchmark property prediction (AFLOW tasks): Fixed embeddings with MLPs yield strong performance, often comparable to or exceeding baselines. Examples from Table 1: with structure MatBERT embeddings, energy per atom MAE ≈ 0.29 eV with R² ≈ 0.96; band gap R² up to ≈ 0.89; log thermal conductivity MAE ≈ 0.23 with strong R². Embedding-based models outperform GCNs in small-data regimes (200 train/100 test), highlighting transfer from pretrained language features.
- Multi-task learning (MMoE) improves TE property prediction: Joint training leverages moderate inter-task correlations (Pearson 0.15–0.5). Seebeck prediction improves by ~17% over single-task models. The best inputs combine doped composition embeddings and host structure embeddings with temperature context; composition-only inputs perform worst.
- TE multi-task performance (5-fold CV, best MMoE with concatenated features): log power factor MAE 0.433, R² 0.778; Seebeck coefficient MAE 35.4, R² 0.872; log electrical conductivity MAE 0.654, R² 0.832; log thermal conductivity MAE 0.221, R² 0.841; zT MAE 0.088, R² 0.741.
- Ranking diversifies structural prototypes: Top-15 recommendations for queries (e.g., CoSb3, CuGaTe2, Mg3Sb2) span multiple prototype structures sharing local motifs relevant to TE performance, enabling exploration beyond simple isostructural substitutions.
- Discovery of under-explored high-zT regions: A "band" in representation space contains both known and predicted high-zT materials; a bottom-right subregion lacks experimental records yet shows high predicted zT. Clusters include halide perovskites (bromides with mean predicted zT > 1.0), oxide perovskites, spinel-like, and delafossite-like structures. The top bromide/fluoride perovskite candidates are often Cs-/K-containing double perovskites.
- First-principles confirmations: Predicted high performance for candidates such as delafossite-like CdPbS2 (h-type, zT_max ≈ 1.7 at 800 K) and halide perovskite Cs2InSbBr6 (p-type, zT_max ≈ 1.0 at 800 K) corroborates their promise.
- Experimental indications: A top-ranked candidate exhibits a high Seebeck coefficient (~250 μV/K at 575 K) and low thermal conductivity (~1.7 W m−1 K−1 at 575 K), with self-doping giving a Hall carrier concentration of ~4.5 × 10^19 cm−3 at 473 K; for comparison, CaGeTe4 shows ≈320 μV/K and k ≈ 2.5 W m−1 K−1 at 575 K. Other candidates show very low room-temperature thermal conductivity (e.g., HgGa2Te4 ≈ 0.36 W m−1 K−1; CuGaGeS4 ≈ 0.62 W m−1 K−1).
Discussion
The study addresses the representation bottleneck in materials discovery by leveraging contextual language embeddings as general, task-agnostic inputs that encode composition, structure, and domain knowledge from literature. Unsupervised recall in embedding space retrieves candidates structurally/chemically related to queries and enriched for similar functional performance, while supervised multi-task ranking capitalizes on correlated properties to improve prediction accuracy. The framework balances exploration and exploitation: by restricting ranking to recall neighborhoods, it avoids matching properties arising from unrelated physical origins and, by adjusting the number of recalled candidates, it can diversify prototype structures while maintaining relevance. Compared with decoder-only LLMs, encoder-only BERT-style models are well-suited for deriving rich, general-purpose representations; materials-domain pretraining further boosts performance on materials tasks. The approach naturally accommodates off-stoichiometry (doping/alloying) via compositional text inputs, which is challenging for traditional first-principles disorder modeling. Application to thermoelectrics demonstrates that language-based representations, combined with MMoE, achieve competitive predictive accuracy and reveal promising, under-explored materials spaces, with supporting first-principles calculations and initial experiments underscoring practical viability.
Conclusion
This work introduces a general materials recommendation and discovery framework that uses language model embeddings as unified representations of composition and structure. The two-stage funnel (unsupervised recall, supervised multi-task ranking) effectively retrieves and prioritizes candidates similar to a query in both structure and functional performance. Across AFLOW benchmarks and thermoelectric tasks, language-based representations yield strong predictive accuracy, particularly when combined with multi-task MMoE and context features (temperature, doping). The framework uncovers under-studied regions in representation space (e.g., halide perovskites, spinel- and delafossite-like compounds) with high predicted zT; first-principles and experimental results validate several recommendations. Future work should enhance structural text diversity (e.g., off-centering, distortions), incorporate richer context (processing conditions, microstructure), extend to other functional materials domains, and explore integration with generative models for inverse design and active learning for closed-loop discovery.
Limitations
- Representation scope: Structural descriptions generated by Robocrystallographer may miss fine details (off-centering, bond distortions, long-range order/disorder) and stylistic diversity; enriching descriptions could improve embeddings.
- Task specificity and generalization: Although task-agnostic, performance may vary across properties; some tasks showed modest gains under multi-tasking and depend on data quality and coverage.
- Exploration–exploitation balance: The number of recalled candidates affects diversity versus relevance; tuning is required and problem-dependent.
- Doping/alloying modeling: Approximate handling via normalized compositional embeddings and cosine-similarity mapping to hosts may not capture complex defect chemistry or site preferences.
- Data limitations: Limited labeled data for certain materials/properties and potential biases in literature-derived corpora can affect representation quality and model generalizability.
- Validation breadth: Experimental validations are initial and limited in scope; broader synthesis and characterization campaigns are needed to confirm performance across the recommended spaces.
- Domain and model choices: Findings emphasize encoder-only, materials-domain language models; results with general LLMs or alternative encoders may differ. The framework is demonstrated on thermoelectrics; transfer to other applications requires additional validation.