Chemistry
Leveraging large language models for predictive chemistry
K. M. Jablonka, P. Schwaller, et al.
Large language models (LLMs), a class of foundation models, can generate coherent text and have shown surprising capabilities, including solving simple tabular regression and classification tasks they were never explicitly trained for. This motivates the question of whether such models can answer scientific questions in chemistry, where many problems are naturally expressible as text (for example, predicting whether a metal-organic framework is stable in water). Chemistry and materials science often suffer from small datasets, making data-efficient approaches crucial. The authors explore whether fine-tuned GPT‑3 can be adapted to answer chemistry questions and perform prediction and design tasks across molecules, materials, and reactions, potentially reshaping standard machine-learning workflows by leveraging the knowledge encoded in pre-trained language models.
Prior work has applied language models to chemistry for property prediction and molecule design, typically using chemistry-specific pretraining objectives or data. LLMs have been assessed for inherent chemistry knowledge and demonstrated in-context learning abilities. Conventional representations like IUPAC names, SMILES, and SELFIES have been used extensively in ML for chemistry. Benchmarks such as Matbench provide baselines for materials property prediction. Prior inverse design efforts used generative models (VAEs, GANs) with large datasets or evolutionary strategies (genetic algorithms). Photoswitch discovery with Gaussian processes and datasets like QMugs for quantum properties (including HOMO-LUMO gaps) are established references that this work leverages for evaluation and comparison.
The authors use language-interfaced fine-tuning (LIFT) to adapt GPT‑3 to non-language tasks by converting inputs and outputs into natural-language prompts and text completions. For classification, questions such as 'What is the phase of <composition>?' are answered with short textual labels (e.g., 'single phase'); for regression, numeric targets are rounded or binned so they can be returned as text. Inverse design is performed by inverting the prompt and completion, asking the model to propose a structure that matches a stated property value.
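As a concrete illustration of this prompt/completion conversion, a minimal sketch follows; the question wording, the '###'/'@@@' delimiters, and the example properties are illustrative assumptions, not necessarily the paper's exact templates.

```python
# LIFT-style prompt/completion pairs (illustrative templates and delimiters).
def classification_example(composition: str, phase_label: str) -> dict:
    prompt = f"What is the phase of {composition}?###"
    completion = f" {phase_label}@@@"  # leading space helps GPT-3 tokenization
    return {"prompt": prompt, "completion": completion}

def regression_example(smiles: str, wavelength_nm: float, ndigits: int = 0) -> dict:
    # Continuous targets are rounded/binned so they can be emitted as text.
    prompt = f"What is the transition wavelength of {smiles}?###"
    completion = f" {round(wavelength_nm, ndigits)}@@@"
    return {"prompt": prompt, "completion": completion}

def inverse_design_example(wavelength_nm: float, smiles: str) -> dict:
    # Inverse design swaps the roles of property and structure.
    prompt = f"What is a photoswitch with a transition wavelength of {wavelength_nm} nm?###"
    completion = f" {smiles}@@@"
    return {"prompt": prompt, "completion": completion}
```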
Training details: All case studies use the same fine-tuning hyperparameters (8 epochs, learning rate multiplier 0.02). Model training typically requires only minutes. Representation experiments compare IUPAC names, SMILES, and SELFIES for molecular properties.
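A hedged sketch of the corresponding fine-tuning step, assuming the legacy OpenAI fine-tunes endpoint (openai<1.0 SDK) that was current at the time of the study and the hyperparameters quoted above; the example pairs and compositions are illustrative only.

```python
import json
import openai  # legacy SDK (openai<1.0); newer versions expose a different API

# Illustrative prompt/completion pairs serialized to JSONL for fine-tuning.
examples = [
    {"prompt": "What is the phase of CoCrFeNi?###", "completion": " single phase@@@"},
    {"prompt": "What is the phase of Al0.5CoCrFeNi?###", "completion": " multi phase@@@"},
]

with open("train.jsonl", "w") as fh:
    for ex in examples:
        fh.write(json.dumps(ex) + "\n")

uploaded = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
openai.FineTune.create(
    training_file=uploaded["id"],
    model="ada",                      # one of the GPT-3 base models
    n_epochs=8,                       # hyperparameters reported in the paper
    learning_rate_multiplier=0.02,
)
```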
Evaluation: The approach is benchmarked across molecules, materials, and reaction datasets, comparing to state-of-the-art baselines (including Automatminer, CrabNet, ModNet, TabPFN, XGBoost, RF, GPR, and others). Data efficiency is quantified by fitting learning curves to power laws and determining where baseline and GPT‑3 curves intersect (factor of additional data needed for parity in the low-data regime). Validity of generated SMILES is checked using RDKit via Guacamol’s is_valid. For inverse design, sampling temperature is varied to balance validity, novelty, diversity, and property match; synthesizability is assessed via SA score. Some experiments also demonstrate in-context learning (without fine-tuning) and fine-tuning of open-source LLMs (e.g., GPT‑J‑6B with LoRA and 8-bit quantization).
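One way to implement the data-efficiency analysis described above is sketched below; the power-law functional form and the reference training-set size are assumptions, and the learning-curve points in the usage comment are made up.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    # Learning-curve model: error(n) = a * n**(-b)
    return a * np.power(n, -b)

def data_factor_for_parity(n_gpt, err_gpt, n_base, err_base, n_ref=50):
    """Factor of extra data a baseline needs to match GPT-3 at n_ref points."""
    (a_g, b_g), _ = curve_fit(power_law, n_gpt, err_gpt, p0=(1.0, 0.5))
    (a_b, b_b), _ = curve_fit(power_law, n_base, err_base, p0=(1.0, 0.5))
    target_err = power_law(n_ref, a_g, b_g)       # GPT-3 error at n_ref points
    n_needed = (a_b / target_err) ** (1.0 / b_b)  # baseline size for same error
    return n_needed / n_ref

# Example with made-up learning-curve points:
# factor = data_factor_for_parity([10, 20, 50], [0.40, 0.30, 0.20],
#                                 [10, 50, 500], [0.50, 0.35, 0.20])
```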
- Fine-tuned GPT‑3 matches or outperforms specialized ML models in low-data regimes across diverse chemistry tasks. In high-data regimes, conventional ML often catches up.
- High-entropy alloys: With roughly 50 training points, GPT‑3 achieves accuracy comparable to a random forest model trained on approximately 1,126 points (from a 1,252-point dataset with 10-fold CV). It also outperforms simple rule-based and several automated baselines (Automatminer, CrabNet) at low data volumes.
- Classification across molecules/materials/reactions: Data-efficiency analyses (Extended Data Table 2) show GPT‑3 generally requires fewer data to reach parity with strong baselines (e.g., TabPFN, ModNet, XGBoost, GPR) for properties such as HOMO-LUMO gap class, solubility classes, lipophilicity classes, alloy phases, Henry coefficients (CO2/CH4), heat capacity categories, and cross-coupling reaction outcomes.
- Representation robustness: Good performance is observed across IUPAC, SMILES, and SELFIES; often IUPAC performs best, simplifying use by non-specialists.
- Regression via discretization: Rounded/quantized targets enable near state-of-the-art performance in some cases, though more data are needed than for classification and the advantage over baselines diminishes.
- Inverse design (photoswitches): GPT‑3 generates valid and often synthesizable molecules (average SA score < 3) that match desired transition wavelengths with a mean absolute percentage error of roughly 10%. Many generated molecules are novel relative to the training set, and some are absent from PubChem. TMAP visualization shows both derivatives of training molecules and entirely new scaffolds (a minimal validity/novelty filtering sketch follows this list).
- Temperature effects: Low temperature yields less diverse outputs and more training-set repeats; moderate temperatures improve diversity and distribution match (minimum Fréchet ChemNet distance); very high temperatures reduce validity.
- Coarse-grained polymer dispersants: Despite abstract representations, GPT‑3 predicts adsorption free energies better than prior ML baselines and supports inverse design with mean percentage error about 22% (noting the ground-truth approximation itself has ~9% error).
- HOMO-LUMO gap studies: With only 500 samples for training, GPT‑3 provides reasonable estimates; inverse design yields novel molecules not in training nor QMugs. Extrapolation beyond training range is demonstrated: models trained only on gaps <3.5 eV can generate molecules with computed gaps >4.0 eV. Iterative biasing with quantum evaluations shifts the generated distribution toward very large gaps (>5 eV) over several fine-tuning iterations.
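The validity, novelty, and synthesizability checks applied to generated molecules (see the photoswitch bullet above) can be reproduced with a short RDKit-based filter. A minimal sketch, assuming the SA-score implementation bundled in RDKit's Contrib directory; the authors use Guacamol's is_valid, which performs the same RDKit parse check.

```python
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# RDKit ships the synthetic-accessibility (SA) scorer in its Contrib directory.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # noqa: E402

def filter_generated(generated_smiles, training_smiles, sa_threshold=3.0):
    """Keep generated SMILES that are valid, novel, and likely synthesizable."""
    train_canonical = set()
    for s in training_smiles:
        mol = Chem.MolFromSmiles(s)
        if mol is not None:
            train_canonical.add(Chem.MolToSmiles(mol))

    kept = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                      # invalid SMILES
            continue
        canonical = Chem.MolToSmiles(mol)
        if canonical in train_canonical:     # training-set repeat, not novel
            continue
        if sascorer.calculateScore(mol) > sa_threshold:
            continue                         # likely hard to synthesize
        kept.append(canonical)
    return kept
```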
The study shows that a general-purpose LLM, fine-tuned with natural language prompts, can effectively learn correlations in small chemistry datasets and provide competitive predictions across tasks traditionally addressed by specialized models. This addresses the core question of whether LLMs trained on general text can be adapted to predictive chemistry and materials science tasks. Advantages include strong low-data performance, flexible input representations (including abstract encodings), and ease of applying the same training interface across heterogeneous tasks. Inverse design becomes straightforward by inverting the prompt/completion.
The findings suggest LLMs can bootstrap projects similarly to literature searches by leveraging broad, encoded knowledge and quickly establishing useful baselines. However, as datasets grow, specialized models may match or surpass performance, and LLM-derived correlations are not necessarily causal, necessitating further analysis and validation. The observed representation insensitivity and extrapolation capabilities point to LLMs’ potential to generalize beyond seen examples, but careful verification (computational and experimental) remains essential.
The paper demonstrates that fine-tuned GPT‑3 can serve as a versatile, data-efficient predictive and generative tool in chemistry and materials science. It achieves competitive or superior performance to specialized models in low-data regimes, supports regression via discretized outputs, and enables simple yet effective inverse design that can produce valid, synthesizable, and novel molecules, including extrapolation beyond training ranges. The uniform, natural-language interface lowers barriers for non-specialists and can act as a strong baseline for future studies.
Future work includes optimizing fine-tuning strategies and representations, integrating broader and more recent datasets, systematically validating generated candidates (computationally and experimentally), probing and interpreting learned correlations to approach causal understanding, and expanding support for open-source LLMs and efficient fine-tuning techniques. The authors anticipate querying pre-trained LLMs will become routine to bootstrap predictive tasks and guide early-stage materials and molecular discovery.
- Fine-tuning was not exhaustively optimized (prompt formats, tokenization tailored to chemical strings, epochs, learning rates), leaving potential performance gains unexplored.
- True continuous regression is not performed; discretization/rounding is required, which can limit precision and may demand more data than classification.
- Data-efficiency comparisons emphasize the low-data regime and binary settings; in high-data regimes, specialized baselines often catch up or surpass GPT‑3.
- Validity checks for generated molecules rely on SMILES parseability (RDKit) and SA scores; chemical feasibility and synthesizability require deeper validation and experimental confirmation.
- Some property evaluations (e.g., photoswitch wavelengths) use surrogate models (GPR) rather than experiments; ground-truth approximations carry their own errors.
- The GPT‑3 pretraining corpus (up to Oct 2019) may omit relevant structured datasets and newer literature.
- Observed correlations enabling predictions are not necessarily causal; interpretations should be made cautiously.