A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Y. Zhang, X. Chen, et al.

This paper provides a comprehensive survey of over 260 scientific LLMs, unveiling cross-field and cross-modal connections in architectures and pre-training techniques, summarizing pre-training datasets and evaluation tasks for each field and modality, and examining deployments that accelerate scientific discovery. Resources are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models. This research was conducted by Yu Zhang, Xiusi Chen, Bowen Jin, Sheng Wang, Shuiwang Ji, Wei Wang, and Jiawei Han.

Introduction
The paper addresses the need for a holistic, cross-field and cross-modal survey of scientific large language models (LLMs). Motivated by the rapid expansion of LLMs beyond text into other scientific data types (e.g., molecules, proteins, tables, images, genomes), the authors note that prior surveys typically focus on individual fields or a single modality. The research goal is to systematically map architectures, pre-training strategies, datasets, and evaluation tasks across domains; to identify commonalities and interconnections; and to examine how LLMs contribute to scientific discovery. The introduction outlines three main pre-training paradigms used across scientific LLMs: masked language modeling for encoder models, next-token prediction (often with instruction tuning) for encoder-decoder and decoder-only models, and contrastive learning with dual encoders. It also shows how diverse scientific data are sequentialized or paired to fit these paradigms. Overall, the paper aims to provide a comprehensive overview that guides the future design and deployment of scientific LLMs, emphasizing their role in accelerating hypothesis generation, planning, reasoning, and experimentation.
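To make these three paradigms concrete, the following minimal PyTorch sketch contrasts their training objectives; the function names, tensor shapes, and the 0.07 temperature are illustrative assumptions rather than details drawn from the survey.

import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    # Encoder paradigm: predict the original tokens only at masked positions;
    # unmasked positions carry the label -100 and are ignored by the loss.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)

def next_token_loss(logits, input_ids):
    # (Encoder-)decoder paradigm: predict token t+1 from tokens up to t.
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))

def contrastive_loss(text_emb, other_emb, temperature=0.07):
    # Dual-encoder paradigm: pull paired (text, molecule/image/protein) embeddings
    # together and push mismatched pairs apart (InfoNCE in both directions).
    text_emb = F.normalize(text_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = text_emb @ other_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))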
Literature Review
The survey situates itself among prior domain-specific reviews that concentrate on limited fields (e.g., biomedicine, chemistry) or on text-only LLMs. It references existing scientific LLMs in general science, mathematics, physics, chemistry and materials science, biology and medicine, and geosciences, discussing their typical datasets (e.g., AMiner, MAG, Semantic Scholar for papers; UniRef for proteins; GRCh38 and 1000 Genomes for DNA; ZINC and ChEBI for molecules; MIMIC for EHR and medical images; ERA5 and CMIP6 for climate time series), model backbones (BERT-like encoders, GPT/LLaMA-like decoders, and dual encoders for contrastive learning), and common evaluation tasks (NER, RE, QA, retrieval, classification, reasoning, image-text tasks, property prediction, forecasting). The review underscores architectural analogies across modalities (language, graph, vision, table, molecule, protein, genome, and time series) and reports recent trends such as instruction tuning and preference optimization, multimodal integration, and specialized models per subdomain.
Methodology
This is a systematic survey of over 260 scientific LLMs spanning multiple fields and modalities. The authors analyze and categorize models by: (1) pre-training strategy, namely masked language modeling for encoder models, next-token prediction (often with instruction tuning) for encoder-decoder and decoder-only models, and contrastive dual-encoder learning; (2) how different scientific data types are either sequentialized (e.g., SMILES/SELFIES for molecules, flattened tables, visual tokens from images, residue and nucleotide sequences for proteins and genomes) or preserved with dedicated encoders (e.g., graph or vision encoders); (3) pre-training datasets and benchmarks per field and modality; and (4) evaluation tasks. Cross-field architectural connections are mapped in Figure 1, and structured summary tables (A1–A6) detail each model's modality, parameter size, architecture, pre-training data/tasks, and evaluations. The survey also documents applications of LLMs in scientific discovery across stages such as hypothesis generation, theorem proving, experiment design, drug discovery, and weather forecasting.
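As a concrete illustration of the sequentialization step described above, the sketch below canonicalizes a molecule as SMILES, re-encodes it as SELFIES, and flattens a tiny table into one token sequence. It relies on the open-source rdkit and selfies packages; the helper names, separators, and example table are hypothetical choices, not formats prescribed by the paper.

from rdkit import Chem   # molecule parsing and canonicalization
import selfies           # robust string encoding of molecules

def molecule_to_strings(smiles):
    # Canonicalize the SMILES string, then also encode it as SELFIES.
    canonical = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    return canonical, selfies.encoder(canonical)

def linearize_table(header, rows):
    # Flatten a table into a single sequence of "column: value" pairs.
    cells = [" | ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in rows]
    return " [ROW] ".join(cells)

print(molecule_to_strings("CCO"))                                        # ethanol
print(linearize_table(["name", "boiling point (C)"], [["ethanol", 78.4]]))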
Key Findings
- Scientific LLMs across domains can be grouped by three predominant pre-training paradigms: (1) masked language modeling with encoder-only architectures; (2) autoregressive next-token prediction (often with instruction tuning and preference optimization) for encoder-decoder and decoder-only architectures; and (3) contrastive dual-encoder learning to align paired modalities (text-text, text-graph, text-image, text-protein).
- Diverse scientific data are effectively sequentialized or tokenized: molecules via SMILES/SELFIES; biological sequences (proteins/DNA/RNA) via FASTA/k-mer representations (see the k-mer sketch after this list); tables via linearization; images via visual token encoders; graphs via linearization (e.g., SMILES), adapters, or graph encoders.
- The survey covers 260+ LLMs across general science, mathematics, physics, chemistry/materials, biology/medicine, and geosciences, spanning modalities such as language, graph, vision, table, molecule, protein, genome, and climate time series, with sizes from ~100M to ~100B parameters.
- Common datasets and benchmarks are cataloged per field (e.g., S2ORC, Semantic Scholar; UniRef, Swiss-Prot; GRCh38, 1000 Genomes; ZINC, ChEBI; MIMIC-CXR, ROCO; ERA5, CMIP6), alongside typical evaluation tasks (NER, RE, QA, retrieval, classification, reasoning, vision-language tasks, molecular property prediction, forecasting).
- Instruction tuning has become pivotal for domain adaptation, enabling complex scientific reasoning and dialogue (e.g., Med-PaLM, Galactica, domain-specific LLaMA-based models).
- LLMs demonstrate tangible scientific utility: hypothesis and idea generation, mathematical theorem solving with hybrid systems, autonomous experiment planning in chemistry and biology, sequence-based protein design and structure prediction, and climate/weather forecasting via Transformer-based foundation models.
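As referenced in the list above, here is a minimal sketch of overlapping k-mer tokenization for genomic sequences; k = 6 and a stride of 1 are illustrative assumptions, since genome LLMs differ in their exact tokenization schemes.

def kmer_tokenize(sequence, k=6, stride=1):
    # Slide a window of length k over the nucleotide sequence to produce tokens.
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ACGTACGTAC"))
# ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']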
Discussion
The findings meet the survey's objective by revealing cross-field and cross-modal architectural commonalities, offering a unified lens through which to design and adapt scientific LLMs. Mapping techniques (masked language modeling, next-token prediction with instruction tuning, contrastive multimodal alignment) across language, graph, vision, and sequence data illustrates transferable design principles and training recipes. This synthesis clarifies how scientific data can be sequentialized or integrated through modality-specific encoders and adapters, informing future model development. The survey also emphasizes the growing role of LLMs across scientific discovery pipelines (brainstorming, reasoning, experiment design, and evaluation), highlighting their relevance to accelerating research while pointing to the need for robustness, trustworthiness, and better handling of out-of-distribution data.
Conclusion
The paper consolidates a large body of scientific LLM research, categorizing models across fields and modalities, summarizing pre-training corpora, architectures, tasks, and evaluations, and documenting applications in scientific discovery. It contributes a cross-field, cross-modal perspective and structured summary tables to guide future designs. The authors outline future directions: (1) building fine-grained, theme-focused resources (e.g., knowledge graphs) to preserve specialized domain signals; (2) improving generalization to out-of-distribution scientific data (e.g., via invariant learning); and (3) enhancing trustworthiness with cross-modal retrieval-augmented generation that leverages heterogeneous scientific data. These directions aim to make scientific LLMs more reliable, specialized, and effective in accelerating discovery.
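As a rough indication of what the retrieval step in such a retrieval-augmented pipeline could look like, the sketch below ranks pre-embedded evidence by cosine similarity and prepends the top hits to a prompt; the embedding source and prompt template are placeholders, not components described in the paper.

import numpy as np

def retrieve(query_emb, doc_embs, docs, k=3):
    # Rank candidate documents by cosine similarity to the query embedding.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(question, evidence):
    # Prepend retrieved evidence so the generator can ground its answer in it.
    context = "\n".join(f"[{i + 1}] {e}" for i, e in enumerate(evidence))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"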
Limitations
The survey primarily focuses on mathematics and natural sciences, leaving social science LLMs and agent-based social simulations for future work. It centers on models pre-trained or augmented with scientific domain data, excluding studies that only benchmark general-purpose LLMs on scientific tasks. Some LLMs could belong to multiple field/modality categories under the chosen taxonomy (e.g., BioMedGPT, GIT-Mol); for brevity, they are presented in one subsection. Space and scope constraints prevent exhaustive coverage of all related efforts.