Earth Sciences
ChatClimate: Grounding conversational AI in climate science
S. A. Vaghefi, D. Stammbach, et al.
The study addresses how to reduce hallucinations and overcome outdated knowledge in large language models for climate-related question answering. While LLMs excel at closed-book QA, they often generate incorrect or unverifiable statements and lack up-to-date content post-training. In climate change contexts, accuracy, reliability, and timely references are crucial for informing decisions and policies. The authors hypothesize that grounding LLM responses in authoritative, current sources—specifically IPCC AR6—via retrieval-augmented generation and carefully engineered prompts can improve factual accuracy, reduce hallucinations, and provide verifiable citations. The purpose is to build and test ChatClimate, a climate-domain conversational AI that augments GPT-4 with IPCC AR6 content, and to evaluate its performance against baseline GPT-4 and an IPCC-only variant on a set of challenging climate questions.
The paper situates its work within advances in LLMs (e.g., GPT, BERT, T5, PaLM, LLaMA, GPT-3/3.5/4) and notes their strengths and limitations, including hallucination and knowledge staleness. Prior work shows LLMs can function as implicit knowledge bases and perform well on multiple-choice and zero/few-shot QA, but factuality and updating remain concerns. Retrieval from external sources has been explored to improve factual grounding. In the climate domain, NLP has been applied to financial climate disclosures, detection and verification of environmental claims, and specialized datasets and models (e.g., ClimateBERT, climate claims fact-checking). Conversational AI and QA systems are recognized as valuable for bridging complex climate science and public understanding, yet risk issues and the need for trustworthy sources are emphasized. The IPCC AR6 reports are highlighted as comprehensive, authoritative, and current sources well-suited for grounding climate QA.
The authors build ChatClimate, a retrieval-augmented conversational AI grounded in IPCC AR6.

Data pipeline: (1) convert the IPCC AR6 PDFs (7 documents; specifics in the Supplementary Information) to JSON with a Python PDF parser; (2) split the text into manageable chunks using LangChain; (3) embed each chunk with OpenAI's text-embedding-ada-002; (4) store the embeddings in a vector database (inserted in batches per Pinecone V2 guidelines) for efficient semantic search; (5) at query time, embed the user question, retrieve the top-k nearest chunks by dot-product similarity, decode them back to text, and augment the prompt sent to GPT-4.

Prompting and chatbot configurations: three scenarios are compared: (a) hybrid ChatClimate, which supplies five retrieved IPCC AR6 snippets while also allowing the model's in-house knowledge, requiring explicit citation of IPCC sources with page numbers and a clear indication of which statements come from the IPCC versus in-house knowledge; (b) ChatClimate (IPCC-only), which answers strictly from the retrieved IPCC AR6 content, cites it, and declines when the retrieved information is insufficient; (c) baseline GPT-4 with no external retrieval. Prompt templates (Boxes 1–5) spell out the citation and source-attribution requirements.

Evaluation: the systems are tested on 13 carefully designed questions (Q1–Q13) spanning topics such as the feasibility of limiting warming to 1.5°C, the timing of reaching 1.5°C, overshoot, emissions reductions, adaptation finance, climate justice, maladaptation, disproportionate regional impacts, and a tricky domain-specific question about glaciers in Scotland. Answers are expert-validated for accuracy and citation quality. Sensitivity analyses examine the impact of prompt engineering and of the retrieval hyperparameter top-k (set to 5, 10, 15) on completeness and accuracy; additional analyses cross-check references and discuss the system's current inability to process tables and figures.
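Retrieval steps (3)–(5) of the pipeline can be sketched in a minimal, self-contained form. This is not the authors' code: a toy bag-of-words embedding stands in for text-embedding-ada-002, an in-memory NumPy matrix stands in for Pinecone, and all function names (`embed`, `build_index`, `retrieve`, `augment_prompt`) are illustrative.

```python
import numpy as np

# Toy stand-in for OpenAI's text-embedding-ada-002: a deterministic
# bag-of-words vector, normalized so dot product acts as similarity.
def embed(text, dim=64):
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Step 4 (stand-in): stack chunk embeddings into one matrix instead of
# upserting them in batches to a Pinecone index.
def build_index(chunks):
    return np.stack([embed(c) for c in chunks])

# Step 5: embed the query, score every chunk by dot product, keep top-k.
def retrieve(query, chunks, index, k=5):
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Augment the prompt with the retrieved snippets before calling GPT-4.
def augment_prompt(query, snippets):
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using the IPCC AR6 excerpts below; cite report and page.\n"
        f"{context}\n\nQuestion: {query}"
    )

chunks = [
    "AR6 WGI: global surface temperature will continue to increase.",
    "AR6 WGII: adaptation finance remains insufficient in many regions.",
    "AR6 WGIII: rapid emissions reductions are needed to limit warming.",
]
index = build_index(chunks)
snippets = retrieve("When will warming reach 1.5C?", chunks, index, k=2)
print(augment_prompt("When will warming reach 1.5C?", snippets))
```

In the real system the augmented prompt then goes to GPT-4; the hybrid versus IPCC-only behavior is controlled entirely by the prompt template's instructions, not by the retrieval code.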
- Across the 13 questions, expert evaluation indicates that hybrid ChatClimate achieved higher accuracy than both the GPT-4 baseline and IPCC-only ChatClimate. For Q1 (Is it still possible to limit warming to 1.5°C?), accuracy scores were hybrid 5, ChatClimate 4, GPT-4 3; for Q2 (When will we reach 1.5°C?), hybrid 5, ChatClimate 4, GPT-4 2. Hybrid and IPCC-only ChatClimate consistently provided explicit IPCC citations with page numbers, whereas GPT-4's responses were more general and less consistently cited.
- Prompt engineering improved performance: in a prompt comparison (Table 3), Hybrid Prompt-1 scored 3.5 in accuracy while Hybrid Prompt-2 scored 5 on the same task, indicating substantial sensitivity to prompt design.
- Retrieval hyperparameters influenced completeness and accuracy: for a regional-impacts question (Table 4), varying top-k yielded accuracy scores of 3 (k = 5), 3 (k = 10), and 4 (k = 15), showing that more retrieved context can improve coverage.
- The IPCC-only (standalone) variant better avoided hallucinations on tricky questions where in-house knowledge could mislead. For Q13, about glaciers in Scotland, the hybrid system incorrectly implied that glaciers currently exist there (drawing on in-house knowledge), while the IPCC-only system correctly noted insufficient information and thereby avoided a factual error.
- An additional comparison on Q2 across standalone configurations reported accuracy scores of 4 for standalone IPCC SYR, 3 for ChatClimate (IPCC AR6 reports), and 5 for standalone ChatWMO, reflecting that the choice of external source corpus and task framing affects performance.
- Overall, integrating authoritative, up-to-date IPCC AR6 content improved factuality, reduced hallucination relative to baseline GPT-4, and produced verifiable, cited responses; these effects were further enhanced by careful prompt engineering and appropriate retrieval settings.
The results support the hypothesis that augmenting LLMs with retrieval from trustworthy, domain-specific sources increases answer accuracy and reliability for climate QA. The hybrid approach benefits from both IPCC grounding and in-house knowledge, but the latter can reintroduce hallucination risks in edge cases; the IPCC-only mode, in contrast, guards against unsupported claims but may be less comprehensive. Prompt engineering has a pronounced impact on performance, indicating that clear instructions and explicit citation requirements help models use retrieved evidence effectively. The retrieval hyperparameter top-k affects sufficiency and completeness, with larger k improving coverage in some cases but not uniformly. The findings underscore the importance of authoritative data sources, explicit sourcing, and post-answer reference cross-checking for trustworthy climate communication. While the tool eases access to complex reports and supports decision-making, it is not a decision-making system; human oversight and domain expertise remain essential, especially to catch residual hallucinations or biases from external data and to interpret nuanced, context-dependent climate information.
The study presents ChatClimate, a retrieval-augmented GPT-4 system grounded in IPCC AR6 that improves factuality, citation quality, and timeliness for climate QA. Empirically, the hybrid ChatClimate outperforms baseline GPT-4 and often the IPCC-only variant across 13 questions, aided by careful prompt engineering and tuned retrieval. Key contributions include: demonstrating that authoritative external memory reduces hallucinations and addresses knowledge staleness; showing the importance of prompt design and hyperparameter choices in retrieval; and providing a practical pipeline to convert IPCC reports into searchable vectors for LLM use. Future directions include automated, authoritative fact-checking pipelines; richer evaluation protocols (e.g., inter-annotator agreement, transparent query generation); expanded and regularly updated corpora aligned with new IPCC releases; improved retrieval completeness and chunking strategies; and multi-modal capabilities to handle tables and figures. The authors emphasize that such tools should complement, not replace, expert-driven decision-making.
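The pipeline's report-to-vectors conversion hinges on the chunking step, which the paper delegates to LangChain. A minimal sketch under simple assumptions (a fixed character window with overlap, so a sentence cut at one boundary reappears whole in the neighboring chunk) conveys the idea; `chunk_text` and its parameters are illustrative, not the library's API.

```python
def chunk_text(text, size=200, overlap=40):
    # Slide a fixed-size window across the text. Consecutive chunks
    # share `overlap` characters so context cut at a boundary is not lost.
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks

report = "Climate change is widespread, rapid, and intensifying. " * 20
chunks = chunk_text(report, size=200, overlap=40)
```

Chunk size trades off retrieval granularity against context: smaller chunks match queries more precisely but may omit surrounding detail, which is one reason the paper flags chunking strategy and top-k as levers for retrieval completeness.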
- Evaluation relied on expert judgment without a fully specified protocol (e.g., inter-annotator agreement); broader, standardized evaluation is needed.
- The system currently cannot query tables or interpret figures; multi-modal capabilities are planned.
- Retrieval completeness and sufficiency depend on hyperparameters (top-k) and chunk size; semantic search may miss critical text.
- The hybrid mode's allowance for in-house knowledge can reintroduce hallucination risks, as the Scotland glaciers example shows.
- The corpus is centered on IPCC AR6 (plus some related reports in certain comparisons), which may limit coverage of localized or emerging information.
- Chain-of-thought strategies were not fully explored.
- External data sources may themselves contain inaccuracies or biases; human supervision and reference cross-checking are still required.
- The work is prototype-level testing with 13 questions; generalizability across broader question sets and user populations remains to be established.