Testing the reliability of an AI-based large language model to extract ecological information from the scientific literature

Environmental Studies and Forestry

A. V. Gougherty and H. L. Clipp

This research by Andrew V. Gougherty and Hannah L. Clipp shows that a large language model (LLM) can extract ecological data from scientific literature more than 50 times faster than a human reviewer, with high accuracy for most fields. The study highlights the potential for building extensive ecological databases, along with the essential need for quality assurance to ensure data integrity.

Introduction
The study investigates whether a large language model can reliably extract ecological information from scientific literature, addressing concerns about LLM hallucinations, bias, and lack of transparency. The context is the surge in AI chatbots capable of processing large volumes of text and their proposed role in offloading laborious tasks in science. The authors focus on emerging infectious diseases (EIDs) of plants reported as first detections on new hosts or in new regions, which are abundant and important for understanding invasive species spread and management. The research question asks how quickly and accurately an LLM can extract key ecological data (pathogen, host, timing, location, incidence, and coordinates) compared with a human reviewer, and what types of data are more or less reliably extracted.
Literature Review
The paper situates the work within broader discussions on LLM capabilities and risks, including hallucinations, bias from training data, and limited transparency. Prior commentary suggests LLMs are most effective on simple, straightforward tasks not requiring multi-step reasoning. The authors also reference the growing role of AI in ecology and biodiversity, and note multilingual coverage can reduce biases in literature synthesis. Environmental costs of expanding LLM usage are highlighted as an area needing reflection.
Methodology
- Source texts: The first 100 short, data-dense disease reports of emerging infectious tree diseases from a prior global study (ref. 7), each documenting a pathogen detected on a new host or in a new geographic region.
- LLM: Google's text-bison-001 (a generative text model), accessed via API to enable a fully scripted workflow and chosen because it returns concise, structured outputs without conversational padding.
- Extraction targets: (i) pathogen scientific name, (ii) host scientific name(s), (iii) pathogen incidence (% of infected hosts), (iv) year sampled/observed, (v) location where observed, (vi) country, and (vii) latitude/longitude in decimal degrees.
- Prompt engineering: Iterative refinement to specify formats (e.g., requiring decimal degrees for coordinates and scientific rather than common names). Output was requested as a table string with vertical-bar delimiters to avoid ambiguity from commas in location names.
- Workflow: For each report, its title and text were appended to a standardized prompt detailing the required items and formats. The LLM's tabular text output was parsed programmatically. A human reviewer independently extracted the same variables from the reports for comparison.
- Evaluation: Total extraction time for the LLM versus the reviewer was measured, and agreement/accuracy was assessed across data types. Agreement metrics included exact-match rates and Cohen's Kappa (with confidence intervals) where applicable. For coordinates, decimal-degree conversion accuracy and absolute differences in latitude/longitude were quantified; geocoding behavior was also assessed when coordinates were absent.
- Timing: Total time for the LLM to process all 100 reports versus manual review time was recorded to compare efficiency.
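The prompt-and-parse workflow above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact prompt wording, the field list, and the helper names (build_prompt, parse_llm_table) are assumptions, and the API call to text-bison-001 is omitted.

```python
# Illustrative sketch of the scripted extraction workflow: a standardized
# prompt is built per report, and the LLM's bar-delimited table string is
# parsed into named fields. Field names and prompt text are assumptions.

FIELDS = [
    "pathogen", "host", "incidence_pct", "year",
    "location", "country", "latitude", "longitude",
]

PROMPT_TEMPLATE = (
    "From the disease report below, extract: pathogen scientific name | "
    "host scientific name(s) | incidence (% infected hosts) | year observed | "
    "location | country | latitude (decimal degrees) | longitude (decimal degrees). "
    "Return one line with fields separated by vertical bars (|). "
    "Use scientific names, not common names.\n\nTITLE: {title}\n\nTEXT: {text}"
)

def build_prompt(title: str, text: str) -> str:
    """Append a report's title and text to the standardized prompt."""
    return PROMPT_TEMPLATE.format(title=title, text=text)

def parse_llm_table(raw: str) -> dict:
    """Split the LLM's bar-delimited output into named fields.

    Vertical bars avoid the ambiguity of commas, which routinely
    appear inside location strings.
    """
    parts = [p.strip() for p in raw.strip().split("|")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

In the study itself, each built prompt was sent to text-bison-001 via the API and the returned table string was parsed programmatically in this fashion.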
Key Findings
- Speed: The LLM completed data extraction in ~5 minutes versus ~268 minutes for the human reviewer (a more than 50-fold speedup).
- Pathogen identity: 98.1% match (101/103) with the reviewer; Cohen's Kappa = 0.98 (CI 0.95–1.0). The discrepancy involved alder yellows, for which the LLM returned a species-like binomial not explicitly stated in the report.
- Host identity: 91.7% exact matches (121/132); Kappa = 0.92 (CI 0.87–0.96). Errors were mostly omissions when multiple hosts were listed.
- Year observed: Overall accuracy of 72.1% across 147 cases: 106 true positives, 11 false positives, 14 mismatches, 15 omissions, and 1 commission. Mismatches often involved multi-year observation ranges.
- Country: 100% exact matches with the reviewer (Kappa = 1.0).
- Coordinates when supplied (N = 34 reports; 44 unique first-record locations; 46 unique locations identified by both):
  - Exact matches (beyond negligible rounding): 34.0%.
  - Conversion discrepancies for 16 locations: mean absolute difference of 0.1369 (latitude) and 0.0022 (longitude).
  - Conversion failures for 8 locations: mean absolute difference of 0.1733 (latitude) and 0.1097 (longitude).
  - Additional issues: 4 omissions, 2 commissions, and 1 conflation of two locations.
  - Excluding the latter 7 cases, absolute differences ranged from 0 to 1.8383 (mean 0.1052) for latitude and from 0 to 0.2800 (mean 0.0270) for longitude.
- Context interpretation: The reviewer outperformed the LLM, avoiding survey coordinates that did not represent first records and detecting one report's erroneous longitude.
- Geocoding when coordinates were absent: The LLM produced coordinates for 70 unique locations with 98.6% correct country placement; precision was uncertain, with 3 points falling in bodies of water (40 m to 5.2 km from shore) and one set placed ~5 km into an incorrect country.
- Incidence (quantitative): Overall accuracy of 23.8% across 147 cases: 25 true positives, 10 true negatives, 95 false positives, 1 mismatched value, 15 omissions, and 1 commission. The LLM frequently assigned 100% incidence when none was reported (53 of 100 reports). When the reviewer recorded an incidence value, the LLM matched it 96.2% of the time (25/26).
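The agreement metrics reported above can be illustrated with a short sketch. The paper does not publish its analysis code, so these helpers are assumptions: an unweighted Cohen's kappa over two raters' labels, and a mean absolute difference as used for the latitude/longitude comparisons.

```python
# Illustrative agreement metrics: Cohen's kappa for categorical fields
# (pathogen, host, country) and mean absolute difference for coordinates.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Unweighted Cohen's kappa for two raters' labels on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

def mean_abs_diff(xs: list, ys: list) -> float:
    """Mean absolute difference, used here for lat/lon comparisons."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

For example, the paper's country result (every label matching) yields kappa = 1.0, while partial agreement on pathogen or host names yields values below 1.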
Discussion
The findings show that an LLM can rapidly and accurately extract discrete and categorical ecological information (pathogen and host scientific names, countries, and often years) from short, data-dense reports, substantially accelerating database creation. However, the model struggled with some quantitative tasks—most notably incidence estimates—and with reliably converting or interpreting geographic coordinates, sometimes returning spurious values or failing to detect contextual cues (e.g., survey versus first-record coordinates, erroneous inputs). The LLM’s ability to distinguish pathogen versus host entities, including parasitic plants as pathogens, suggests effective use of context rather than reliance on formatting alone. Automatic geocoding adds value for subsequent environmental data linkage, though precision and internal consistency checks are lacking without explicit validation steps. Overall, LLM-assisted extraction can address the research question by enabling large-scale, timely synthesis of ecological records, but must be paired with quality assurance to ensure data integrity, especially for quantitative variables.
Conclusion
LLMs can dramatically increase the speed of building ecological databases from literature with high accuracy for discrete and categorical fields, offering a path to scale surveillance and analyses of emerging plant diseases and invasions. Nevertheless, quantitative extractions (e.g., incidence) and coordinate handling require additional safeguards. Future work should include rigorous quality assurance protocols, fine-tuning or domain-specific training to improve quantitative extraction, systematic comparisons across multiple LLMs, evaluation on longer and more complex texts, and methods to assess and improve geocoding precision and internal consistency. Broader considerations include leveraging multilingual capabilities to reduce literature bias and assessing environmental costs of LLM use.
Limitations
- Source selection bias: Reports were known a priori to contain relevant pathogen, host, and geographic data, and were short and data-dense; performance on longer, more complex articles remains uncertain.
- Task simplicity: Extracted variables were relatively simple (names, country, year), potentially inflating accuracy relative to more complex tasks (e.g., effect sizes for meta-analyses).
- Quantitative shortcomings: Poor performance on incidence extraction, frequent assignment of 100% when data were absent, and challenges converting coordinates to decimal degrees.
- Context interpretation limits: Inability to detect mismatched or non-first-record coordinates or to validate internal consistency without explicit instruction.
- Geocoding precision: While country placement was highly accurate, precise locations were sometimes implausible (e.g., points in water), and the geocoding method is opaque.
- General LLM concerns: Potential training-data biases, lack of transparency, and environmental costs associated with large-scale LLM usage.
- Single-model case study: Results pertain to one publicly available LLM (text-bison-001); performance may differ across models and versions.