Medicine and Health

From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs

Y. Hou, J. Yeung, et al.

In a groundbreaking study, researchers Yu Hou, Jeremy Yeung, Hua Xu, Chang Su, Fei Wang, and Rui Zhang unveil the comparative capabilities of ChatGPT and Biomedical Knowledge Graphs (BKGs) in biomedical knowledge discovery and reasoning tasks. Discover how ChatGPT outstrips previous models while BKGs remain the more reliable source of structured, verifiable information.

~3 min • Beginner • English
Introduction
Large Language Models (LLMs) such as GPT-3.5 and GPT-4, trained on vast text corpora, can generate human-like text and support tasks including question answering. Concurrently, knowledge graphs provide structured, semantically linked representations that enable discovery, entity linking, and semantic querying. Biomedical Knowledge Graphs (BKGs) integrate curated databases and literature-extracted relations across heterogeneous biomedical entities and relations. Despite their promise, LLMs can produce erroneous or inconsistent answers in biomedical domains. This study evaluates ChatGPT against BKGs on biomedical question answering, knowledge discovery (e.g., repurposing drugs and dietary supplements), and knowledge reasoning (establishing associations between entities), aiming to clarify their complementary roles and illuminate opportunities for integration.
Literature Review
The paper situates the work within ongoing developments in BKG construction and application, including the integration of curated sources and literature-mined relations to support tasks such as drug repurposing, interpretation of omics data, and precision medicine knowledgebases. Prior efforts include integrative biomedical hubs and heterogeneous networks for Alzheimer's disease (AD), as well as knowledge-driven drug repurposing using comprehensive drug KGs. In parallel, the literature has assessed LLMs' capabilities and risks (e.g., hallucinations) and explored combining KGs with conversational agents to enrich responses. This study builds on these strands by empirically comparing ChatGPT (GPT-3.5 and GPT-4.0) with BKGs for biomedical Q&A, knowledge discovery, and reasoning.
Methodology
Compared systems: (1) iDISK, an integrated dietary supplement (DS) knowledge base standardized from multiple authoritative DS resources, used for DS-related exploration; (2) iBKH, an integrative biomedical knowledge hub aggregating 18 curated sources with ~2.2M entities of 11 types and 45 relation types across 18 categories, used for drug-related tasks; and (3) ChatGPT (GPT-3.5 and GPT-4.0).

Question answering: 50 questions were initially sampled (5 per category) from the Yahoo! Answers "Alternative Medicine" category, of which 43 were ultimately evaluated. For ChatGPT, questions were entered as prompts and the responses recorded. For the BKGs, each question was mapped to entities and relations (e.g., identifying the subject concept ID and a relation such as has_adverse_reaction in iDISK), connected entities were retrieved, and the returned records were converted into natural-language answers.

Evaluation: Scoring followed TREC LiveQA guidelines, with two medically trained experts rating each response from 0 to 3 (0 incorrect/poor; 1 incorrect but related/fair; 2 correct but incomplete/good; 3 correct and complete/excellent). Metrics included the average score and succ@k+ (the proportion of responses scoring at least k) for k = 1, 2, 3. Statistical comparisons used t-tests or Mann-Whitney U tests depending on normality (assessed via QQ-plots), performed in R with the car package.

Knowledge discovery (AD repurposing): Prompts asked ChatGPT to suggest approved drugs not currently used for AD but potentially available for AD treatment, and DSs potentially treating or preventing AD, with rationale; each prompt was repeated 10 times. Suggestions were checked against existing BKGs (iBKH for drugs, ADInt for DSs), clinical trials, and the literature. For BKG-based discovery, knowledge graph embedding (KGE) models learned embeddings for entities and relations, and link prediction identified candidate drug/DS–AD relations absent from the current graph.

Knowledge reasoning: Scenario-based prompts asked ChatGPT to present direct and indirect associations between tested drugs/DSs and AD as structured triplets with references. The BKGs were queried for shortest paths and supporting literature where available, enabling comparison of reasoning quality and citation validity.
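To make the evaluation protocol concrete, the following is a minimal sketch in Python of how the average score, the succ@k+ metric, and the normality-based choice between a t-test and a Mann-Whitney U test could be computed. The study itself ran its statistics in R with the car package and judged normality from QQ-plots, so the Shapiro-Wilk check and the score arrays below are illustrative assumptions, not the study's data or exact procedure.

import numpy as np
from scipy import stats

def succ_at_k_plus(scores, k):
    # Proportion of responses scoring at least k (the succ@k+ metric).
    scores = np.asarray(scores)
    return float(np.mean(scores >= k))

def compare_systems(scores_a, scores_b, alpha=0.05):
    # Pick a parametric or non-parametric test based on normality.
    # The paper judged normality from QQ-plots; a Shapiro-Wilk test is used
    # here purely as a stand-in for that visual check.
    normal = (stats.shapiro(scores_a).pvalue > alpha
              and stats.shapiro(scores_b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(scores_a, scores_b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(scores_a, scores_b).pvalue

# Toy usage with made-up 0-3 ratings (not the study's data).
gpt4_scores = [3, 2, 2, 3, 1, 2, 3, 2, 2, 3]
idisk_scores = [2, 1, 2, 0, 1, 2, 3, 1, 2, 1]
print("average:", np.mean(gpt4_scores))
print("succ@2+:", succ_at_k_plus(gpt4_scores, 2))
print(compare_systems(gpt4_scores, idisk_scores))

Note that succ@k+ is simply the fraction of responses rated at least k, so succ@1+ counts anything better than "incorrect/poor" while succ@3+ counts only "correct and complete" answers.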
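For the BKG-based discovery step, the methodology reports using KGE models and link prediction but does not name a specific model here, so the sketch below uses a TransE-style scoring rule purely as an illustrative assumption. The entities, relations, and random vectors are hypothetical placeholders rather than trained iBKH/ADInt embeddings.

import numpy as np

rng = np.random.default_rng(0)
entities = ["Loperamide", "Caffeine", "Opioid Receptors", "Alzheimer's disease"]
relations = ["TARGET", "ASSOCIATE_WITH", "TREATS"]
dim = 16

# Random vectors stand in for embeddings that a KGE model would learn.
ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {r: rng.normal(size=dim) for r in relations}

def transe_score(head, relation, tail):
    # TransE plausibility: higher (less negative) when head + relation is close to tail.
    return -np.linalg.norm(ent_emb[head] + rel_emb[relation] - ent_emb[tail])

# Rank candidate drugs for a hypothetical (drug, TREATS, Alzheimer's disease)
# link that is absent from the graph, mirroring the link-prediction step above.
candidates = ["Loperamide", "Caffeine"]
ranked = sorted(candidates,
                key=lambda drug: transe_score(drug, "TREATS", "Alzheimer's disease"),
                reverse=True)
print(ranked)

In practice, trained embeddings would be used, candidates would be ranked across all drugs or DSs in the graph, and triples already present in the BKG would be filtered out so that only potentially novel drug/DS-AD relations remain.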
Key Findings
- Q&A performance on 43 questions: GPT-4.0 achieved an average score of 2.12 ± 0.83, outperforming iDISK (1.64 ± 0.81) and GPT-3.5 (1.44 ± 0.98), with p-values < 0.05 for both comparisons against GPT-4.0; GPT-3.5 and iDISK were comparable (p = 0.20). succ@1+: GPT-4.0 = 1.00, iDISK = 0.88, GPT-3.5 = 0.79. succ@2+: GPT-4.0 = 0.70, iDISK = 0.61, GPT-3.5 = 0.51. succ@3+: GPT-4.0 = 0.42, GPT-3.5 = 0.14, iDISK = 0.12.
- Reference reliability: iDISK provided database-level provenance, whereas ChatGPT (both versions) frequently produced fabricated article references.
- Knowledge discovery (ChatGPT): Across 10 iterations (GPT-3.5), suggested drugs (e.g., Levetiracetam 6/10; Lithium 5/10; Ibuprofen 5/10) largely overlapped with existing clinical trials and literature on AD. Of 16 suggested drugs, 12 had direct links to AD in iBKH; the remainder had a shortest path length of 2 (e.g., Drug–[targets]→Gene–[associates]→AD; a toy sketch of such a path query follows this list). Suggested DSs appeared in AD clinical trials and were directly linked to AD in ADInt (DS–[Treats/Prevents]→AD). GPT-4.0 showed greater diversity, but its suggestions still appeared in existing trials and literature.
- Knowledge discovery (BKG): KGE-based link prediction generated drug/DS candidates not yet approved or trialed for AD; example paths with literature support included Loperamide–[TARGET]→Opioid Receptors–[ASSOCIATE_WITH]→AD and Caryophyllus aromaticus–[INHIBITS]→Kynurenine–[AFFECTS]→AD.
- Reasoning and citations: ChatGPT often failed to produce valid structured links and reliable references; in some cases GPT-4.0 reported no association where BKGs revealed plausible paths with literature support. Overall, BKGs were more reliable for structured reasoning and provenance, whereas GPT-4.0 excelled at generating comprehensive Q&A responses.
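As a toy illustration of the path-based reasoning referenced above, the snippet below builds a tiny directed graph with typed edges and extracts a shortest path as (head, relation, tail) triplets. The nodes and edges are hypothetical stand-ins rather than actual iBKH or ADInt content, and Python's networkx library substitutes for whatever graph engine the authors used.

import networkx as nx

# Hypothetical typed edges; the real queries ran against iBKH/ADInt.
g = nx.DiGraph()
g.add_edge("Levetiracetam", "SV2A", relation="targets")
g.add_edge("SV2A", "Alzheimer's disease", relation="associates")
g.add_edge("Donepezil", "Alzheimer's disease", relation="treats")

def explain_path(graph, source, target):
    # Return the shortest path as (head, relation, tail) triplets.
    nodes = nx.shortest_path(graph, source, target)
    return [(h, graph.edges[h, t]["relation"], t)
            for h, t in zip(nodes, nodes[1:])]

print(explain_path(g, "Levetiracetam", "Alzheimer's disease"))
# [('Levetiracetam', 'targets', 'SV2A'), ('SV2A', 'associates', "Alzheimer's disease")]

A path of length 2 such as this one corresponds to the Drug–[targets]→Gene–[associates]→AD pattern described in the findings.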
Discussion
The study’s findings indicate that while GPT-4.0 improves over GPT-3.5 and matches or exceeds BKG-based approaches in generating answers to existing biomedical questions, BKGs remain more reliable due to curated, structured sources and explicit provenance. For knowledge discovery, ChatGPT tended to suggest entities already present in BKGs, clinical trials, or literature, reflecting limited capacity for novel hypothesis generation beyond its training data. In reasoning, BKGs better established structured, verifiable links between entities (e.g., shortest paths with relation types and supporting citations), whereas ChatGPT responses frequently contained unverifiable references or lacked structured paths. These results suggest complementary strengths: LLMs provide fluent, comprehensive responses that can surface relevant known information, while BKGs provide accuracy, structure, and traceable evidence. Integrating LLMs with BKGs could enhance domain-specific performance by combining generative flexibility with authoritative knowledge and provenance.
Conclusion
GPT-4.0 outperformed GPT-3.5 and performed comparably to or better than the BKGs for biomedical Q&A, but it lagged in reliability, reasoning, and novel discovery. BKGs provided more dependable, structured, and citable knowledge. The study recommends integrating LLMs with BKGs to leverage their complementary strengths, improve performance across tasks, and mitigate risks, thereby advancing biomedical knowledge and contributing to individual well-being.
Limitations
- Reference validity: ChatGPT frequently produced fabricated references, necessitating external verification for trustworthiness.
- Scope of evaluation: Q&A was limited to 43 questions from the Yahoo! Answers "Alternative Medicine" category, and discovery focused on AD drug/DS repurposing; findings may not generalize across biomedical domains.
- System coverage: The BKG comparison used specific graphs (iDISK, iBKH, ADInt) and may reflect their coverage and curation quality.
- Manual assessment: Expert scoring for Q&A introduces potential subjectivity despite established guidelines.
- Statistical testing: Test choice depended on normality assessments, and small sample sizes in some comparisons may limit power.