A multimodal generative AI copilot for human pathology

M. Y. Lu, B. Chen, et al.

Discover how PathChat, developed by a team of experts including Ming Y. Lu and Bowen Chen, advances pathology with state-of-the-art performance in answering diagnostic questions and generating responses preferred by pathologists. This AI assistant shows strong potential for enhancing pathology education, research, and clinical decision-making.
Introduction

The study addresses the need for a general-purpose, multimodal AI assistant in human pathology that can integrate visual histopathologic features with natural language to support diagnostic reasoning, education and research. While computational pathology has benefited from large-scale self-supervised vision encoders and specialized predictive models, these developments have not fully leveraged natural language as a medium for supervision, interaction, and knowledge integration. The purpose is to build and evaluate PathChat, a multimodal generative AI that combines a pathology-specialized vision encoder with a large language model to reason over images and text, aiming to improve diagnostic accuracy, interpretability, and utility in real-world, human-in-the-loop workflows. The importance lies in enabling flexible, instruction-driven interactions with pathology images and clinical context, potentially democratizing training, supporting complex diagnostic workflows, and enhancing research productivity.

Literature Review

Prior work in general machine learning shows that large-scale vision–language representation learning augments vision-only models with zero-shot recognition and retrieval capabilities. In medical imaging, researchers have begun leveraging paired biomedical images and text (captions/reports) to train visual–language systems, including CLIP-like models adapted to domains such as pathology and radiology. In computational pathology, some models demonstrate zero-shot performance in select diagnostic and retrieval tasks, and specialized systems have been explored for biomedical visual question answering and image captioning. However, existing efforts are either task-specific or limited in scope, lacking a comprehensive, generalist multimodal assistant optimized for pathologist-facing use cases and interactive reasoning over histology images plus clinical context. This work builds on these trends by integrating a large-scale pathology vision encoder with an LLM and conducting extensive instruction tuning and evaluation, including comparisons with GPT-4 and open-source multimodal baselines.

Methodology

Model development: The authors began with UNI, a state-of-the-art vision-only encoder pretrained via self-supervised learning on approximately 100 million patches from about 100,000 whole-slide images. They then performed vision–language pretraining on 1.18 million pathology image–caption pairs to align visual features with pathology text embeddings. The resulting vision encoder was connected to a 13-billion-parameter pretrained Llama 2 LLM via a multimodal projection module, forming a multimodal large language model (MLLM). The full system was instruction-tuned on a curated dataset of 456,916 pathology-specialized instructions comprising 999,292 conversational turns, spanning formats such as multi-turn dialogues, multiple-choice questions, and short answers.
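The wiring described above — image tokens projected into the language model's embedding space and concatenated with text tokens — can be sketched as follows. This is a minimal illustration with hypothetical dimensions and a plain linear projection; the paper's actual projection module, encoder, and LLM embedding sizes are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only -- the real encoder
# and LLM hidden sizes differ.
D_VISION, D_LLM = 1024, 5120
N_PATCH_TOKENS, N_TEXT_TOKENS = 256, 16

# A learned linear projection mapping vision features into the LLM's
# embedding space (stand-in for the paper's projection module).
W_proj = rng.normal(scale=0.02, size=(D_VISION, D_LLM))

def build_llm_input(image_features, text_embeddings):
    """Project image patch tokens into the LLM embedding space and
    prepend them to the text token embeddings, producing one sequence
    the language model can attend over jointly."""
    projected = image_features @ W_proj          # (n_patches, D_LLM)
    return np.concatenate([projected, text_embeddings], axis=0)

image_features = rng.normal(size=(N_PATCH_TOKENS, D_VISION))
text_embeddings = rng.normal(size=(N_TEXT_TOKENS, D_LLM))
seq = build_llm_input(image_features, text_embeddings)
print(seq.shape)  # (272, 5120)
```

The key design point is that only the projection (and, during instruction tuning, the LLM) needs to learn the mapping between modalities; the vision encoder arrives already specialized for pathology.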

Evaluation design: Multiple-choice diagnostic questions were constructed using ROIs from H&E-stained WSIs drawn from TCGA and an in-house archive, covering 54 diagnoses across 11 organ systems. Two settings were tested: (1) image-only (histology ROI plus question) and (2) image plus clinical context (adding age, sex, history, and radiology findings), to mimic real-world diagnostic workflows. Benchmarks included a combined set (n = 105), PathQABench-Public (n = 52), and PathQABench-Private (n = 53). Comparators included LLaVA-1.5, LLaVA-1.6 and, for public cases, GPT-4V.

Open-ended evaluation: A set of 260 expert-curated open-ended questions spanning microscopy description, grade/differentiation, risk factors, prognosis, treatment, diagnosis, IHC/molecular testing, and ancillary tests was posed to models without task-specific fine-tuning. Seven blinded pathologists independently ranked model responses per question for relevance, correctness, and reasoning quality. Separately, two board-certified pathologists performed blinded binary correctness assessments per case; disagreements were adjudicated to consensus, yielding a 235-question consensus subset for accuracy calculation. Additional subgroup analyses by question category were performed.
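Turning per-rater rankings into head-to-head win/lose/tie rates, as in the evaluation above, amounts to a simple pairwise tally. The sketch below uses hypothetical rankings and model names; it illustrates the aggregation logic, not the paper's exact scoring pipeline.

```python
from collections import Counter

def head_to_head(rankings, model_a, model_b):
    """Tally win/lose/tie rates of model_a versus model_b.

    `rankings` is a list of dicts, one per (question, rater) pair,
    mapping model name -> rank, where a lower rank means preferred.
    """
    tally = Counter()
    for r in rankings:
        if r[model_a] < r[model_b]:
            tally["win"] += 1
        elif r[model_a] > r[model_b]:
            tally["lose"] += 1
        else:
            tally["tie"] += 1
    n = sum(tally.values())
    return {k: tally[k] / n for k in ("win", "lose", "tie")}

# Toy example: three hypothetical (question, rater) rankings.
rankings = [
    {"PathChat": 1, "GPT-4": 2},
    {"PathChat": 2, "GPT-4": 1},
    {"PathChat": 1, "GPT-4": 1},  # equal ranks count as a tie
]
rates = head_to_head(rankings, "PathChat", "GPT-4")
print(rates)
```

Repeating this tally for each pathologist and taking the median across raters yields the median win rates reported in the findings.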

Statistical reporting: Accuracy with 95% confidence intervals was reported for multiple-choice tasks; win/lose/tie head-to-head rates across models were summarized for open-ended rankings. P-values (P < 0.001 where noted) were used to assess significance of performance differences.
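As a rough illustration of the accuracy-with-CI reporting, the helper below computes a point accuracy with a normal-approximation (Wald) 95% interval. The paper's exact CI method may differ (e.g., bootstrapping), and the counts in the usage line are hypothetical, chosen only to land near the reported image-only accuracy.

```python
import math

def accuracy_with_ci(correct, total, z=1.96):
    """Point accuracy with a normal-approximation (Wald) 95% CI,
    clipped to [0, 1]. Assumes independent binary outcomes."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Hypothetical counts for illustration: 83 correct of 105 questions.
acc, lo, hi = accuracy_with_ci(83, 105)
print(f"{acc:.3f} ({lo:.3f}-{hi:.3f})")  # 0.790 (0.713-0.868)
```

For benchmark sizes this small, an exact or bootstrapped interval would be a reasonable alternative to the normal approximation.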

Key Findings
  • Instruction dataset and pretraining: Curated 456,916 instructions (999,292 turns) and 1.18 million image–caption pairs; vision encoder pretrained on ~100 million patches from ~100,000 slides.
  • Multiple-choice diagnostics: PathChat outperformed LLaVA-1.5 and LLaVA-1.6 in both settings. Image-only accuracy on the combined benchmark: 79.1% (P < 0.001 versus both baselines). With clinical context, accuracy improved to 89.5% (P < 0.001 versus both baselines). Adding clinical context improved PathChat accuracy on PathQABench-Private by +11.3% and on PathQABench-Public by +11.6%.
  • Dependence on visual input: When the image was not provided (clinical context only), performance dropped substantially, indicating that predictive power derives primarily from visual features.
  • Open-ended questions (260 total): Seven blinded pathologists ranked responses. Against GPT-4, PathChat’s median head-to-head win rate was 56.5% (lose 23.3%, tie 21.2%); against an open-source multimodal baseline (LLaVA), the win rate was 67.7% (lose 11.2%, tie 21.5%). PathChat had the highest mean win rate across models (71.0%).
  • Open-ended accuracy (consensus subset, n = 235): PathChat achieved 78.5% accuracy; GPT-4 achieved 70.8%; a publicly available baseline (LLaVA-Med) achieved 52.3%.
  • Subgroup analyses: For categories requiring image examination (microscopy and diagnosis), PathChat showed strong head-to-head win rates vs GPT-4 (median 70.6% and 71.3%, respectively) and low lose rates (13.8%). Reported accuracies on the consensus subset were 73.3% (microscopy) and 78.5% (diagnosis) for PathChat versus 82.2% and 31.6% for GPT-4.
  • Usability: Demonstrated multi-turn interactive use cases (e.g., proposing additional IHC/molecular testing, integrating clinical context), indicating potential for human-in-the-loop workflows in complex diagnostics and education.

Discussion

PathChat demonstrates that a pathology-specialized multimodal assistant can integrate histology image understanding with natural language reasoning to deliver high diagnostic accuracy, detailed morphological descriptions, and clinically relevant suggestions. Across both multiple-choice and open-ended evaluations, it compared favorably to GPT-4 and surpassed open-source multimodal baselines, particularly in tasks that depend on histomorphologic interpretation. The system’s ability to incorporate clinical context further improves performance, reflecting real-world diagnostic workflows. These results suggest that multimodal generalist assistants tailored to pathology could be impactful in education, research, and clinical decision support. However, careful alignment with human intent, safeguards against hallucinations, and rigorous validation will be essential for safe deployment.

Conclusion

This work introduces PathChat, a multimodal generative AI copilot for human pathology that couples a large-scale pathology vision encoder with a 13B-parameter LLM and is instruction-tuned on a large pathology-focused corpus. PathChat achieves state-of-the-art performance on multiple-choice diagnostics and generates pathologist-preferred responses on open-ended tasks, outperforming open-source baselines and comparing favorably to GPT-4, especially for image-dependent reasoning. The study provides a comprehensive evaluation framework and demonstrates potential applications in education, research, and human-in-the-loop clinical workflows. Future directions include support for whole-slide image inputs, continual updates to reflect evolving guidelines and terminology, retrieval augmentation with up-to-date knowledge bases, explicit support for fine-grained localization and counting tasks, and integration with digital slide viewers and EHR systems.

Limitations
  • Evaluation and training data are retrospective and may contain outdated terminology or guidelines, risking temporally inconsistent outputs.
  • Current model does not natively process entire gigapixel whole-slide images; evaluations rely on preselected ROIs, which may limit context.
  • Some clinical and ancillary testing questions are primarily knowledge-retrieval tasks where larger general-purpose models (e.g., GPT-4) may excel; PathChat lagged somewhat in these categories.
  • Potential for hallucinations and misinterpretation in complex or ambiguous cases; further alignment (e.g., RLHF) and guardrails are needed.
  • Real-world deployment requires implementation research, validation for consistency and reproducibility, and mechanisms to detect and abstain from invalid or non-pathology queries.
  • Occasional model refusals or guardrail-triggered behaviors (noted for GPT-4 during benchmarking) indicate operational constraints that can affect comparative evaluations.