Towards building multilingual language model for medicine

P. Qiu, C. Wu, et al.

Discover advancements in multilingual medical language modeling as researchers from Shanghai Jiao Tong University unveil a model that outperforms existing open-source frameworks. With a large multilingual medical corpus and a rationale-augmented question-answering benchmark, this study pushes the boundaries of healthcare AI.
Introduction

Large language models show strong potential in healthcare, with closed-source systems (e.g., GPT-4, Med-PaLM 2) passing medical exams, and open-source models (e.g., MEDITRON, PMC-LLaMA, MedAlpaca, ChatDoctor) improving rapidly. However, most medical LLM research and resources center on English, limiting global impact. General multilingual LLMs (e.g., BLOOM, InternLM 2) still underperform on non-English medical tasks due to sparse medical content in their pretraining data. This work addresses the gap by building an open-source multilingual medical corpus to adapt LLMs, creating a multilingual medical QA benchmark with rationales to monitor progress, and training/evaluating models adapted on the new corpus.

Literature Review

The paper situates its work within: (1) closed-source LLMs (GPT-3.5, GPT-4, Gemini-1.0 Pro, Flan-PaLM/Med-PaLM 2) that achieve strong medical QA performance but lack transparency and training accessibility; (2) open-source medical LLMs (MEDITRON, PMC-LLaMA, MedAlpaca, ChatDoctor) largely focused on English; and (3) open-source multilingual LLMs (BLOOM, InternLM/InternLM 2, Mistral, Llama 2/3, BioMistral, Gemma) that, despite multilingual training, perform suboptimally on medical queries due to under-representation of medical text. Prior benchmarks are predominantly monolingual; multilingual medical QA resources are fragmented (MedQA, IgakuQA, FrenchMedMCQA, RuMedDaNet, Head-QA). The authors identify the need for a large-scale multilingual medical corpus and a unified multilingual medical benchmark with rationales to drive and measure progress.

Methodology

Data construction: MMedC is a large-scale multilingual medical corpus of ~25.5B tokens across six languages (English, Chinese, Japanese, French, Russian, Spanish), compiled from four sources: (i) medical-related content filtered from a general multilingual web corpus (CulturaX/CommonCrawl) using a rule-based pipeline with 200 medical keywords per language and two metrics, Medical Keyword Count (MKC) and keyword density (DENS), with language-specific thresholds; manual spot-checking of 100 sentences per language indicated ~98% medical relevance; (ii) OCR-extracted content from >20,000 medical textbooks (PaddleOCR), with non-content pages removed; approximate token yields include 4B (English), 1.1B (Chinese), 0.4B (Russian), and 0.3B (French); (iii) curated medical websites (encyclopedias, consultation/popular-science sites, news) capturing authoritative and up-to-date content (e.g., ~0.1B Japanese tokens, ~0.05B Spanish, ~0.1M French); (iv) existing corpora (Wikipedia and Baidu Baike filtered for medical entries; the UFAL Medical Corpus).

Auto-regressive training: Existing LLMs (InternLM, BLOOM, Llama 3, etc.) are further pre-trained on MMedC with a next-token prediction objective. Text is chunked into 2048-token segments with a 512-token overlap; training uses FSDP, BF16, and gradient checkpointing, with a typical global batch size of 512 and learning rates around 2e-5 (InternLM) or 8e-6 (BLOOM), run on 8×A100 GPUs for 20k iterations (~20 days).

Benchmark construction (MMedBench): Multilingual medical MCQ datasets are aggregated across the six languages (MedQA for EN/ZH, IgakuQA for JA, FrenchMedMCQA for FR, RuMedDaNet for RU re-framed as binary choice, Head-QA for ES), totaling 53,566 QA pairs split into 45,048 train and 8,518 test.

Rationale augmentation: For each QA pair, GPT-4 generates a rationale given the question, options, and correct answer under language-specific prompting; a rationale test subset of 1,200 items was human-verified by three annotators, with 1,136 passing (94.7%).

Topic labeling: GPT-4 assigns questions to 21 medical subjects (e.g., Internal Medicine, Biochemistry, Pharmacology, Psychiatry), with clinician verification.

Evaluation settings: zero-shot (closed-source models via API), parameter-efficient fine-tuning (LoRA/PEFT), and full fine-tuning for open-source models. Metrics: MCQ accuracy and rationale similarity via BLEU, ROUGE, and BERTScore; relative human and GPT-4 ratings were also collected (50 samples per language across six models) to assess correlations with the automatic metrics.

Implementation details for fine-tuning: maximum sequence length 2048; full-model fine-tuning with FSDP, BF16, and gradient checkpointing, global batch size 128, learning rate 1e-6; LoRA with rank 16 and otherwise similar settings.
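
To make the rule-based filtering step concrete, here is a minimal sketch of how an MKC/DENS filter can work. The keyword set and thresholds below are toy placeholders (the paper uses ~200 curated keywords per language with language-specific thresholds), and the exact definitions used in the original pipeline may differ in detail.

```python
# Minimal sketch of keyword-based medical filtering (not the authors' exact pipeline).
# MEDICAL_KEYWORDS, MKC_THRESHOLD, and DENS_THRESHOLD are illustrative placeholders.

MEDICAL_KEYWORDS = {"diagnosis", "therapy", "symptom", "clinical", "dosage"}  # toy English list
MKC_THRESHOLD = 2      # minimum number of distinct medical keywords
DENS_THRESHOLD = 0.02  # minimum fraction of tokens that are medical keywords

def is_medical(text: str) -> bool:
    """Keep a document if it contains enough medical keywords (MKC)
    and those keywords are dense enough relative to its length (DENS)."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = [t for t in tokens if t in MEDICAL_KEYWORDS]
    mkc = len(set(hits))            # Medical Keyword Count: distinct keywords present
    dens = len(hits) / len(tokens)  # keyword density
    return mkc >= MKC_THRESHOLD and dens >= DENS_THRESHOLD

if __name__ == "__main__":
    doc = "The clinical diagnosis guided the therapy and dosage adjustments."
    print(is_medical(doc))  # True for this toy example
```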
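
The chunking scheme for auto-regressive training (2048-token windows with a 512-token overlap) can be sketched as follows; the tokenizer checkpoint is a placeholder rather than the backbone actually used.

```python
# Sketch of 2048-token chunking with 512-token overlap for next-token-prediction
# pre-training; the tokenizer below is a placeholder, not the authors' exact setup.
from transformers import AutoTokenizer

CHUNK_LEN = 2048
OVERLAP = 512
STRIDE = CHUNK_LEN - OVERLAP  # each chunk starts 1536 tokens after the previous one

def chunk_ids(ids: list[int]) -> list[list[int]]:
    """Split one tokenized document into overlapping chunks of at most CHUNK_LEN tokens."""
    chunks = []
    start = 0
    while start < len(ids):
        chunks.append(ids[start:start + CHUNK_LEN])
        if start + CHUNK_LEN >= len(ids):
            break
        start += STRIDE
    return chunks

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
ids = tokenizer("A long medical document ...", add_special_tokens=False)["input_ids"]
print([len(c) for c in chunk_ids(ids)])
```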
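
A hedged sketch of the LoRA setup with the Hugging Face peft library is shown below; only the rank (16) is taken from the paper, while the alpha, dropout, target modules, and backbone checkpoint are illustrative assumptions.

```python
# Sketch of parameter-efficient (LoRA) fine-tuning with Hugging Face peft.
# Only the rank (16) comes from the paper; the other values and the backbone
# checkpoint are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backbone
lora_cfg = LoraConfig(
    r=16,                       # LoRA rank reported in the paper
    lora_alpha=32,              # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["c_attn"],  # attention projection for this placeholder backbone
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```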

Key Findings
  • Corpus and benchmark: MMedC compiles ~25.5B tokens across six languages (English ~42% share, Russian ~7%); MMedBench contains 53,566 QA pairs across 21 medical topics with 45,048 train and 8,518 test; rationale subset: 1,136 human-verified; avg rationale length ~200 tokens; ~9–10% multi-answer questions in test.
  • Multilingual MCQ accuracy (MMedBench): After full fine-tuning, strong general LLMs outperform earlier baselines; Llama 3 (8B) averages 62.79. Additional auto-regressive training on MMedC improves results substantially: MMed-Llama 3 (8B) reaches 67.75 average across six languages, surpassing all open-source models evaluated, approaching GPT-4 zero-shot (74.27). Under PEFT, similar gains are observed (e.g., InternLM 2 vs. MMedLM 2; Llama 3 vs. MMed-Llama 3).
  • Rationale generation: Across BLEU-1/ROUGE-1, models trained on MMedC improve rationale similarity; full fine-tuning results show MMedLM 2 achieving the best average rationale scores among compared models (BLEU-1/ROUGE-1 Avg 48.81/45.29), with MMed-Llama 3 close behind (47.21/43.29). Correlation analysis indicates that GPT-4-as-judge rankings correlate best with human ratings (Kendall r=0.660) and that, among absolute metrics, BERTScore correlates best (r=0.538), suggesting it as the preferred automatic metric for rationale assessment (a toy correlation sketch follows this list).
  • Human and GPT-4 ratings: MMed-Llama 3 attains the highest relative human rating (4.10) and GPT-4 rating (4.73) among six evaluated models, aligning with automatic metric trends.
  • English benchmarks: MMed-Llama 3 (8B) achieves state-of-the-art open-source performance without special prompting: MedQA 65.4, MedMCQA 63.5, PubMedQA 80.1, and MMLU-Med average 72.59, surpassing GPT-3.5’s 67.69 and outperforming prior open-source models in most tasks.
  • Ablations: Adding rationales during SFT improves MCQ accuracy by ~2.3–4.1 points depending on backbone; English-only MMedC training can overfit English and harm other languages; multilingual MMedC training alleviates this and boosts accuracy and rationale metrics; adding automatically filtered US-Data further lifts accuracy (e.g., MMed-Llama 3 from 64.40 to 67.75) and rationale scores.
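
The correlation analysis between automatic metrics and human ratings can be reproduced in outline with SciPy's Kendall's tau; the scores below are invented toy numbers, not the paper's data.

```python
# Toy sketch of the metric-vs-human correlation analysis using Kendall's tau;
# the score lists are invented placeholders, not the paper's actual ratings.
from scipy.stats import kendalltau

human_ratings    = [4.1, 3.2, 2.8, 3.9, 1.7, 2.5]    # e.g., one rating per model
bertscore_scores = [0.88, 0.81, 0.79, 0.86, 0.70, 0.77]

tau, p_value = kendalltau(human_ratings, bertscore_scores)
print(f"Kendall correlation: {tau:.3f} (p={p_value:.3f})")
```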
Discussion

The study demonstrates that targeted auto-regressive pretraining on a large multilingual medical corpus (MMedC) equips general LLMs with domain-specific multilingual medical knowledge, yielding substantial gains in both MCQ accuracy and rationale generation across six languages. These gains address the initial challenge that general multilingual LLMs underperform on non-English medical tasks due to sparse medical content in their pretraining data. The newly curated MMedBench enables standardized multilingual evaluation on both answers and rationales, revealing that rationale supervision during fine-tuning improves both reasoning quality and answer accuracy. Correlation analyses support using BERTScore as a practical automatic metric for rationale assessment when human evaluation is infeasible, with GPT-4-as-judge closely mirroring human preferences. Overall, stronger backbones and richer, higher-quality multilingual medical data synergize to advance multilingual medical QA, narrowing the gap to top closed-source models while retaining openness and reproducibility.

Conclusion

This work introduces three key assets for multilingual medical NLP: (1) MMedC, a 25.5B-token multilingual medical corpus across six languages built from filtered web data, textbooks, medical websites, and existing corpora; (2) MMedBench, a multilingual medical QA benchmark with human-verified rationales; and (3) a family of adapted models, culminating in MMed-Llama 3 (8B), which achieves leading open-source performance on MMedBench and competitive results on English benchmarks. The study shows that multilingual, domain-specific pretraining plus rationale-supervised fine-tuning significantly enhances both accuracy and reasoning. The authors release datasets (with license caveats for books), code, models, and human rating data to facilitate further research. Future directions include expanding to more languages (e.g., German, Arabic), incorporating retrieval-augmented architectures to mitigate hallucinations, scaling to larger models, and strengthening safety and bias controls.

Limitations
  • Potential biases from web-crawled data may disadvantage underrepresented populations; stronger safety/bias controls are needed.
  • Explainability remains limited; while rationale generation helps, deeper interpretability of LLM architectures is under-explored.
  • Language coverage is limited to six major languages; many languages remain unsupported and require expansion strategies for low-resource settings.
  • Computational constraints limit the final model to 8B parameters; larger models with retrieval augmentation could improve performance and reduce hallucinations.
  • Licensing restricts redistribution of some book and website data; only uncopyrighted portions of MMedC are publicly released (book list provided for independent acquisition).