Towards building multilingual language model for medicine

P. Qiu, C. Wu, et al.

Discover advancements in multilingual medical language modeling as researchers from Shanghai Jiao Tong University unveil a model that outperforms existing open-source frameworks. With a large multilingual medical corpus and a rationale-augmented question-answering benchmark, this study pushes the boundaries of healthcare AI.
Introduction

Large language models show strong potential in healthcare, with closed-source systems (e.g., GPT-4, Med-PaLM 2) passing medical exams, and open-source models (e.g., MEDITRON, PMC-LLaMA, MedAlpaca, ChatDoctor) improving rapidly. However, most medical LLM research and resources center on English, limiting global impact. General multilingual LLMs (e.g., BLOOM, InternLM 2) still underperform on non-English medical tasks due to sparse medical content in their pretraining data. This work addresses the gap by building an open-source multilingual medical corpus to adapt LLMs, creating a multilingual medical QA benchmark with rationales to monitor progress, and training/evaluating models adapted on the new corpus.

Literature Review

The paper situates its work within: (1) closed-source LLMs (GPT-3.5, GPT-4, Gemini-1.0 Pro, Flan-PaLM/Med-PaLM 2) that achieve strong medical QA performance but lack transparency and training accessibility; (2) open-source medical LLMs (MEDITRON, PMC-LLaMA, MedAlpaca, ChatDoctor) largely focused on English; and (3) open-source multilingual LLMs (BLOOM, InternLM/InternLM 2, Mistral, Llama 2/3, BioMistral, Gemma) that, despite multilingual training, perform suboptimally on medical queries due to under-representation of medical text. Prior benchmarks are predominantly monolingual; multilingual medical QA resources are fragmented (MedQA, IgakuQA, FrenchMedMCQA, RuMedDaNet, Head-QA). The authors identify the need for a large-scale multilingual medical corpus and a unified multilingual medical benchmark with rationales to drive and measure progress.

Methodology

Data construction: MMedC is a large-scale multilingual medical corpus of ~25.5B tokens across six languages (English, Chinese, Japanese, French, Russian, Spanish), compiled from four sources: (i) medical-related content filtered from a general multilingual web corpus (CulturaX/CommonCrawl) using a rule-based pipeline with 200 medical keywords per language and two metrics, Medical Keyword Count (MKC) and keyword density (DENS), with language-specific thresholds; manual spot-checking of 100 sentences per language indicated ~98% medical relevance; (ii) OCR-extracted content from >20,000 medical textbooks (PaddleOCR), with non-content pages removed; approximate token yields include 4B (English), 1.1B (Chinese), 0.4B (Russian), and 0.3B (French); (iii) curated medical websites (encyclopedias, consultation/popular-science sites, news) capturing authoritative and up-to-date content (e.g., ~0.1B Japanese tokens, ~0.05B Spanish, ~0.1M French); (iv) existing corpora (Wikipedia and Baidu Baike filtered for medical entries; the UFAL Medical Corpus).

Auto-regressive training: Existing LLMs (InternLM, BLOOM, Llama 3, etc.) are further pre-trained on MMedC with a next-token prediction objective. Text is chunked into 2048-token segments with a 512-token overlap; training uses FSDP, BF16, and gradient checkpointing, with a typical global batch size of 512 and learning rates around 2e-5 (InternLM) or 8e-6 (BLOOM), run on 8×A100 GPUs for 20k iterations (~20 days).

Benchmark construction (MMedBench): Multilingual medical MCQ datasets are aggregated across the six languages (MedQA for EN/ZH, IgakuQA for JA, FrenchMedMCQA for FR, RuMedDaNet for RU re-framed as binary choice, Head-QA for ES), totaling 53,566 QA pairs split into 45,048 train and 8,518 test.

Rationale augmentation: For each QA pair, GPT-4 generates a rationale given the question, options, and correct answer under language-specific prompting; a rationale test subset of 1,200 items was human-verified by three annotators, with 1,136 passing (94.7%).

Topic labeling: GPT-4 assigns questions to 21 medical subjects (e.g., Internal Medicine, Biochemistry, Pharmacology, Psychiatry), with clinician verification.

Evaluation settings: zero-shot (closed-source models via API), parameter-efficient fine-tuning (LoRA/PEFT), and full fine-tuning for open-source models. Metrics: MCQ accuracy and rationale similarity via BLEU, ROUGE, and BERTScore; relative human and GPT-4 ratings were also collected (50 samples per language across six models) to assess correlations with the automatic metrics.

Implementation details for fine-tuning: maximum sequence length 2048; full-model fine-tuning with FSDP, BF16, and gradient checkpointing, global batch size 128, learning rate 1e-6; LoRA with rank 16 and otherwise similar settings.
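
To make the rule-based filtering step concrete, here is a minimal sketch of how an MKC/DENS filter can work. The keyword set and thresholds below are toy placeholders (the paper uses ~200 curated keywords per language with language-specific thresholds), and the exact definitions used in the original pipeline may differ in detail.

```python
# Minimal sketch of keyword-based medical filtering (not the authors' exact pipeline).
# MEDICAL_KEYWORDS, MKC_THRESHOLD, and DENS_THRESHOLD are illustrative placeholders.

MEDICAL_KEYWORDS = {"diagnosis", "therapy", "symptom", "clinical", "dosage"}  # toy English list
MKC_THRESHOLD = 2      # minimum number of distinct medical keywords
DENS_THRESHOLD = 0.02  # minimum fraction of tokens that are medical keywords

def is_medical(text: str) -> bool:
    """Keep a document if it contains enough medical keywords (MKC)
    and those keywords are dense enough relative to its length (DENS)."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = [t for t in tokens if t in MEDICAL_KEYWORDS]
    mkc = len(set(hits))            # Medical Keyword Count: distinct keywords present
    dens = len(hits) / len(tokens)  # keyword density
    return mkc >= MKC_THRESHOLD and dens >= DENS_THRESHOLD

if __name__ == "__main__":
    doc = "The clinical diagnosis guided the therapy and dosage adjustments."
    print(is_medical(doc))  # True for this toy example
```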
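
The chunking scheme for auto-regressive training (2048-token windows with a 512-token overlap) can be sketched as follows; the tokenizer checkpoint is a placeholder rather than the backbone actually used.

```python
# Sketch of 2048-token chunking with 512-token overlap for next-token-prediction
# pre-training; the tokenizer below is a placeholder, not the authors' exact setup.
from transformers import AutoTokenizer

CHUNK_LEN = 2048
OVERLAP = 512
STRIDE = CHUNK_LEN - OVERLAP  # each chunk starts 1536 tokens after the previous one

def chunk_ids(ids: list[int]) -> list[list[int]]:
    """Split one tokenized document into overlapping chunks of at most CHUNK_LEN tokens."""
    chunks = []
    start = 0
    while start < len(ids):
        chunks.append(ids[start:start + CHUNK_LEN])
        if start + CHUNK_LEN >= len(ids):
            break
        start += STRIDE
    return chunks

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
ids = tokenizer("A long medical document ...", add_special_tokens=False)["input_ids"]
print([len(c) for c in chunk_ids(ids)])
```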
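
A hedged sketch of the LoRA setup with the Hugging Face peft library is shown below; only the rank (16) is taken from the paper, while the alpha, dropout, target modules, and backbone checkpoint are illustrative assumptions.

```python
# Sketch of parameter-efficient (LoRA) fine-tuning with Hugging Face peft.
# Only the rank (16) comes from the paper; the other values and the backbone
# checkpoint are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backbone
lora_cfg = LoraConfig(
    r=16,                       # LoRA rank reported in the paper
    lora_alpha=32,              # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["c_attn"],  # attention projection for this placeholder backbone
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```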

Key Findings
  • Corpus and benchmark: MMedC compiles ~25.5B tokens across six languages (English ~42% share, Russian ~7%); MMedBench contains 53,566 QA pairs across 21 medical topics with 45,048 train and 8,518 test; rationale subset: 1,136 human-verified; avg rationale length ~200 tokens; ~9–10% multi-answer questions in test.
  • Multilingual MCQ accuracy (MMedBench): After full fine-tuning, strong general LLMs outperform earlier baselines; Llama 3 (8B) averages 62.79. Additional auto-regressive training on MMedC improves results substantially: MMed-Llama 3 (8B) reaches 67.75 average across six languages, surpassing all open-source models evaluated, approaching GPT-4 zero-shot (74.27). Under PEFT, similar gains are observed (e.g., InternLM 2 vs. MMedLM 2; Llama 3 vs. MMed-Llama 3).
  • Rationale generation: Across BLEU-1/ROUGE-1, models trained on MMedC improve rationale similarity; full fine-tuning results show MMedLM 2 achieving the best average rationale scores among compared models (BLEU-1/ROUGE-1 Avg 48.81/45.29), with MMed-Llama 3 close behind (47.21/43.29). Correlation analysis indicates that GPT-4-as-judge rankings correlate best with human ratings (Kendall r=0.660) and that, among absolute metrics, BERTScore correlates best (r=0.538), suggesting it as the preferred automatic metric for rationale assessment (a toy correlation sketch follows this list).
  • Human and GPT-4 ratings: MMed-Llama 3 attains the highest relative human rating (4.10) and GPT-4 rating (4.73) among six evaluated models, aligning with automatic metric trends.
  • English benchmarks: MMed-Llama 3 (8B) achieves state-of-the-art open-source performance without special prompting: MedQA 65.4, MedMCQA 63.5, PubMedQA 80.1, and MMLU-Med average 72.59, surpassing GPT-3.5’s 67.69 and outperforming prior open-source models in most tasks.
  • Ablations: Adding rationales during SFT improves MCQ accuracy by ~2.3–4.1 points depending on backbone; English-only MMedC training can overfit English and harm other languages; multilingual MMedC training alleviates this and boosts accuracy and rationale metrics; adding automatically filtered US-Data further lifts accuracy (e.g., MMed-Llama 3 from 64.40 to 67.75) and rationale scores.
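
The correlation analysis between automatic metrics and human ratings can be reproduced in outline with SciPy's Kendall's tau; the scores below are invented toy numbers, not the paper's data.

```python
# Toy sketch of the metric-vs-human correlation analysis using Kendall's tau;
# the score lists are invented placeholders, not the paper's actual ratings.
from scipy.stats import kendalltau

human_ratings    = [4.1, 3.2, 2.8, 3.9, 1.7, 2.5]    # e.g., one rating per model
bertscore_scores = [0.88, 0.81, 0.79, 0.86, 0.70, 0.77]

tau, p_value = kendalltau(human_ratings, bertscore_scores)
print(f"Kendall correlation: {tau:.3f} (p={p_value:.3f})")
```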
Discussion

The study demonstrates that targeted auto-regressive pretraining on a large multilingual medical corpus (MMedC) equips general LLMs with domain-specific multilingual medical knowledge, yielding substantial gains in both MCQ accuracy and rationale generation across six languages. These gains address the initial challenge that general multilingual LLMs underperform on non-English medical tasks due to sparse medical content in their pretraining data. The newly curated MMedBench enables standardized multilingual evaluation on both answers and rationales, revealing that rationale supervision during fine-tuning improves both reasoning quality and answer accuracy. Correlation analyses support using BERTScore as a practical automatic metric for rationale assessment when human evaluation is infeasible, with GPT-4-as-judge closely mirroring human preferences. Overall, stronger backbones and richer, higher-quality multilingual medical data synergize to advance multilingual medical QA, narrowing the gap to top closed-source models while retaining openness and reproducibility.

Conclusion

This work introduces three key assets for multilingual medical NLP: (1) MMedC, a 25.5B-token multilingual medical corpus across six languages built from filtered web data, textbooks, medical websites, and existing corpora; (2) MMedBench, a multilingual medical QA benchmark with human-verified rationales; and (3) a family of adapted models, culminating in MMed-Llama 3 (8B), which achieves leading open-source performance on MMedBench and competitive results on English benchmarks. The study shows that multilingual, domain-specific pretraining plus rationale-supervised fine-tuning significantly enhances both accuracy and reasoning. The authors release datasets (with license caveats for books), code, models, and human rating data to facilitate further research. Future directions include expanding to more languages (e.g., German, Arabic), incorporating retrieval-augmented architectures to mitigate hallucinations, scaling to larger models, and strengthening safety and bias controls.

Limitations
  • Potential biases from web-crawled data may disadvantage underrepresented populations; stronger safety/bias controls are needed.
  • Explainability remains limited; while rationale generation helps, deeper interpretability of LLM architectures is under-explored.
  • Language coverage is limited to six major languages; many languages remain unsupported and require expansion strategies for low-resource settings.
  • Computational constraints limit the final model to 8B parameters; larger models with retrieval augmentation could improve performance and reduce hallucinations.
  • Licensing restricts redistribution of some book and website data; only uncopyrighted portions of MMedC are publicly released (book list provided for independent acquisition).