Linguistics and Languages

Scaling neural machine translation to 200 languages

M. R. Costa-jussà, J. Cross, et al.

No Language Left Behind (NLLB-200) is a multilingual translation model developed by a team of researchers including Marta R. Costa-jussà and James Cross. It improves translation quality by 44% across 200 languages through new data mining techniques and architectural changes. The research is available for non-commercial use.

Introduction

Neural machine translation has advanced rapidly, but the benefits have concentrated on high-resource languages because training requires large volumes of parallel data. The NLLB-200 project addresses a central research question: can massively multilingual models represent hundreds of languages, including many low-resource ones, without sacrificing translation quality? The approach relies on cross-lingual transfer, which lets related languages learn from each other, while minimizing interference among unrelated languages. The final model supports 200 languages, with three times as many low-resource as high-resource languages, and demonstrates that the number of supported languages can be doubled while maintaining or improving quality. The paper outlines the data-gathering, modeling, and evaluation techniques used to achieve these results and assesses their societal impact. All resources are released for non-commercial use.

Literature Review

The work builds on prior research showing multilingual NMT can outperform bilingual models and enable cross-lingual transfer and zero-shot translation. It references large-scale multilingual efforts (e.g., M2M-100) and benchmarks like FLORES-101 and WMT that catalyzed progress in specific directions. For language identification (LID), previous public systems include CLD3, langid.py, and LangDetect, as well as broader-coverage LID efforts using found web data; however, these suffer from domain imbalance and challenges in similar-language disambiguation and massively multilingual scaling. For data acquisition, prior approaches mine parallel data from web-scale corpora (e.g., ParaCrawl, CCMatrix) and leverage monolingual data via back-translation. For modeling, conditional computation and sparsely gated mixture-of-experts (MoE) architectures (e.g., GShard, BASE layers, GLaM, ST-MoE) provide a foundation for scaling models efficiently while mitigating interference. The paper positions NLLB-200 as advancing these lines through improved LID tuned on FLORES-200, a teacher–student extension of LASER embeddings to many low-resource languages for mining, and MoE-specific regularization and curricula to reduce overfitting.

Methodology

Data creation and mining:

Monolingual data collection at scale using an enhanced LID system: a fastText-based classifier trained on FLORES-200 with hand-written linguistic rules to reduce spurious correlations and improve recognition across 200 languages. Performance was compared against CLD3, langid.py, and LangDetect on intersecting language sets, optimizing both micro-F1 and micro-FPR.
Bitext mining via multilingual sentence embeddings: extend the LASER embedding space to all NLLB-200 languages using a teacher–student approach, training students individually or within language families to handle data imbalances and avoid recomputing embeddings for entire corpora. This enabled mining parallel sentences from web sources (e.g., Common Crawl), yielding over 1.1 billion new sentence pairs across 148 languages. Back-translation was used as additional augmentation.
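
The mining step described above can be illustrated with a minimal sketch. It assumes sentence embeddings are already available as plain lists of floats and pairs sentences that are mutual nearest neighbours under cosine similarity. The function names (`cosine`, `mine_pairs`), the mutual-nearest-neighbour rule, and the 0.9 threshold are illustrative assumptions; the summary does not specify the scoring function or thresholds the authors used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    A pair is kept only if each sentence is the other's best match and
    the cosine score reaches the (assumed) threshold.
    """
    if not src_embs or not tgt_embs:
        return []
    # Best target index for every source sentence.
    fwd = [max(range(len(tgt_embs)), key=lambda j: cosine(s, tgt_embs[j]))
           for s in src_embs]
    # Best source index for every target sentence.
    bwd = [max(range(len(src_embs)), key=lambda i: cosine(src_embs[i], t))
           for t in tgt_embs]
    pairs = []
    for i, j in enumerate(fwd):
        score = cosine(src_embs[i], tgt_embs[j])
        if bwd[j] == i and score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

In a shared embedding space, a newly trained student encoder only has to embed its own language; existing embeddings for other languages can be reused as-is, which is the property the teacher–student setup relies on.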

Modeling:

Multilingual conditional computation using sparsely gated Mixture-of-Experts (MoE) Transformers with MoE layers in encoder and decoder. Each MoE layer contains E feed-forward experts and a gating network routing tokens to top-2 experts, trained with a combination of label-smoothed cross-entropy and auxiliary load-balancing loss.
Overfitting mitigation for low-resource directions: introduce Expert Output Masking (EOM), a MoE-specific regularization method that outperforms overall dropout and gating dropout. Combine with curriculum learning (CL) that phases the introduction of language pairs based on empirical overfitting behavior (delaying pairs by K updates toward the end of training) to reduce overfitting while allowing continued learning for others.
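
As a rough illustration of top-2 routing with Expert Output Masking, the sketch below processes a single token in pure Python. It assumes a linear gate, renormalized top-2 gate weights, and that EOM drops each selected expert's contribution with probability `p_eom` during training only. These details, and the names `top2_moe` and `p_eom`, are assumptions made for illustration rather than the authors' implementation; the auxiliary load-balancing loss is omitted.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_moe(token, gate_weights, experts, p_eom=0.0, training=False, rng=random):
    """Route one token vector through its top-2 experts and mix the outputs.

    token:        list of floats
    gate_weights: one weight vector per expert (hypothetical linear gate)
    experts:      callables mapping a vector to a vector of the same length
    p_eom:        probability of masking a selected expert's output (training only)
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Indices of the two experts with the highest gate probability.
    top2 = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]
    norm = sum(probs[e] for e in top2)
    out = [0.0] * len(token)
    for e in top2:
        if training and rng.random() < p_eom:
            continue  # Expert Output Masking: skip this expert's contribution
        weight = probs[e] / norm
        expert_out = experts[e](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, top2
```

With `training=False` the masking branch never fires, so inference always uses both selected experts.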

Evaluation framework:

Automatic benchmark: FLORES-200, expanding FLORES-101 to cover 200 languages with a rigorous 4-phase workflow (alignment; translation with iterative QA; final QA; completion). Language sets pass a 90/100 human quality threshold; three splits (dev, devtest, test) are produced. Training data were filtered to exclude any overlap with evaluation sets via hashing.
Human evaluation: Cross-lingual Semantic Text Similarity (XSTS), a calibrated, meaning-focused protocol on a 5-point scale (3 as acceptability threshold). Common calibration items are included across annotators and languages, improving inter-evaluator consistency.
Toxicity evaluation and filtering: ETOX detects added toxicity (asymmetric toxic content across bitext pairs). A toxicity-aware bitext filtering option removes pairs with multiple instances of added toxicity to reduce misalignment-induced toxicity and improve translation performance.
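
Two of the filters described above can be sketched in a few lines. The first removes training pairs whose source or target hashes to the same value as an evaluation sentence; the second flags pairs where one side contains noticeably more wordlist matches than the other. The normalization steps, the choice of SHA-256, whitespace tokenization, the `min_imbalance=2` cutoff, and all function names are assumptions for illustration; the summary states only that hashing was used and that pairs with multiple instances of added toxicity were removed.

```python
import hashlib
import unicodedata

def normalize(sentence):
    """Apply NFKC, lowercase, and collapse whitespace (assumed normalization)."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return " ".join(text.split())

def sentence_hash(sentence):
    """Hex digest of the normalized sentence."""
    return hashlib.sha256(normalize(sentence).encode("utf-8")).hexdigest()

def filter_overlap(train_pairs, eval_sentences):
    """Drop training pairs whose source or target matches an evaluation sentence."""
    eval_hashes = {sentence_hash(s) for s in eval_sentences}
    return [
        (src, tgt)
        for src, tgt in train_pairs
        if sentence_hash(src) not in eval_hashes
        and sentence_hash(tgt) not in eval_hashes
    ]

def toxic_count(sentence, wordlist):
    """Number of whitespace tokens that appear in a (hypothetical) toxicity wordlist."""
    return sum(1 for tok in normalize(sentence).split() if tok in wordlist)

def has_multiple_added_toxicity(src, tgt, src_words, tgt_words, min_imbalance=2):
    """True when one side has at least `min_imbalance` more wordlist hits than the other."""
    diff = abs(toxic_count(src, src_words) - toxic_count(tgt, tgt_words))
    return diff >= min_imbalance
```

Exact-match hashing only catches verbatim overlap after normalization; near-duplicates would need a fuzzier comparison, which is outside this sketch.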

Key Findings

Language identification (LID): On FLORES-200 intersecting sets, the proposed LID achieved higher performance than CLD3, langid.py, and LangDetect, with micro-F1 up to 99.4 and micro-FPR as low as 0.0084, indicating stronger suitability for extracting low-resource languages from web corpora.
Bitext mining and data augmentation: Teacher–student LASER extension enabled mining >1.1 billion new parallel sentence pairs for 148 languages. Back-translation yielded substantial gains, e.g., +12.5 chrF++ for very low-resource to English.
MoE regularization and curriculum learning: Expert Output Masking (EOM) and curriculum learning (CL) reduced overfitting and improved low/very-low resource directions. On FLORES-200 dev:
• Average chrF++ improved from 47.6 (Baseline MoE) to 48.5 (MoE+CL) to 51.0 (MoE+CL+EOM).
• For very low-resource, xx→eng_Latn improved from 52.5 (Baseline) to 55.4 (MoE+CL) to 56.4 (MoE+CL+EOM); eng_Latn→xx very low improved from 39.0 to 39.5 to 41.2.
Overall MT quality: NLLB-200 outperformed the nearest state-of-the-art by ~+7.3 spBLEU on average, amounting to a 44% improvement in translation quality, despite supporting 200 languages and ~40,000 directions. On FLORES-101 subsets (87 languages), NLLB-200 surpassed M2M-100 and DeepNet; on 101 languages, NLLB-200 achieved 24.0 spBLEU average and 41.7 chrF++ average.
Human evaluation correlations: Calibrated XSTS correlates well with automated metrics—Spearman’s R of 0.710 (spBLEU), 0.687 (corpus chrF++), and 0.694 (avg sentence-level chrF++). Calibrated human scores correlate more strongly than uncalibrated ones across metrics.
Human quality levels: Across 55 evaluated directions (8 into English, 27 out of English, 20 other), the mean calibrated XSTS was 4.26; 38/55 scored >4.0 and 52/56 scored >3.0.
Toxicity filtering: In 10 English↔(Somali, Southern Sotho, Twi, Umbundu, Venetian) directions, filtering ~30% of mined bitext improved chrF++ by ~5% and reduced added toxicity (ETOX) by ~5%.
Real-world impact: Within four months of launch, NLLB-200 became the third most-used MT engine by Wikipedia editors (3.8% of published translations), with the lowest deletion rate (0.13%) and highest proportion of translations with <10% modification among compared services.
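
The correlation figures reported above are Spearman rank correlations between human scores and automatic metrics. The sketch below shows how such a coefficient is computed from two score lists, using average ranks for ties and the Pearson correlation of the ranks. The function names are hypothetical, and any scores passed to it here are made-up toy values, not data from the paper.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all items tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy example with invented per-direction scores (not from the paper).
xsts_scores = [3.1, 4.2, 4.5, 3.8]
spbleu_scores = [12.0, 30.0, 28.0, 25.0]
rho = spearman(xsts_scores, spbleu_scores)
```

Because only ranks matter, the coefficient is unaffected by the different scales of XSTS (1 to 5) and spBLEU or chrF++.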

Discussion

The study demonstrates that massively multilingual NMT can scale to 200 languages—including a majority of low-resource languages—without sacrificing quality by combining targeted data mining, robust LID, and conditional computation with MoE. Cross-lingual transfer is leveraged effectively while limiting interference via architectural choices (MoE) and training strategies (EOM regularization and curriculum learning). The expanded FLORES-200 benchmark and XSTS protocol provide reliable, calibrated evaluation across thousands of directions, and toxicity-aware filtering addresses safety and quality concerns typical in mined web data. The gains over prior SOTA across spBLEU and chrF++—alongside strong human evaluation scores—indicate the approach’s effectiveness and generality, with additional evidence of transferability to new domains via fine-tuning. These results support the feasibility of universal, high-quality translation systems spanning hundreds of languages and suggest practical benefits for low-resource communities.

Conclusion

NLLB-200 introduces a single, massively multilingual translation model covering 200 languages, enabled by: (1) improved LID tuned on FLORES-200; (2) large-scale mining via teacher–student LASER extensions; (3) MoE architectures with Expert Output Masking and curriculum learning to reduce overfitting; and (4) comprehensive evaluation with FLORES-200, calibrated human XSTS, and toxicity analysis (ETOX). The model delivers a ~44% average BLEU improvement over previous SOTA and robust human-rated quality across diverse directions. All benchmarks, data, code, and models are released for non-commercial use, facilitating further research and deployment for low-resource communities. Future work includes extending coverage and domain robustness, deepening safety and toxicity mitigation, enhancing calibration and cross-language consistency in human evaluation, and fostering interdisciplinary, diverse teams to address broader sociotechnical challenges. Policy measures in education, access, and digital literacy remain essential to fully realize equitable multilingual access online.

Limitations

Data availability and domain balance: Low-resource languages rely heavily on mined web data, which can be noisy, domain-imbalanced, and biased; professional translation resources are limited for many languages.
Mined bitext quality and toxicity: Misaligned pairs and asymmetric toxicity in training data can induce harmful outputs; while ETOX filtering helps, residual risks remain, and filtering can reduce data volume.
Overfitting risks: Despite EOM and curriculum learning, very low-resource pairs remain susceptible to overfitting; optimal curricula and regularization may vary by language pair.
Language identification challenges: Similar-language disambiguation and domain shifts in web data continue to pose difficulties, especially at massive scale.
Evaluation constraints: Even with FLORES-200 and XSTS, coverage and calibration cannot capture all domains and phenomena; correlations with automated metrics are strong but imperfect. Some reported human-evaluation counts show minor inconsistencies (e.g., 52/56 vs. 55 directions), suggesting potential reporting or sampling differences.
Generalizability across domains: While fine-tuning shows promise, out-of-domain performance can vary; broader domain adaptation and continual learning remain open areas.
Additional limitations are documented in Supplementary Information G.
