Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Computer Science

W. Chiang, L. Zheng, et al.

Chatbot Arena is an open, crowdsourced platform that evaluates large language models via pairwise human-preference comparisons, backed by over 240K votes and a rigorous statistical ranking framework. Research conducted by Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica.

Introduction
The paper addresses the challenge of evaluating LLMs in a way that reflects human preferences and real-world, open-ended usage. Existing evaluations are dominated by static, ground-truth-based benchmarks that are inexpensive and reproducible but limited: they are often closed-ended, subject to contamination and overfitting, and cannot adequately capture alignment with human preferences, especially for complex tasks lacking definitive ground truth. The authors propose an open, live evaluation platform based on human preference—Chatbot Arena—that collects diverse, fresh prompts via crowdsourcing and uses pairwise comparisons of anonymized model responses. The goal is to provide scalable, statistically principled rankings that better mirror real-world usage and user preferences.
Literature Review
The paper situates Chatbot Arena within LLM evaluation work along two axes: question source (static vs. live) and evaluation metric (ground truth vs. human preference). It reviews prevalent static, ground-truth benchmarks such as MMLU, HellaSwag, GSM-8K, BigBench, AGIEval, and HumanEval, as well as safety-focused and holistic evaluations (ToxicChat, HELM). Open-ended evaluations with human or LLM-as-judge (e.g., MT-Bench, AlpacaEval) are also discussed. Live benchmarks can use fresh questions from competitions (e.g., Codeforces) or human interactions, though prior live human-interaction studies are typically closed to specific organizations. The paper highlights risks of static benchmarks—contamination, saturation, overfitting, and misalignment with human preferences—and draws inspiration from DynaBench’s human-in-the-loop approach. It also connects to ranking system literature (Bradley–Terry, Elo, rank elicitation, online experiment design) and prior human preference datasets (OpenAssistant, HH-RLHF, LMSYS-Chat-1M, UltraFeedback, Nectar), noting Chatbot Arena uniquely combines open, large-scale, crowdsourced, live preference data suitable for ranking.
Methodology
Platform and data collection: Chatbot Arena hosts anonymous, randomized pairwise battles between two LLMs. Users enter any prompt (no preset questions) and receive two anonymized responses; they then vote for the preferred response, with options for a tie or "both are bad." Model identities are revealed only after voting. Users must accept terms of use permitting data release. To preserve anonymity, content filters remove mentions of model names and companies; OpenAI's moderation API flags unsafe content (≈3% of requests). The platform has run since April 2023, collecting ≈240K votes from ≈90K users across >100 languages for >50 models by January 2024.

Statistical framework for ranking: The data comprise sequential pairwise comparisons A_t with outcomes H_t (primarily binary). The goal is to estimate the win matrix θ*(a) = E[H_t | A_t = a] and derive a model score and ranking. Scores are based on the Bradley–Terry (BT) model, where P(model m beats m′) = 1/(1 + e^{ξ_{m′} − ξ_m}). BT coefficients ξ are estimated via reweighted maximum likelihood, with inverse-propensity weights 1/P(A_t) so the estimate targets a uniform distribution over pairs (a code sketch follows this section). Confidence intervals on ξ use robust "sandwich" standard errors (Huber–White), chosen over the pivot bootstrap because simulations showed tighter large-sample intervals. Approximate rankings are computed by constructing a simultaneous confidence set (chi-square CLT intervals with a multiplicity correction) and ordering models by non-overlapping intervals, controlling the probabilities of understating or overstating a model's rank.

Win matrix estimation and uncertainty: For pair a, define X_t(a) = H_t · 1{A_t = a} / P_t(a); the unbiased estimator is θ̂_T = (1/T) ∑_t X_t, with covariance estimated by the empirical second moment. Under regularity conditions with exploration (P_t(a) > ε > 0) and convergence of P_t, a CLT yields asymptotic normality, enabling confidence intervals for win rates and BT scores.

Active sampling: To improve sample efficiency, the system chooses pairs with probability proportional to the estimated reduction in confidence-interval width from one additional observation: P_t(a) ∝ sqrt(Σ̂_{t,a,a}/n_a) − sqrt(Σ̂_{t,a,a}/(n_a + 1)), where n_a is the current number of observations of pair a. This concentrates sampling on uncertain, similarly performing pairs and accelerates convergence while retaining statistical validity (a minimum level of exploration is maintained).

Anomalous user detection: Each user's sequence of votes is compared with the historical vote distribution for the corresponding model pair, yielding exchangeability-based p-values p_i computed from rank statistics against the historical votes for that pair. Fisher combination tests are then applied sequentially at randomly chosen vote counts j ∈ {1, …, 100} with a Bonferroni-style correction, flagging a user when M_j = −2 ∑_{i=1}^{j} log p_i exceeds χ²_{2j, 1−α/5}.

Topic modeling and prompt analysis: To assess the diversity and discriminative power of crowdsourced prompts, the authors build a BERTopic pipeline: prompts are embedded with OpenAI's text-embedding-3-small, reduced to 5 dimensions via UMAP, clustered with HDBSCAN (minimum cluster size 32), and labeled by GPT-4-Turbo (10 sampled prompts per cluster). This identifies 600 clusters across many domains (e.g., poetry, coding, math, medical). The authors analyze cluster sizes and similarities and use sampled prompts for targeted model comparisons (LLM-as-judge) and for constructing a challenging benchmark (Arena Bench) sampled across clusters to maximize topic coverage and difficulty.
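To make the ranking step concrete, here is a minimal sketch of fitting Bradley–Terry coefficients by reweighted maximum likelihood with inverse-propensity weights, as described above. It is an illustration, not the authors' released code: the battle-table layout (columns `model_a`, `model_b`, `winner`, `sampling_prob`) and the near-unpenalized logistic regression are assumptions for the example, and ties are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_bt_scores(battles: pd.DataFrame, models: list[str]) -> pd.Series:
    """Fit Bradley-Terry coefficients xi via weighted logistic regression.

    Assumed columns in `battles`:
      model_a, model_b : the two models shown in the battle
      winner           : "model_a" or "model_b" (ties dropped for simplicity)
      sampling_prob    : P_t(a), the probability this pair was sampled
    """
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    y = np.zeros(len(battles))
    w = np.zeros(len(battles))

    for r, (_, row) in enumerate(battles.iterrows()):
        # Design row encodes xi_a - xi_b: +1 for model_a, -1 for model_b,
        # so the logit matches P(a beats b) = 1 / (1 + exp(xi_b - xi_a)).
        X[r, idx[row.model_a]] = 1.0
        X[r, idx[row.model_b]] = -1.0
        y[r] = 1.0 if row.winner == "model_a" else 0.0
        # Inverse-propensity weight: rarely sampled pairs count more, so the
        # fit behaves as if pairs had been sampled uniformly.
        w[r] = 1.0 / row.sampling_prob

    # Large C approximates an unpenalized maximum-likelihood fit.
    lr = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
    lr.fit(X, y, sample_weight=w)
    return pd.Series(lr.coef_[0], index=models).sort_values(ascending=False)
```

The BT scale is identified only up to an additive constant, so leaderboards typically anchor one model or rescale the coefficients; tie and both-bad votes can be folded in, for instance as half-wins for each side.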
Implementation details: The platform supports >50 state-of-the-art models (proprietary and open), employs non-uniform sampling to focus on uncertain pairs, and regularly updates leaderboards. Data and code are intended to be open-sourced, with public releases of preference datasets.
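As a rough illustration of the non-uniform sampling mentioned above, the sketch below turns per-pair variance estimates into sampling probabilities proportional to the expected shrinkage of the confidence-interval width, mixed with a uniform floor to guarantee exploration. The inputs `sigma2` (the diagonal entries Σ̂_{t,a,a}) and the `eps` floor are assumptions for the example.

```python
import numpy as np

def pair_sampling_probs(sigma2: np.ndarray, counts: np.ndarray,
                        eps: float = 1e-3) -> np.ndarray:
    """Sampling probabilities over model pairs for the next battle.

    sigma2 : estimated variance of the win-rate estimate for each pair
             (the diagonal entries of the estimated covariance matrix)
    counts : number of votes observed so far for each pair
    eps    : exploration floor so every pair keeps nonzero probability
    """
    n = np.maximum(counts, 1)  # avoid division by zero for unseen pairs
    # Estimated reduction in CI width if this pair receives one more vote.
    gain = np.sqrt(sigma2 / n) - np.sqrt(sigma2 / (n + 1))
    if gain.sum() == 0:
        gain = np.ones_like(gain, dtype=float)  # fall back to uniform early on
    probs = gain / gain.sum()
    # Mix with a uniform floor so P_t(a) >= eps, preserving the exploration
    # condition the asymptotic theory relies on.
    k = len(probs)
    assert eps * k < 1, "exploration floor too large for this many pairs"
    probs = (1 - eps * k) * probs + eps
    return probs / probs.sum()

# Usage: sample the next pair index for a battle.
# rng = np.random.default_rng()
# next_pair = rng.choice(len(counts), p=pair_sampling_probs(sigma2, counts))
```

The gains are largest for closely matched pairs with few votes, which is where uniform random sampling spends comparisons least efficiently.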
Key Findings
- Scale and diversity: As of January 2024, ≈240K votes from ≈90K users across >100 languages (≈77% English, ≈5% Chinese; other languages each <2%) over >50 models. On average ≈8K votes per model; recently 1–2K votes per day, with spikes upon new model or leaderboard updates.
- Prompt diversity: Topic modeling identifies 600 clusters with a long-tail distribution; the largest cluster covers ≈1% of prompts and most clusters cover <0.5%. Cluster similarities are small, indicating diverse coverage (poetry, coding, math, medical, etc.).
- Discriminative power of prompts: In LLM-as-judge comparisons (GPT-4-Turbo as judge) on 30 prompts sampled from topic clusters, GPT-4-0613 strongly outperforms Llama-2-70B-chat on coding/reasoning clusters (e.g., Python Game Programming Challenge 96.7% win rate; C/C++ Multi-Threading 86.7%; SQL Assistance 73.3%), while on less problem-solving-oriented clusters the advantage narrows (e.g., Poetry Writing 66.7%; Linguistic Analysis & Wordplay 58.3%; Travel Planning 58.3%; Movie Recommendations 53.3%).
- Agreement with experts: On 160 sampled battles (GPT-4-Turbo vs. Llama-2-13B and GPT-4-Turbo vs. GPT-3.5-Turbo-0613), agreement between crowd users and experts ranges from 72.8% to 83.1%; expert–expert agreement is 79.3%–89.8%. GPT-4-as-judge shows similar agreement with both crowd and experts (~75%–81%). Corresponding GPT-4-Turbo win rates: vs. Llama-2-13B, crowd 81.2%, Expert 1 89.4%, Expert 2 86.9%, GPT-4 judge 78.8%; vs. GPT-3.5-Turbo, crowd 76.3%, Expert 1 82.5%, Expert 2 89.4%, GPT-4 judge 79.4%.
- Ranking intervals and coverage: Simulations show that uncorrected sandwich-based confidence intervals for BT coefficients achieve near-nominal coverage (≈1−α) across varying numbers of models, with interval widths decreasing as samples increase and widening as more models are added.
- Active sampling efficiency: In a replay simulation with 213,576 historical votes, the active sampling rule reduces the number of samples needed compared to random sampling. To estimate the win matrix θ* to precision 0.2, random sampling needs ~6,800 samples versus ~4,400 for adaptive (≈54% more for random); to estimate BT scores to precision 0.3, random needs ~17,200 versus ~16,400 for adaptive (~5% more). The improvement holds across the range of target precisions.
- Anomalous user detection: On a curated evaluation set (25 anomalous and 25 normal users), the sequential Fisher-based detector achieves up to ~90% true positive rate and ~60–70% true negative rate depending on α; harder-to-detect cases arise from intermittent abnormal behavior (a code sketch of the test follows this list).
- Safety/moderation: ~3% of user requests are flagged by moderation; model-identity mentions are filtered to preserve anonymity.
- Community impact: The platform has become a widely referenced leaderboard, with collaboration from major model developers, a commitment to open data and code, and the release of a >100K pairwise preference dataset.
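The sequential Fisher combination test behind the anomaly-detection results can be sketched as follows. Per-vote p-values p_i (obtained by comparing each vote with the historical vote distribution for that model pair, as described in the Methodology) are combined into M_j = −2 Σ_{i≤j} log p_i and compared against a chi-square threshold with 2j degrees of freedom. The checkpoint schedule and the p-value computation itself are placeholders here, not the authors' exact implementation.

```python
import numpy as np
from scipy.stats import chi2

def flag_anomalous_user(p_values: list[float], checkpoints: list[int],
                        alpha: float = 0.1) -> bool:
    """Sequentially apply Fisher's combination test to a user's vote p-values.

    p_values    : one exchangeability-based p-value per vote, in arrival order
    checkpoints : vote counts j at which the test is run (placeholder schedule)
    alpha       : overall level, split across checkpoints Bonferroni-style
                  (matching the chi^2_{2j, 1 - alpha/5} threshold when five
                  checkpoints are used)
    """
    per_test_alpha = alpha / len(checkpoints)
    for j in checkpoints:
        if j > len(p_values):
            break  # not enough votes yet to run this checkpoint
        m_j = -2.0 * np.sum(np.log(p_values[:j]))       # Fisher's statistic
        threshold = chi2.ppf(1 - per_test_alpha, df=2 * j)
        if m_j > threshold:
            return True  # votes look inconsistent with historical behavior
    return False
```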
Discussion
The findings support that crowdsourced, live pairwise comparisons can credibly evaluate LLMs aligned with human preferences. Diverse, user-generated prompts effectively discriminate model strengths across domains (e.g., coding vs. creative writing), addressing limitations of static, closed-ended benchmarks. High agreement between crowd votes and expert judgments, comparable to inter-expert agreement, indicates that aggregated crowd preferences are a reliable proxy for expert evaluation. Statistically principled BT modeling with robust intervals and multiplicity control enables uncertainty-aware rankings. The active sampling strategy improves sample efficiency without sacrificing validity, facilitating scalable evaluation across many models. Together, these results demonstrate that open, live, human-preference-based evaluation can produce trustworthy, up-to-date leaderboards reflecting real-world usage.
Conclusion
The paper introduces Chatbot Arena, an open, large-scale platform for evaluating LLMs via crowdsourced pairwise preferences. It demonstrates that live, diverse prompts can distinguish models across topics; crowd judgments align well with experts; and statistically grounded methods (BT modeling with robust intervals and active sampling) yield efficient, uncertainty-aware rankings. The authors commit to open-sourcing data and code and releasing a human preference dataset with over 100K pairwise votes. Future work includes topic-specific leaderboards, expanding to multimodal and agent tasks in more dynamic settings, and refining anomaly detection with anytime-valid inference frameworks (e.g., E-values).
Limitations
The user base likely skews toward LLM hobbyists and researchers, introducing potential sampling bias. The data predominantly come from an online chat interface, which may not reflect production or specialized domain use, potentially skewing prompt distributions. The study focuses on helpfulness and does not explicitly evaluate safety aspects. Additionally, some user prompts lack clear ground truth, contributing to residual disagreement among evaluators.