LMSYS Chatbot Arena
A live, crowd-sourced evaluation in which users chat with two anonymous models side by side and vote for the better response; votes are converted into Elo-style ratings.
Each user submits a prompt, receives responses from two anonymous models, and selects the one they prefer (or declares a tie). Votes are aggregated into Bradley-Terry (Elo-style) ratings. Over one million human votes have been collected, spanning categories such as overall, coding, math, and hard prompts.
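To make the rating mechanics concrete, here is a minimal sketch of sequential Elo-style updates from pairwise votes. The actual leaderboard fits a Bradley-Terry model over the full vote set rather than applying updates one at a time, and the model names, starting ratings, and K-factor below are illustrative assumptions, not Arena parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo/Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, a: str, b: str, outcome: float, k: float = 32) -> None:
    """Update both ratings in place.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (outcome - e_a)
    ratings[b] += k * ((1 - outcome) - (1 - e_a))

# Hypothetical models, both starting at a 1000 rating.
ratings = {"model-x": 1000.0, "model-y": 1000.0}
record_vote(ratings, "model-x", "model-y", 1.0)  # one vote for model-x
print(ratings)  # model-x gains what model-y loses
```

Batch Bradley-Terry fitting (as used in practice) is order-independent and more statistically efficient, but the per-vote update above conveys the same underlying pairwise-comparison model.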
Covers both commercial API models (GPT-4o, Claude, Gemini) and open-weight models, all ranked through the same blind voting rather than provider-reported benchmarks. (This distinguishes it from the Hugging Face Open LLM Leaderboard, which lists open-weight models only.)
| # | Model | Arena Elo |
|---|---|---|
| 1 | GPT-4o | 1,285 |