A harder variant of MMLU with 10 answer choices instead of 4 and more reasoning-intensive questions, which reduces noise from random guessing.
Methodology
12,032 questions across 14 domains, expanded from MMLU's 4-choice format to a 10-choice format. Questions are filtered for difficulty and augmented with reasoning-heavy problems from STEM sources. The larger option set lowers the random-guess baseline from 25% to 10%, so lucky guesses contribute much less to a score.
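The effect of the option count on the guessing baseline can be checked with a little arithmetic. The sketch below is illustrative only: the function names `random_baseline` and `chance_corrected` are hypothetical, it assumes every question has exactly the stated number of choices, and the chance-corrected rescaling is a common convention shown for intuition, not necessarily how any leaderboard computes its scores.

```python
def random_baseline(num_choices: int) -> float:
    """Expected accuracy from uniform random guessing over num_choices options."""
    return 1.0 / num_choices


def chance_corrected(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    Assumption: this linear rescaling is for illustration, not a documented
    scoring rule of the benchmark.
    """
    baseline = random_baseline(num_choices)
    return (accuracy - baseline) / (1.0 - baseline)


mmlu_baseline = random_baseline(4)       # 4-choice format (MMLU)
mmlu_pro_baseline = random_baseline(10)  # 10-choice format

print(f"4-choice random baseline:  {mmlu_baseline:.0%}")
print(f"10-choice random baseline: {mmlu_pro_baseline:.0%}")
```

Under this rescaling, the same raw accuracy sits further above chance in the 10-choice format than in the 4-choice one, which is one way to see why guessing adds less noise.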
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.