Expert-level multiple-choice questions in biology, chemistry, and physics. The Diamond subset contains the hardest questions verified by multiple domain experts.
Methodology
448 expert-crafted multiple-choice questions across biology, chemistry, and physics. Each question was validated by at least two domain experts to ensure it cannot be answered through web search alone. Models are evaluated on zero-shot or few-shot accuracy.
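As a rough illustration of the accuracy metric, the sketch below scores a list of predicted answer letters against the gold letters. The function name `accuracy` and the sample data are hypothetical and chosen for illustration only; the leaderboard's actual scoring code may extract and normalize answers differently.

```python
def accuracy(predictions, answers):
    """Return the fraction of questions whose predicted letter matches the gold letter."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("no questions to score")
    # Compare case-insensitively and ignore surrounding whitespace (an assumption).
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up example data, not real benchmark results.
gold = ["A", "C", "B", "D"]
pred = ["A", "c", "D", "D"]
score = accuracy(pred, gold)
print(f"accuracy = {score:.2%}")
```

The same function applies to zero-shot and few-shot runs; only the prompt given to the model differs, not the way answers are scored.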
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their scores come from provider-reported benchmarks.