BIG-Bench Hard
A challenging subset of BIG-Bench tasks where models previously failed to match average human performance
BIG-Bench Hard (BBH) is a suite of 23 challenging tasks drawn from the BIG-Bench suite, on which prior language models scored below the average human rater. Tasks include date understanding, logical deduction, and tracking shuffled objects. Models are evaluated with 3-shot chain-of-thought prompting: each test question is preceded by three worked exemplars that reason step by step before giving a final answer.
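Below is a minimal sketch of that setup for a single BBH task (date understanding), assuming the Hugging Face `transformers` API. It is not the official harness: the model ID, the exemplars, and the answer-extraction regex are illustrative stand-ins, and the real benchmark ships its own fixed 3-shot chain-of-thought prompts per task.

```python
# Sketch: 3-shot chain-of-thought evaluation on a BBH-style question.
# Assumptions (not from this page): the model ID and exemplars are
# illustrative; the official BBH prompts come with the benchmark itself.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # any causal LM from the table below

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Three worked exemplars precede every test question ("3-shot"); each one
# reasons out loud before answering ("chain of thought").
EXEMPLARS = """Q: Today is 12/31/2020. What is the date tomorrow in MM/DD/YYYY?
A: Let's think step by step. December 31 is the last day of the year, so the date rolls over to a new year. The answer is 01/01/2021.

Q: Yesterday was 04/30/2021. What is the date today in MM/DD/YYYY?
A: Let's think step by step. Today is one day after 04/30/2021, and April has 30 days, so today is the first of May. The answer is 05/01/2021.

Q: Today is 03/15/2022. What was the date one week ago in MM/DD/YYYY?
A: Let's think step by step. One week is 7 days, and 15 - 7 = 8, so the date was March 8. The answer is 03/08/2022.

"""


def answer(question: str) -> str:
    """Generate a CoT completion and extract the final 'The answer is X' span."""
    prompt = EXEMPLARS + f"Q: {question}\nA: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"The answer is (.+?)\.", completion)
    return match.group(1) if match else completion.strip()


print(answer("Today is 02/28/2020. What is the date tomorrow in MM/DD/YYYY?"))
```

Per-task accuracy is then the fraction of exact matches against the gold answers; the leaderboard reports a single score aggregated over all 23 tasks.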
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their published scores come from provider-reported evaluations instead.
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B Instruct | 56.5% |
| 2 | Qwen2.5 32B | 54.0% |
| 3 | Gemma 2 27B IT | 49.3% |
| 4 | Phi-3 Medium 128k Instruct | 48.5% |
| 5 | Qwen2.5 14B Instruct | 48.4% |
| 6 | Gemma 2 9B IT | 42.1% |
| 7 | Qwen2.5 7B | 35.8% |
| 8 | Qwen2.5 7B Instruct | 34.9% |
| 9 | Llama 3.1 8B Instruct | 29.4% |
| 10 | Qwen2.5 Coder 7B Instruct | 28.9% |
| 11 | Phi-2 | 28.0% |
| 12 | Qwen2.5 3B Instruct | 25.8% |
| 13 | Meta Llama 3 8B | 24.5% |
| 14 | Llama 3.2 3B Instruct | 24.1% |
| 15 | Mistral 7B Instruct v0.2 | 22.9% |
| 16 | Qwen2.5 1.5B Instruct | 19.8% |
| 17 | Gemma 2 2B IT | 18.0% |
| 18 | Llama 3.2 3B | 14.2% |
| 19 | Qwen2 1.5B Instruct | 13.7% |
| 20 | Llama 3.2 1B Instruct | 8.7% |
| 21 | Qwen2.5 0.5B Instruct | 8.4% |
| 22 | GPT-J 6B | 4.9% |
| 23 | Falcon 7B Instruct | 4.8% |
| 24 | Llama 3.2 1B | 4.4% |
| 25 | TinyLlama 1.1B Chat v1.0 | 4.0% |
| 26 | GPT-2 Large | 3.3% |
| 27 | DistilGPT2 | 2.8% |
| 28 | GPT-2 | 2.7% |
| 29 | Pythia 160M | 2.2% |