BIG-Bench Hard
A challenging subset of BIG-Bench tasks where models previously failed to match average human performance
BIG-Bench Hard (BBH) is a suite of 23 challenging tasks drawn from the BIG-Bench suite, on which prior language models scored below the average human rater. Tasks include date understanding, logical deduction, and tracking shuffled objects. Models are evaluated with 3-shot chain-of-thought prompting: each test question is preceded by three worked exemplars that reason step by step before giving a final answer.
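Below is a minimal sketch of that setup for a single BBH task (date understanding), assuming the Hugging Face `transformers` API. It is not the official harness: the model ID, the exemplars, and the answer-extraction regex are illustrative stand-ins, and the real benchmark ships its own fixed 3-shot chain-of-thought prompts per task.

```python
# Sketch: 3-shot chain-of-thought evaluation on a BBH-style question.
# Assumptions (not from this page): the model ID and exemplars are
# illustrative; the official BBH prompts come with the benchmark itself.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # any causal LM from the table below

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Three worked exemplars precede every test question ("3-shot"); each one
# reasons out loud before answering ("chain of thought").
EXEMPLARS = """Q: Today is 12/31/2020. What is the date tomorrow in MM/DD/YYYY?
A: Let's think step by step. December 31 is the last day of the year, so the date rolls over to a new year. The answer is 01/01/2021.

Q: Yesterday was 04/30/2021. What is the date today in MM/DD/YYYY?
A: Let's think step by step. Today is one day after 04/30/2021, and April has 30 days, so today is the first of May. The answer is 05/01/2021.

Q: Today is 03/15/2022. What was the date one week ago in MM/DD/YYYY?
A: Let's think step by step. One week is 7 days, and 15 - 7 = 8, so the date was March 8. The answer is 03/08/2022.

"""


def answer(question: str) -> str:
    """Generate a CoT completion and extract the final 'The answer is X' span."""
    prompt = EXEMPLARS + f"Q: {question}\nA: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"The answer is (.+?)\.", completion)
    return match.group(1) if match else completion.strip()


print(answer("Today is 02/28/2020. What is the date tomorrow in MM/DD/YYYY?"))
```

Per-task accuracy is then the fraction of exact matches against the gold answers; the leaderboard reports a single score aggregated over all 23 tasks.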
Shows open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to the Open LLM Leaderboard; their published scores come from provider-reported evaluations instead.
| # | Model | Score |
|---|---|---|
| 1 | Qwen2.5 32B Instruct | 56.5% |
| 2 | Qwen2.5 32B | 54.0% |
| 3 | Gemma 2 27B IT | 49.3% |
| 4 | Phi-3 Medium 128k Instruct | 48.5% |
| 5 | Qwen2.5 14B Instruct | 48.4% |
| 6 | Gemma 2 9B IT | 42.1% |
| 7 | Qwen2.5 7B | 35.8% |
| 8 | Qwen2.5 7B Instruct | 34.9% |
| 9 | Llama 3.1 8B Instruct | 29.4% |
| 10 | Qwen2.5 Coder 7B Instruct | 28.9% |
| 11 | Phi-2 | 28.0% |
| 12 | Qwen2.5 3B Instruct | 25.8% |
| 13 | Meta Llama 3 8B | 24.5% |
| 14 | Llama 3.2 3B Instruct | 24.1% |
| 15 | Mistral 7B Instruct v0.2 | 22.9% |
| 16 | Qwen2.5 1.5B Instruct | 19.8% |
| 17 | Gemma 2 2B IT | 18.0% |
| 18 | Llama 3.2 3B | 14.2% |
| 19 | Qwen2 1.5B Instruct | 13.7% |
| 20 | Llama 3.2 1B Instruct | 8.7% |
| 21 | Qwen2.5 0.5B Instruct | 8.4% |
| 22 | GPT-J 6B | 4.9% |
| 23 | Falcon 7B Instruct | 4.8% |
| 24 | Llama 3.2 1B | 4.4% |
| 25 | TinyLlama 1.1B Chat v1.0 | 4.0% |
| 26 | GPT-2 Large | 3.3% |
| 27 | DistilGPT2 | 2.8% |
| 28 | GPT-2 | 2.7% |
| 29 | Pythia 160M | 2.2% |