164 Python programming problems requiring code generation, evaluated by executing tests
164 hand-crafted Python programming problems, each consisting of a function signature and docstring. Models generate the function body, which is run against hidden unit tests. The primary metric is pass@1: the percentage of problems solved on the first attempt.
Note: the Open LLM Leaderboard accepts open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to it, so the scores shown for them below come from provider-reported benchmarks.
| # | Model | pass@1 |
|---|---|---|
| 1 | GPT-4o | 90.2% |
| 2 | Gemini 2.0 Flash | 89.0% |
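The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the official harness: the record fields (`prompt`, `test`) and the `generate` callable are assumptions mirroring the problem format, and a real evaluator must sandbox `exec` since it runs untrusted model output.

```python
def evaluate_pass_at_1(problems, generate):
    """Score a model on first-attempt correctness (pass@1).

    `problems` is a list of dicts with a `prompt` (signature + docstring)
    and a `test` (assert-based unit tests) -- hypothetical field names.
    `generate` maps a prompt to a candidate function body.
    WARNING: `exec` on untrusted code is unsafe; the real harness sandboxes it.
    """
    solved = 0
    for prob in problems:
        candidate = prob["prompt"] + generate(prob["prompt"])
        env = {}
        try:
            exec(candidate, env)      # define the candidate function
            exec(prob["test"], env)   # run the hidden unit tests
            solved += 1               # no assertion failed: first attempt passes
        except Exception:
            pass                      # any error or failed assert counts as a miss
    return solved / len(problems)
```

Because pass@1 gives a single attempt per problem, greedy (temperature 0) decoding is typically used when reporting it.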