164 Python programming problems requiring code generation, evaluated by executing tests
164 hand-crafted Python programming problems, each consisting of a function signature and docstring. Models generate the function body, which is run against hidden unit tests. The primary metric is pass@1: the percentage of problems solved on the first attempt.
Note: the Open LLM Leaderboard accepts open-weight models only. Commercial API models (GPT-4o, Claude, Gemini) are not submitted to it, so the scores shown for them below come from provider-reported benchmarks.
| # | Model | pass@1 |
|---|---|---|
| 1 | GPT-4o | 90.2% |
| 2 | Gemini 2.0 Flash | 89.0% |
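The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the official harness: the record fields (`prompt`, `test`) and the `generate` callable are assumptions mirroring the problem format, and a real evaluator must sandbox `exec` since it runs untrusted model output.

```python
def evaluate_pass_at_1(problems, generate):
    """Score a model on first-attempt correctness (pass@1).

    `problems` is a list of dicts with a `prompt` (signature + docstring)
    and a `test` (assert-based unit tests) -- hypothetical field names.
    `generate` maps a prompt to a candidate function body.
    WARNING: `exec` on untrusted code is unsafe; the real harness sandboxes it.
    """
    solved = 0
    for prob in problems:
        candidate = prob["prompt"] + generate(prob["prompt"])
        env = {}
        try:
            exec(candidate, env)      # define the candidate function
            exec(prob["test"], env)   # run the hidden unit tests
            solved += 1               # no assertion failed: first attempt passes
        except Exception:
            pass                      # any error or failed assert counts as a miss
    return solved / len(problems)
```

Because pass@1 gives a single attempt per problem, greedy (temperature 0) decoding is typically used when reporting it.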