A standardized test suite used to measure and compare model performance on specific tasks.
Code generation, debugging, explanation, and refactoring
Multi-step reasoning, logic puzzles, and mathematical problem-solving
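
As a rough illustration of the test-suite definition above, the following Python sketch shows one way a benchmark could score a model: a fixed set of cases, the same scoring rule for every model, and one accuracy figure per task. The names `CASES`, `run_benchmark`, and `toy_model`, the task labels, and the exact-match scoring rule are all assumptions made for this example and do not come from any particular benchmark.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical benchmark cases: (task, prompt, expected answer).
# Task labels and prompts are illustrative, not from a real benchmark.
CASES: List[Tuple[str, str, str]] = [
    ("math", "2 + 2", "4"),
    ("math", "3 * 5", "15"),
    ("logic", "If A implies B and A is true, is B true?", "yes"),
]


def run_benchmark(model: Callable[[str], str],
                  cases: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Return exact-match accuracy per task for one model."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task, prompt, expected in cases:
        total[task] += 1
        # Assumed scoring rule: exact match after trimming and lowercasing.
        if model(prompt).strip().lower() == expected.strip().lower():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def toy_model(prompt: str) -> str:
    """Stand-in model that knows exactly one answer."""
    return "4" if prompt == "2 + 2" else "unknown"


scores = run_benchmark(toy_model, CASES)
print(scores)  # {'math': 0.5, 'logic': 0.0}
```

Because the cases and the scoring rule stay fixed, running a second model through `run_benchmark` produces numbers that can be compared directly against the first.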