Evaluates models on real-world terminal and shell tasks, including file manipulation, system commands, and scripting.
Models are given access to a terminal and asked to complete practical shell tasks. The tasks cover bash scripting, file operations, process management, and system administration. Success is judged by whether the desired system state is achieved.
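As a rough illustration of judging by system state, the sketch below runs a shell command in a temporary directory and then inspects the resulting files rather than the command text. It is a minimal sketch under stated assumptions, not the benchmark's actual harness: the function names (`run_task`, `check_state`, `evaluate`), the target path `out/greeting.txt`, and the expected content `hello` are all hypothetical, and it assumes a POSIX `sh` is available.

```python
import subprocess
import tempfile
from pathlib import Path


def run_task(command: str, workdir: Path) -> int:
    """Run a shell command inside workdir and return its exit code."""
    result = subprocess.run(
        ["sh", "-c", command],  # assumes a POSIX shell is on PATH
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode


def check_state(workdir: Path) -> bool:
    """Hypothetical success check: the expected file exists with the expected content."""
    target = workdir / "out" / "greeting.txt"
    return target.is_file() and target.read_text().strip() == "hello"


def evaluate(command: str) -> bool:
    """Run the command in a fresh directory, then judge only the final state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        run_task(command, workdir)
        return check_state(workdir)


if __name__ == "__main__":
    # Any command that leaves the right file behind passes the check.
    print(evaluate("mkdir -p out && echo hello > out/greeting.txt"))
    # A command that only prints text leaves the state unchanged and fails.
    print(evaluate("echo hello"))
```

Checking the final state rather than the command string means that different command sequences reaching the same result are treated alike.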
No model scores recorded yet.
Scores will appear here as the pipeline processes model data.