The percentage of benchmark questions a model answers correctly, as scored by the benchmark's standard evaluation metric.
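
As a rough illustration of this definition, the sketch below computes the percentage under one simple assumption: an answer counts as correct only if it exactly matches the reference answer. The function name `accuracy_percent` and the sample answers are hypothetical, and real benchmarks may score correctness differently (for example with normalization or partial credit).

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match their reference.

    Assumption: exact string match defines a "correct" answer.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty benchmark")
    # Count the positions where the model's answer equals the reference answer.
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical model answers versus reference answers: 3 of 4 match.
score = accuracy_percent(["A", "C", "B", "D"], ["A", "B", "B", "D"])
print(score)  # 75.0
```

The empty-benchmark check avoids a division by zero; whether that case should raise or return a default value is a design choice left to the caller.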