Code review agents currently miss most of the issues that human reviewers catch, but they often flag different problems, which creates an opportunity for AI-assisted rather than AI-automated code review in real teams.
This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether agents catch the same issues human reviewers do.
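The section does not specify how agent output is compared against human reviews. As a rough sketch of the kind of overlap measurement such a benchmark implies, one common simplification is to reduce each review comment to the location it targets and compute recall and precision of agent flags against human flags. The `Issue` class, the location-based matching, and `overlap_metrics` below are illustrative assumptions, not c-CRAB's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """A review comment, reduced to the file and line it targets (a simplification)."""
    file: str
    line: int


def overlap_metrics(human: set[Issue], agent: set[Issue]) -> dict[str, float]:
    """Recall: fraction of human-flagged issues the agent also flags.
    Precision: fraction of agent flags that match a human flag."""
    matched = human & agent
    return {
        "recall": len(matched) / len(human) if human else 0.0,
        "precision": len(matched) / len(agent) if agent else 0.0,
    }


# Toy example mirroring the headline finding: the agent catches only one of
# three human-flagged issues, and two of its flags are problems humans missed.
human = {Issue("auth.py", 42), Issue("auth.py", 88), Issue("db.py", 7)}
agent = {Issue("auth.py", 42), Issue("util.py", 3), Issue("db.py", 99)}
print(overlap_metrics(human, agent))  # {'recall': 0.33..., 'precision': 0.33...}
```

Under this toy matching rule, low recall with disjoint agent flags is exactly the pattern that motivates an assistive rather than fully automated workflow: the agent's non-overlapping flags may still add value alongside a human reviewer.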