Code review agents currently miss most of the issues that human reviewers catch, but they often flag different problems, which creates an opportunity for AI-assisted rather than AI-automated code review in real teams.
This paper introduces c-CRAB, a benchmark dataset for evaluating AI agents that perform code review on pull requests. The dataset is built from human reviews and includes automated tests to assess whether agents catch the same issues human reviewers do.
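The section does not specify how agent output is compared against human reviews. As a rough sketch of the kind of overlap measurement such a benchmark implies, one common simplification is to reduce each review comment to the location it targets and compute recall and precision of agent flags against human flags. The `Issue` class, the location-based matching, and `overlap_metrics` below are illustrative assumptions, not c-CRAB's actual protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """A review comment, reduced to the file and line it targets (a simplification)."""
    file: str
    line: int


def overlap_metrics(human: set[Issue], agent: set[Issue]) -> dict[str, float]:
    """Recall: fraction of human-flagged issues the agent also flags.
    Precision: fraction of agent flags that match a human flag."""
    matched = human & agent
    return {
        "recall": len(matched) / len(human) if human else 0.0,
        "precision": len(matched) / len(agent) if agent else 0.0,
    }


# Toy example mirroring the headline finding: the agent catches only one of
# three human-flagged issues, and two of its flags are problems humans missed.
human = {Issue("auth.py", 42), Issue("auth.py", 88), Issue("db.py", 7)}
agent = {Issue("auth.py", 42), Issue("util.py", 3), Issue("db.py", 99)}
print(overlap_metrics(human, agent))  # {'recall': 0.33..., 'precision': 0.33...}
```

Under this toy matching rule, low recall with disjoint agent flags is exactly the pattern that motivates an assistive rather than fully automated workflow: the agent's non-overlapping flags may still add value alongside a human reviewer.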