When using LLM-as-a-judge for evaluation, avoid anchoring on the best or worst model; choose a mediocre one instead. Anchor selection matters as much as the choice of judge model, and most benchmarks are too small to reliably separate competitive models.
This paper shows that choosing the right reference model (anchor) for LLM-as-a-judge evaluation is critical but often overlooked. The researchers tested 22 different anchors and found that extreme choices (the strongest or weakest models) make poor anchors because comparisons against them fail to distinguish between similar candidate models.
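The intuition can be sketched with a toy Bradley-Terry win-probability model (my illustration, not the paper's method; the skill scores below are made up): against a far weaker or far stronger anchor, two similar models both win or both lose almost every comparison, so their measured win rates barely differ, while a mid-strength anchor maximizes the gap between them.

```python
import math

def win_prob(skill_model: float, skill_anchor: float) -> float:
    # Bradley-Terry style probability that the model beats the anchor
    # in a pairwise judgment (logistic in the skill difference).
    return 1.0 / (1.0 + math.exp(-(skill_model - skill_anchor)))

# Two similar candidate models with hypothetical skill scores.
model_a, model_b = 1.0, 1.2

# Compare measured win rates under a weak, mediocre, and strong anchor.
for name, anchor in [("weakest", -4.0), ("mediocre", 1.1), ("strongest", 6.0)]:
    pa = win_prob(model_a, anchor)
    pb = win_prob(model_b, anchor)
    print(f"{name:9s} anchor: A wins {pa:.3f}, B wins {pb:.3f}, gap {pb - pa:.3f}")
```

Under the extreme anchors the win-rate gap between A and B nearly vanishes, while the mediocre anchor (skill near both candidates) yields win rates close to 0.5 and the largest gap, which is what makes the two models distinguishable given a finite benchmark.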