When using LLM-as-a-judge for evaluation, avoid anchoring on the best or worst model; choose a mediocre one instead. Anchor selection matters as much as the choice of judge model, and most benchmarks are too small to reliably separate competitive models.
This paper shows that choosing the right reference model (anchor) for LLM-as-a-judge evaluation is critical but often overlooked. The researchers tested 22 different anchors and found that extreme choices (the strongest or weakest models) make poor anchors because comparisons against them fail to distinguish between similar candidate models.
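The intuition can be sketched with a toy Bradley-Terry win-probability model (my illustration, not the paper's method; the skill scores below are made up): against a far weaker or far stronger anchor, two similar models both win or both lose almost every comparison, so their measured win rates barely differ, while a mid-strength anchor maximizes the gap between them.

```python
import math

def win_prob(skill_model: float, skill_anchor: float) -> float:
    # Bradley-Terry style probability that the model beats the anchor
    # in a pairwise judgment (logistic in the skill difference).
    return 1.0 / (1.0 + math.exp(-(skill_model - skill_anchor)))

# Two similar candidate models with hypothetical skill scores.
model_a, model_b = 1.0, 1.2

# Compare measured win rates under a weak, mediocre, and strong anchor.
for name, anchor in [("weakest", -4.0), ("mediocre", 1.1), ("strongest", 6.0)]:
    pa = win_prob(model_a, anchor)
    pb = win_prob(model_b, anchor)
    print(f"{name:9s} anchor: A wins {pa:.3f}, B wins {pb:.3f}, gap {pb - pa:.3f}")
```

Under the extreme anchors the win-rate gap between A and B nearly vanishes, while the mediocre anchor (skill near both candidates) yields win rates close to 0.5 and the largest gap, which is what makes the two models distinguishable given a finite benchmark.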