By explicitly comparing correct and incorrect reasoning traces during training, you can improve reasoning-model performance without extra sampling or auxiliary models: the gains come purely from restructuring how the model learns from existing data.
This paper improves Group Relative Policy Optimization (GRPO), a reinforcement-learning method for training reasoning models, by having the model learn from contrasts between correct and incorrect solutions within the same batch. It introduces two techniques: Bilateral Context Conditioning, which lets the model directly compare successful and failed reasoning traces, and Reward-Confidence Correction, which stabilizes training by adjusting the reward baseline.
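To ground the contrast idea, here is a minimal sketch of the standard GRPO advantage computation that the paper builds on. In GRPO, each sampled completion's reward is normalized against its group's mean and standard deviation, so correct and incorrect solutions in the same batch naturally receive advantages of opposite sign. This sketch shows only that baseline mechanism, not the paper's Bilateral Context Conditioning or Reward-Confidence Correction, whose exact formulations are not given here; the function name is my own.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    group's mean and standard deviation (standard GRPO baseline).

    With a binary correctness reward, correct and incorrect completions
    in the same group get advantages of opposite sign, which is the
    within-batch contrast the model learns from.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# A group holding both correct (reward 1.0) and incorrect (reward 0.0) solutions:
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # correct traces get +1.0, incorrect get -1.0
```

When every completion in a group succeeds (or every one fails), the normalized advantages collapse to zero, which is one motivation for methods that extract extra signal from mixed-outcome batches.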