By explicitly comparing correct and incorrect reasoning traces during training, you can improve reasoning-model performance without extra sampling or auxiliary models: the gains come purely from restructuring how the model learns from existing data.
This paper improves Group Relative Policy Optimization (GRPO), a reinforcement-learning method for training reasoning models, by having the model learn from contrasts between correct and incorrect solutions within the same batch. It introduces two techniques: Bilateral Context Conditioning, which lets the model directly compare successful and failed reasoning traces, and Reward-Confidence Correction, which stabilizes training by adjusting the reward baseline.
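To ground the contrast idea, here is a minimal sketch of the standard GRPO advantage computation that the paper builds on. In GRPO, each sampled completion's reward is normalized against its group's mean and standard deviation, so correct and incorrect solutions in the same batch naturally receive advantages of opposite sign. This sketch shows only that baseline mechanism, not the paper's Bilateral Context Conditioning or Reward-Confidence Correction, whose exact formulations are not given here; the function name is my own.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward against the
    group's mean and standard deviation (standard GRPO baseline).

    With a binary correctness reward, correct and incorrect completions
    in the same group get advantages of opposite sign, which is the
    within-batch contrast the model learns from.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# A group holding both correct (reward 1.0) and incorrect (reward 0.0) solutions:
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # correct traces get +1.0, incorrect get -1.0
```

When every completion in a group succeeds (or every one fails), the normalized advantages collapse to zero, which is one motivation for methods that extract extra signal from mixed-outcome batches.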