When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO — ThinkLLM