Cross-modal inconsistencies in multimodal models are not just failures to hide; they are valuable training signals that, when enforced through cycle consistency, improve reasoning accuracy by up to 7.6 points and reduce systematic biases.
This paper introduces RC2, a reinforcement learning approach that improves multimodal AI models by enforcing consistency between visual and textual understanding. Instead of ignoring cases where a model gives contradictory answers for the same concept across modalities, the method treats these conflicts as training signals.
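The core idea can be sketched as a reward term: query the model through a visual path and a textual path, then reward agreement and penalize contradiction. The sketch below is a minimal illustration under assumed details (string-normalized answer matching, a fixed mixing weight `lam`); the function names and the specific reward shaping are hypothetical, not the paper's actual implementation.

```python
def cycle_consistency_reward(answer_visual: str, answer_textual: str) -> float:
    """Return +1.0 when the two modality-specific answers agree, -1.0 otherwise.

    Agreement here is a naive case-insensitive string match; a real system
    would likely use a learned or semantic comparison (an assumption).
    """
    norm = lambda s: s.strip().lower()
    return 1.0 if norm(answer_visual) == norm(answer_textual) else -1.0


def total_reward(task_reward: float, ans_visual: str, ans_textual: str,
                 lam: float = 0.5) -> float:
    """Combine the ordinary task reward with the consistency signal.

    lam controls how strongly cross-modal agreement is enforced; its value
    and the additive combination are illustrative assumptions.
    """
    return task_reward + lam * cycle_consistency_reward(ans_visual, ans_textual)


# Toy usage: same concept answered via both modalities.
print(total_reward(1.0, "Cat", "cat "))   # agreement boosts the reward
print(total_reward(1.0, "cat", "dog"))    # contradiction reduces it
```

In an RL loop, `total_reward` would replace the plain task reward when computing policy-gradient updates, so the model is penalized whenever its visual and textual answers diverge.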