SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart et al. | March 30, 2026 | arXiv

Key Takeaway

Video-language models can serve directly as the reward signal for robot learning when they are trained for spatiotemporal reasoning and grounded in continuous progress supervision, enabling robots to learn new tasks without hand-crafted rewards.

Summary

SOLE-R1 is a video-language model that watches robot videos and reasons step-by-step about task progress to provide reward signals for robot learning. Unlike standard vision-language models, it is designed to handle partial views and changing conditions, which helps keep robots from gaming the reward signal (reward hacking).
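As a rough illustration of how a progress estimate could feed a reinforcement-learning loop, the sketch below turns per-timestep progress values into rewards. It is not SOLE-R1's actual interface: the `estimate_progress` callable, the sliding `window`, and the choice to reward the change in progress between steps are all assumptions made for this example, and the toy estimator stands in for a real video-language model.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a video frame; a real system would use image arrays.
Frame = Sequence[float]


def progress_rewards(
    frames: Sequence[Frame],
    estimate_progress: Callable[[Sequence[Frame]], float],
    window: int = 4,
) -> List[float]:
    """Convert progress estimates over a video into per-step rewards.

    Assumption: the estimator returns task progress in [0, 1] for a short clip,
    and the reward at step t is the change in progress since step t - 1.
    """
    rewards: List[float] = []
    previous = 0.0
    for t in range(len(frames)):
        # Show the estimator only the most recent `window` frames.
        clip = frames[max(0, t - window + 1): t + 1]
        # Clamp so a misbehaving estimator cannot leave the [0, 1] range.
        progress = min(1.0, max(0.0, estimate_progress(clip)))
        rewards.append(progress - previous)
        previous = progress
    return rewards


def toy_estimator(clip: Sequence[Frame]) -> float:
    """Toy estimator (assumption): reads progress from the last frame's first value."""
    return clip[-1][0]


episode = [[0.0], [0.25], [0.5], [1.0]]
rewards = progress_rewards(episode, toy_estimator)
```

With difference-based rewards, the episode return equals the final progress estimate, so a policy gains nothing from inflating progress at one step and losing it at the next. Whether SOLE-R1 uses this particular mapping is not stated in the summary.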

reasoning agents multimodal

Key Terms

chain-of-thought reward-model reinforcement-learning vision-language-model reward-hacking