By explicitly training vision-language models to reconstruct 3D scene geometry and camera position from video, you can dramatically improve their spatial reasoning and localization abilities without changing the model architecture.
Loc3R-VLM adds 3D spatial understanding to vision-language models by training them on video with two objectives: reconstructing the scene's 3D geometry and estimating the camera's viewpoint. The resulting models are better at grounding objects in 3D space and answering questions about a scene from different perspectives, outperforming existing 2D and video-based methods.
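To make the dual-objective setup concrete, here is a minimal PyTorch sketch of what such training heads and loss could look like: a geometry head that regresses a per-patch 3D pointmap and a pose head that regresses a per-frame camera pose, with their losses added to the usual language-modeling loss. This is an illustration under assumptions, not the paper's implementation; all names (`SpatialAuxHeads`, `spatial_loss`), head shapes, and loss weights are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialAuxHeads(nn.Module):
    """Hypothetical auxiliary heads on top of a VLM's vision tokens.

    `geometry_head` regresses a per-patch 3D point (scene reconstruction);
    `pose_head` regresses a per-frame camera pose (viewpoint modeling).
    """

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Per-patch 3D point regression (x, y, z) -- scene-geometry objective.
        self.geometry_head = nn.Linear(hidden_dim, 3)
        # Per-frame camera pose: 3-DoF translation + quaternion rotation (7 values).
        self.pose_head = nn.Linear(hidden_dim, 7)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (batch, frames, patches, hidden_dim) from the VLM encoder.
        pointmap = self.geometry_head(patch_tokens)      # (B, F, P, 3)
        # Pool patch tokens per frame before predicting that frame's pose.
        pose = self.pose_head(patch_tokens.mean(dim=2))  # (B, F, 7)
        return pointmap, pose


def spatial_loss(pointmap, pose, gt_pointmap, gt_pose,
                 w_geom: float = 1.0, w_pose: float = 1.0):
    """Auxiliary loss, added to the language-modeling loss during training."""
    geom_loss = nn.functional.l1_loss(pointmap, gt_pointmap)
    pose_loss = nn.functional.l1_loss(pose, gt_pose)
    return w_geom * geom_loss + w_pose * pose_loss


# Toy usage: 2 videos, 8 frames, 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
heads = SpatialAuxHeads()
pred_points, pred_pose = heads(tokens)
aux = spatial_loss(pred_points, pred_pose,
                   torch.randn_like(pred_points), torch.randn_like(pred_pose))
total = aux  # + language_modeling_loss in the full objective
```

The key design point this sketch captures is that the spatial objectives are extra prediction heads and loss terms on the existing vision tokens, which is consistent with the claim that spatial reasoning improves without changing the model architecture.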