By explicitly training vision-language models to reconstruct 3D scene geometry and camera position from video, you can dramatically improve their spatial reasoning and localization abilities without changing the model architecture.
Loc3R-VLM adds 3D spatial understanding to vision-language models by training them on video with two objectives: reconstructing the scene's 3D geometry and estimating the camera's viewpoint. The resulting models are better at grounding objects in 3D space and answering questions about a scene from different perspectives, outperforming existing 2D and video-based methods.
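To make the dual-objective setup concrete, here is a minimal PyTorch sketch of what such training heads and loss could look like: a geometry head that regresses a per-patch 3D pointmap and a pose head that regresses a per-frame camera pose, with their losses added to the usual language-modeling loss. This is an illustration under assumptions, not the paper's implementation; all names (`SpatialAuxHeads`, `spatial_loss`), head shapes, and loss weights are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialAuxHeads(nn.Module):
    """Hypothetical auxiliary heads on top of a VLM's vision tokens.

    `geometry_head` regresses a per-patch 3D point (scene reconstruction);
    `pose_head` regresses a per-frame camera pose (viewpoint modeling).
    """

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Per-patch 3D point regression (x, y, z) -- scene-geometry objective.
        self.geometry_head = nn.Linear(hidden_dim, 3)
        # Per-frame camera pose: 3-DoF translation + quaternion rotation (7 values).
        self.pose_head = nn.Linear(hidden_dim, 7)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (batch, frames, patches, hidden_dim) from the VLM encoder.
        pointmap = self.geometry_head(patch_tokens)      # (B, F, P, 3)
        # Pool patch tokens per frame before predicting that frame's pose.
        pose = self.pose_head(patch_tokens.mean(dim=2))  # (B, F, 7)
        return pointmap, pose


def spatial_loss(pointmap, pose, gt_pointmap, gt_pose,
                 w_geom: float = 1.0, w_pose: float = 1.0):
    """Auxiliary loss, added to the language-modeling loss during training."""
    geom_loss = nn.functional.l1_loss(pointmap, gt_pointmap)
    pose_loss = nn.functional.l1_loss(pose, gt_pose)
    return w_geom * geom_loss + w_pose * pose_loss


# Toy usage: 2 videos, 8 frames, 196 patches, 768-dim tokens.
tokens = torch.randn(2, 8, 196, 768)
heads = SpatialAuxHeads()
pred_points, pred_pose = heads(tokens)
aux = spatial_loss(pred_points, pred_pose,
                   torch.randn_like(pred_points), torch.randn_like(pred_pose))
total = aux  # + language_modeling_loss in the full objective
```

The key design point this sketch captures is that the spatial objectives are extra prediction heads and loss terms on the existing vision tokens, which is consistent with the claim that spatial reasoning improves without changing the model architecture.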