Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves performance on spatial understanding tasks.
This paper examines how vision-language models handle spatial reasoning, that is, understanding where objects are and how they relate to one another. The researchers identify two mechanisms: the language model can process spatial relations on its own, but the vision encoder is the dominant source, encoding object layouts across the entire image, including background regions.
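To make the claim that background tokens carry layout information concrete, here is a minimal, hypothetical probing sketch (not the paper's code): it trains a linear probe on features mean-pooled over non-object patches of a vision encoder's token grid. All tensors here are synthetic stand-ins for real encoder outputs, object masks, and spatial-relation labels.

```python
# Hypothetical sketch: probe background-patch features for a spatial relation.
# Synthetic data stands in for real vision-encoder outputs and object masks.
import torch
import torch.nn as nn

torch.manual_seed(0)

num_samples, num_patches, hidden_dim = 512, 196, 768  # e.g. a 14x14 ViT patch grid

# Stand-ins for per-patch vision-encoder tokens: [samples, patches, hidden]
tokens = torch.randn(num_samples, num_patches, hidden_dim)
# Hypothetical object mask: True where a patch overlaps an object region
object_mask = torch.rand(num_samples, num_patches) < 0.2
# Binary spatial-relation label, e.g. 0 = "A left of B", 1 = "A right of B"
labels = torch.randint(0, 2, (num_samples,))

# Inject a weak label-dependent signal into background patches so the probe has
# something to find; in the real setting this would come from the encoder itself.
signal = torch.randn(hidden_dim)
sign = (2 * labels.float() - 1).view(-1, 1, 1)
tokens = tokens + 0.05 * sign * signal * (~object_mask).unsqueeze(-1).float()

def background_pool(tokens, object_mask):
    """Mean-pool token features over background (non-object) patches only."""
    bg = (~object_mask).unsqueeze(-1).float()
    return (tokens * bg).sum(dim=1) / bg.sum(dim=1).clamp(min=1)

probe = nn.Linear(hidden_dim, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = background_pool(tokens, object_mask)
for step in range(200):
    loss = loss_fn(probe(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

accuracy = (probe(features).argmax(dim=-1) == labels).float().mean()
print(f"background-only probe accuracy: {accuracy:.2f}")
```

If spatial relations are decodable from background patches alone, as the paper argues, a probe of this kind would score well above chance on real encoder features; the synthetic signal above merely simulates that outcome.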