Vision encoders, not language models, are the primary source of spatial reasoning in VLMs. Spatial information is distributed globally across all image tokens, not just object regions, and enhancing this signal improves performance on spatial understanding tasks.
This paper examines how vision-language models handle spatial reasoning, that is, understanding where objects are and how they relate to one another. The researchers identify two mechanisms: the language model can process spatial relations on its own, but the vision encoder is the dominant source, encoding object layouts across the entire image, including background regions.
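To make the claim that background tokens carry layout information concrete, here is a minimal, hypothetical probing sketch (not the paper's code): it trains a linear probe on features mean-pooled over non-object patches of a vision encoder's token grid. All tensors here are synthetic stand-ins for real encoder outputs, object masks, and spatial-relation labels.

```python
# Hypothetical sketch: probe background-patch features for a spatial relation.
# Synthetic data stands in for real vision-encoder outputs and object masks.
import torch
import torch.nn as nn

torch.manual_seed(0)

num_samples, num_patches, hidden_dim = 512, 196, 768  # e.g. a 14x14 ViT patch grid

# Stand-ins for per-patch vision-encoder tokens: [samples, patches, hidden]
tokens = torch.randn(num_samples, num_patches, hidden_dim)
# Hypothetical object mask: True where a patch overlaps an object region
object_mask = torch.rand(num_samples, num_patches) < 0.2
# Binary spatial-relation label, e.g. 0 = "A left of B", 1 = "A right of B"
labels = torch.randint(0, 2, (num_samples,))

# Inject a weak label-dependent signal into background patches so the probe has
# something to find; in the real setting this would come from the encoder itself.
signal = torch.randn(hidden_dim)
sign = (2 * labels.float() - 1).view(-1, 1, 1)
tokens = tokens + 0.05 * sign * signal * (~object_mask).unsqueeze(-1).float()

def background_pool(tokens, object_mask):
    """Mean-pool token features over background (non-object) patches only."""
    bg = (~object_mask).unsqueeze(-1).float()
    return (tokens * bg).sum(dim=1) / bg.sum(dim=1).clamp(min=1)

probe = nn.Linear(hidden_dim, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = background_pool(tokens, object_mask)
for step in range(200):
    loss = loss_fn(probe(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

accuracy = (probe(features).argmax(dim=-1) == labels).float().mean()
print(f"background-only probe accuracy: {accuracy:.2f}")
```

If spatial relations are decodable from background patches alone, as the paper argues, a probe of this kind would score well above chance on real encoder features; the synthetic signal above merely simulates that outcome.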