Vision-language models need explicit metric reasoning to ground spatial language in 3D environments: decomposing queries into semantic and spatial components and combining them probabilistically improves grounding accuracy for robot navigation tasks.
This paper tackles the problem of robots understanding natural language commands that mix semantic meaning with precise spatial measurements, like 'go two meters right of the fridge'.
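A minimal sketch of the decompose-and-combine idea, under illustrative assumptions (the paper's actual formulation may differ): the semantic component is a per-detection confidence that a map location contains the referenced object ("fridge"), the spatial component is a Gaussian likelihood centred at the commanded metric offset ("two meters right", assumed here to be +x), and the two are combined by marginalising over uncertain detections. All positions, confidences, and the `sigma` width are made-up values for the demo.

```python
import numpy as np

# Candidate goal positions on a toy 2-D grid (0-5 m, 0.1 m resolution).
xs, ys = np.meshgrid(np.linspace(0, 5, 51), np.linspace(0, 5, 51))
grid = np.stack([xs, ys], axis=-1).reshape(-1, 2)

# Hypothetical fridge detections: (position, semantic confidence).
detections = [(np.array([1.0, 2.0]), 0.9),   # likely the real fridge
              (np.array([4.0, 4.0]), 0.3)]   # low-confidence distractor
offset = np.array([2.0, 0.0])  # "two meters right" (assumed +x = right)

def spatial_lik(points, anchor, offset, sigma=0.4):
    """Gaussian likelihood over goal points, peaked at anchor + offset."""
    d2 = np.sum((points - (anchor + offset)) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Probabilistic combination: weight each detection's spatial likelihood
# by its semantic confidence and sum (marginalise over detections).
score = sum(conf * spatial_lik(grid, pos, offset) for pos, conf in detections)
goal = grid[np.argmax(score)]
print(goal)  # → [3. 2.]: two meters right of the high-confidence detection
```

Multiplying (or weighting) the semantic and spatial terms, rather than picking the nearest detection first, lets a strong spatial fit rescue an ambiguous detection and vice versa, which is the intuition behind combining the components probabilistically.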