This work scales vision-language models from room-sized scenes to entire cities by handling 3D spatial relationships and by introducing a large, quality-controlled urban dataset, both essential steps toward AI systems capable of real-world spatial reasoning.
3DCity-LLM extends multimodal AI models to understand entire city-scale 3D environments rather than just individual objects. The system uses a three-part approach that analyzes individual objects, the relationships between them, and overall scenes, and it is trained on a new dataset of 1.2 million urban scenarios covering tasks ranging from object identification to city planning.
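The summary does not spell out how the three levels are combined, so the following is only a minimal PyTorch sketch of one plausible reading: a shared per-object point-cloud encoder, a pairwise relation head conditioned on 3D centroid offsets, and a pooled scene token, concatenated into a token sequence that a language model could consume as a visual prefix. All names here (PointEncoder, CitySceneTokenizer) and the point-cloud input format are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn


class PointEncoder(nn.Module):
    """Shared per-object point-cloud encoder (PointNet-style max pooling).
    Hypothetical component, not taken from the paper."""

    def __init__(self, in_dim: int = 3, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (num_objects, points_per_object, 3) -> (num_objects, out_dim)
        return self.mlp(pts).max(dim=1).values


class CitySceneTokenizer(nn.Module):
    """Three-level encoding: objects, pairwise relations, global scene.

    Produces a token sequence a language model could consume as a soft
    visual prefix. Illustrative sketch only; the paper's summary does
    not specify this architecture."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.object_enc = PointEncoder(out_dim=dim)
        # Relation head: concatenated object features plus the 3D offset
        # between object centroids, so spatial layout is encoded directly.
        self.relation_enc = nn.Sequential(
            nn.Linear(2 * dim + 3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Scene head: pools object features into a single global token.
        self.scene_enc = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, pts: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
        obj = self.object_enc(pts)                      # (N, dim)
        n = obj.size(0)
        # All unordered object pairs (i < j).
        idx_i, idx_j = torch.triu_indices(n, n, offset=1)
        offsets = centroids[idx_j] - centroids[idx_i]   # (P, 3)
        rel = self.relation_enc(
            torch.cat([obj[idx_i], obj[idx_j], offsets], dim=-1)
        )                                               # (P, dim)
        scene = self.scene_enc(obj.mean(dim=0, keepdim=True))  # (1, dim)
        # Token sequence: [scene token, object tokens, relation tokens].
        return torch.cat([scene, obj, rel], dim=0)


if __name__ == "__main__":
    n_objects, points_per_obj = 8, 128
    pts = torch.randn(n_objects, points_per_obj, 3)
    centroids = pts.mean(dim=1)
    tokens = CitySceneTokenizer()(pts, centroids)
    print(tokens.shape)  # torch.Size([37, 512]): 1 scene + 8 objects + 28 pairs
```

One caveat on this sketch: enumerating all object pairs is quadratic in the number of objects, which would be impractical at true city scale, so a real system would presumably sparsify relations (e.g., to spatial neighbors) or pool them before handing tokens to the language model.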