This work scales vision-language models from room-sized scenes to entire cities by handling 3D spatial relationships and by introducing a large, quality-controlled urban dataset, both essential steps toward AI systems capable of real-world spatial reasoning.
3DCity-LLM extends multimodal AI models to understand entire city-scale 3D environments rather than just individual objects. The system uses a three-part approach that analyzes individual objects, the relationships between them, and overall scenes, and it is trained on a new dataset of 1.2 million urban scenarios covering tasks ranging from object identification to city planning.
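The summary does not spell out how the three levels are combined, so the following is only a minimal PyTorch sketch of one plausible reading: a shared per-object point-cloud encoder, a pairwise relation head conditioned on 3D centroid offsets, and a pooled scene token, concatenated into a token sequence that a language model could consume as a visual prefix. All names here (PointEncoder, CitySceneTokenizer) and the point-cloud input format are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn


class PointEncoder(nn.Module):
    """Shared per-object point-cloud encoder (PointNet-style max pooling).
    Hypothetical component, not taken from the paper."""

    def __init__(self, in_dim: int = 3, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (num_objects, points_per_object, 3) -> (num_objects, out_dim)
        return self.mlp(pts).max(dim=1).values


class CitySceneTokenizer(nn.Module):
    """Three-level encoding: objects, pairwise relations, global scene.

    Produces a token sequence a language model could consume as a soft
    visual prefix. Illustrative sketch only; the paper's summary does
    not specify this architecture."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.object_enc = PointEncoder(out_dim=dim)
        # Relation head: concatenated object features plus the 3D offset
        # between object centroids, so spatial layout is encoded directly.
        self.relation_enc = nn.Sequential(
            nn.Linear(2 * dim + 3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Scene head: pools object features into a single global token.
        self.scene_enc = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, pts: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
        obj = self.object_enc(pts)                      # (N, dim)
        n = obj.size(0)
        # All unordered object pairs (i < j).
        idx_i, idx_j = torch.triu_indices(n, n, offset=1)
        offsets = centroids[idx_j] - centroids[idx_i]   # (P, 3)
        rel = self.relation_enc(
            torch.cat([obj[idx_i], obj[idx_j], offsets], dim=-1)
        )                                               # (P, dim)
        scene = self.scene_enc(obj.mean(dim=0, keepdim=True))  # (1, dim)
        # Token sequence: [scene token, object tokens, relation tokens].
        return torch.cat([scene, obj, rel], dim=0)


if __name__ == "__main__":
    n_objects, points_per_obj = 8, 128
    pts = torch.randn(n_objects, points_per_obj, 3)
    centroids = pts.mean(dim=1)
    tokens = CitySceneTokenizer()(pts, centroids)
    print(tokens.shape)  # torch.Size([37, 512]): 1 scene + 8 objects + 28 pairs
```

One caveat on this sketch: enumerating all object pairs is quadratic in the number of objects, which would be impractical at true city scale, so a real system would presumably sparsify relations (e.g., to spatial neighbors) or pool them before handing tokens to the language model.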