A single tokenizer can efficiently represent multi-view driving scenes for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.
DriveTok is a unified tokenizer for autonomous driving that converts multi-view camera images into a compact set of 3D scene tokens. Unlike existing tokenizers designed for single images, it jointly encodes multiple camera views while preserving semantic, geometric, and depth information, enabling both reconstruction and understanding of driving scenes.
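The pipeline described above, multi-view images in, a compact grid of discrete scene tokens out, can be illustrated with a toy vector-quantized encoder. Everything below (the class name, the patch projection, the codebook size) is a hypothetical sketch for intuition, not the actual DriveTok architecture:

```python
import numpy as np

class ToyMultiViewTokenizer:
    """Hypothetical sketch: fuse patches from all camera views into one
    shared spatial grid and quantize each cell to a discrete token.
    Names, shapes, and the fusion scheme are illustrative only."""

    def __init__(self, n_views, patch=8, dim=16, codebook_size=256, seed=0):
        rng = np.random.default_rng(seed)
        self.patch = patch
        # Linear projection: flattened RGB patches from all views -> token embedding
        self.proj = rng.standard_normal((n_views * patch * patch * 3, dim)) * 0.02
        # Learned codebook stands in for the discrete token vocabulary
        self.codebook = rng.standard_normal((codebook_size, dim))

    def encode(self, views):
        # views: (n_views, H, W, 3) -> one token id per spatial cell
        n, H, W, C = views.shape
        p = self.patch
        # Split each view into p x p patches, then group by spatial cell
        patches = views.reshape(n, H // p, p, W // p, p, C).transpose(1, 3, 0, 2, 4, 5)
        flat = patches.reshape(-1, n * p * p * C)   # one vector per cell, all views fused
        emb = flat @ self.proj                      # project into the token space
        # Vector quantization: snap each embedding to its nearest codebook entry
        dists = ((emb[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

tok = ToyMultiViewTokenizer(n_views=6)
imgs = np.random.default_rng(1).random((6, 32, 64, 3))   # 6 cameras, 32x64 RGB
ids = tok.encode(imgs)
print(ids.shape)  # (32,): (32//8) * (64//8) discrete tokens for the whole scene
```

The point of the sketch is the compression ratio: six full camera images collapse to a few dozen discrete ids, which downstream heads (reconstruction or understanding) could consume instead of raw pixels.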