A single tokenizer can efficiently represent multi-view driving scenes for both reconstruction tasks (RGB, depth) and understanding tasks (segmentation, 3D occupancy), making it practical for vision-language-action models in autonomous vehicles.
DriveTok is a unified tokenizer for autonomous driving that converts multi-view camera images into a compact set of 3D scene tokens. Unlike existing tokenizers designed for single images, it jointly encodes multiple camera views while preserving semantic, geometric, and depth information, enabling both reconstruction and understanding of driving scenes.
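The pipeline described above, multi-view images in, a compact grid of discrete scene tokens out, can be illustrated with a toy vector-quantized encoder. Everything below (the class name, the patch projection, the codebook size) is a hypothetical sketch for intuition, not the actual DriveTok architecture:

```python
import numpy as np

class ToyMultiViewTokenizer:
    """Hypothetical sketch: fuse patches from all camera views into one
    shared spatial grid and quantize each cell to a discrete token.
    Names, shapes, and the fusion scheme are illustrative only."""

    def __init__(self, n_views, patch=8, dim=16, codebook_size=256, seed=0):
        rng = np.random.default_rng(seed)
        self.patch = patch
        # Linear projection: flattened RGB patches from all views -> token embedding
        self.proj = rng.standard_normal((n_views * patch * patch * 3, dim)) * 0.02
        # Learned codebook stands in for the discrete token vocabulary
        self.codebook = rng.standard_normal((codebook_size, dim))

    def encode(self, views):
        # views: (n_views, H, W, 3) -> one token id per spatial cell
        n, H, W, C = views.shape
        p = self.patch
        # Split each view into p x p patches, then group by spatial cell
        patches = views.reshape(n, H // p, p, W // p, p, C).transpose(1, 3, 0, 2, 4, 5)
        flat = patches.reshape(-1, n * p * p * C)   # one vector per cell, all views fused
        emb = flat @ self.proj                      # project into the token space
        # Vector quantization: snap each embedding to its nearest codebook entry
        dists = ((emb[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)

tok = ToyMultiViewTokenizer(n_views=6)
imgs = np.random.default_rng(1).random((6, 32, 64, 3))   # 6 cameras, 32x64 RGB
ids = tok.encode(imgs)
print(ids.shape)  # (32,): (32//8) * (64//8) discrete tokens for the whole scene
```

The point of the sketch is the compression ratio: six full camera images collapse to a few dozen discrete ids, which downstream heads (reconstruction or understanding) could consume instead of raw pixels.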