You can prune half of the video tokens across both the vision and language components of a video VLM without complex mechanisms, gaining a significant (62%) speedup while maintaining accuracy, which makes video VLMs practical for real-world deployment.
This paper introduces a method for speeding up video understanding models by removing redundant visual information. The technique scores visual tokens by importance and prunes the least informative 50% across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering benchmarks.
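The paper's exact scoring mechanism isn't reproduced here; the sketch below is a minimal PyTorch illustration of the general pattern, assuming a per-token importance score (e.g., derived from attention weights) is already available. The `prune_tokens` helper is hypothetical, not the authors' implementation: it keeps the top 50% of tokens by score and drops the rest before they reach later layers.

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the top-`keep_ratio` fraction of visual tokens by importance score.

    tokens: (batch, num_tokens, dim) visual token embeddings
    scores: (batch, num_tokens) per-token importance (assumption: e.g. mean
            attention a token receives; the paper's actual score may differ)
    """
    batch, num_tokens, dim = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Indices of the highest-scoring tokens for each batch element
    keep_idx = scores.topk(num_keep, dim=1).indices        # (batch, num_keep)
    keep_idx = keep_idx.sort(dim=1).values                 # restore temporal order
    # Gather the surviving tokens
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return tokens.gather(1, idx)

# Example: 8 frames x 256 patches = 2048 video tokens, pruned to 1024
tokens = torch.randn(1, 2048, 1024)
scores = torch.rand(1, 2048)  # stand-in for a learned or attention-derived score
pruned = prune_tokens(tokens, scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([1, 1024, 1024])
```

Because half the tokens are removed, every subsequent attention layer operates on a sequence half as long, which is where the reported end-to-end speedup comes from.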