You can prune half of the video tokens across both the vision and language components of a video VLM without complex mechanisms, gaining a significant (62%) speedup while maintaining accuracy, which makes video VLMs practical for real-world deployment.
This paper introduces a method for speeding up video understanding models by removing redundant visual information. The technique scores visual tokens by importance and prunes the least informative 50% across the entire model architecture, achieving 62% faster processing with minimal accuracy loss on video question-answering benchmarks.
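The paper's exact scoring mechanism isn't reproduced here; the sketch below is a minimal PyTorch illustration of the general pattern, assuming a per-token importance score (e.g., derived from attention weights) is already available. The `prune_tokens` helper is hypothetical, not the authors' implementation: it keeps the top 50% of tokens by score and drops the rest before they reach later layers.

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the top-`keep_ratio` fraction of visual tokens by importance score.

    tokens: (batch, num_tokens, dim) visual token embeddings
    scores: (batch, num_tokens) per-token importance (assumption: e.g. mean
            attention a token receives; the paper's actual score may differ)
    """
    batch, num_tokens, dim = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Indices of the highest-scoring tokens for each batch element
    keep_idx = scores.topk(num_keep, dim=1).indices        # (batch, num_keep)
    keep_idx = keep_idx.sort(dim=1).values                 # restore temporal order
    # Gather the surviving tokens
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return tokens.gather(1, idx)

# Example: 8 frames x 256 patches = 2048 video tokens, pruned to 1024
tokens = torch.randn(1, 2048, 1024)
scores = torch.rand(1, 2048)  # stand-in for a learned or attention-derived score
pruned = prune_tokens(tokens, scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([1, 1024, 1024])
```

Because half the tokens are removed, every subsequent attention layer operates on a sequence half as long, which is where the reported end-to-end speedup comes from.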