By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.
This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires far more frames, and therefore more tokens, than is practical. The authors propose a query-aware frame selection method that picks the most informative frames by balancing two objectives, relevance to the question being asked and diversity of visual content, using a greedy algorithm with theoretical guarantees.
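The relevance-plus-diversity trade-off described above can be sketched with a simple greedy selection loop. This is not the paper's exact algorithm; it is a minimal MMR-style illustration, assuming frames and the question are represented as embedding vectors, with a hypothetical `lam` parameter weighting relevance against redundancy:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_frames(frame_embs, query_emb, k, lam=0.5):
    """Greedily pick k frame indices, trading off relevance to the
    query against redundancy with frames already selected.
    (Illustrative sketch only, not the paper's method.)"""
    selected = []
    remaining = list(range(len(frame_embs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(frame_embs[i], query_emb)
            # Redundancy: similarity to the closest already-selected frame.
            red = max((cosine(frame_embs[i], frame_embs[j])
                       for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degenerates to pure relevance ranking; lowering `lam` increasingly penalizes near-duplicate frames, which matters most under the tight frame budgets the paper targets.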