By selecting frames that are both relevant to the question and visually diverse, you can cut inference costs significantly while maintaining or improving accuracy on video QA tasks, especially when frame budgets are tight.
This paper tackles a key bottleneck in video understanding: processing long videos with vision-language models requires far more frames, and therefore more tokens, than is practical. The authors propose a query-aware frame selection method that picks the most informative frames by balancing two objectives, relevance to the question being asked and diversity of visual content, using a greedy algorithm with theoretical guarantees.
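The relevance-plus-diversity trade-off described above can be sketched with a simple greedy selection loop. This is not the paper's exact algorithm; it is a minimal MMR-style illustration, assuming frames and the question are represented as embedding vectors, with a hypothetical `lam` parameter weighting relevance against redundancy:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_frames(frame_embs, query_emb, k, lam=0.5):
    """Greedily pick k frame indices, trading off relevance to the
    query against redundancy with frames already selected.
    (Illustrative sketch only, not the paper's method.)"""
    selected = []
    remaining = list(range(len(frame_embs)))
    while remaining and len(selected) < k:
        def score(i):
            rel = cosine(frame_embs[i], query_emb)
            # Redundancy: similarity to the closest already-selected frame.
            red = max((cosine(frame_embs[i], frame_embs[j])
                       for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` this degenerates to pure relevance ranking; lowering `lam` increasingly penalizes near-duplicate frames, which matters most under the tight frame budgets the paper targets.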