Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu et al. | March 19, 2026 | arXiv

Key Takeaway

Decoupling semantic understanding from real-time perception, by parsing each query once and then matching embeddings continuously, addresses the efficiency-accuracy tradeoff in proactive video understanding systems.

Summary

Em-Garde is a framework for proactive streaming video understanding that responds to user queries efficiently. Instead of running full query understanding on every frame, it converts each user query into visual proposals once and then matches those proposals against the video stream using fast embedding comparisons. The authors report better accuracy and speed than existing approaches.
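The propose-then-match split described above can be illustrated with a small sketch. This is not the authors' implementation: `propose`, `match_stream`, the `threshold` value, and the three-dimensional toy vectors are all hypothetical stand-ins. In a real system the proposals would presumably come from a vision-language model and the frame vectors from a visual embedding model; here both are hard-coded so the example runs with the standard library alone.

```python
import math
from typing import Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propose(query: str) -> List[Vector]:
    """Hypothetical one-time step: turn a query into proposal embeddings.

    Assumption: a lookup table stands in for the expensive model call.
    """
    toy_proposals = {
        "a person waves": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    }
    return toy_proposals.get(query, [])


def match_stream(
    frames: Iterable[Vector],
    proposals: List[Vector],
    threshold: float = 0.9,  # assumed value, chosen only for illustration
) -> Iterator[Tuple[int, float]]:
    """Cheap per-frame step: yield (frame index, score) when a proposal matches."""
    for index, frame in enumerate(frames):
        best = max((cosine(frame, p) for p in proposals), default=0.0)
        if best >= threshold:
            yield index, best


# The query is parsed once, outside the streaming loop.
proposals = propose("a person waves")

# Toy frame embeddings; only the last one lies close to a proposal.
stream = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.0]]
triggers = list(match_stream(stream, proposals))
```

The point of the design is that the costly semantic step sits outside the loop, so the per-frame cost reduces to a handful of dot products regardless of how complex the query was.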

multimodal, efficiency, reasoning

Key Terms

streaming-inference, embedding-based-matching, proposal-generation, vision-language-model