Decoupling semantic understanding from real-time perception, by parsing each query once and then matching embeddings continuously, addresses the efficiency-accuracy tradeoff in proactive video understanding systems.
Em-Garde is a framework for understanding streaming video that responds to user queries efficiently. Instead of checking every frame, it converts user questions into visual proposals and matches them against the video stream using fast embedding comparisons, achieving better accuracy and speed than existing approaches.
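
The parse-once, match-continuously pattern described above can be illustrated with a minimal sketch. It is not Em-Garde's actual implementation: the function names (`cosine`, `match_stream`), the use of cosine similarity, the max-over-proposals rule, and the `threshold` value are all assumptions made for illustration. The proposal and frame embeddings are toy vectors standing in for the outputs of a real query parser and visual encoder.

```python
import math
from typing import Iterable, Iterator, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_stream(
    proposals: Sequence[Vector],
    frames: Iterable[Vector],
    threshold: float = 0.8,  # assumed value, not taken from the source
) -> Iterator[Tuple[int, float]]:
    """Yield (frame_index, score) whenever a frame matches any proposal.

    The proposals are computed once, before streaming starts; each frame
    then costs only a handful of similarity comparisons.
    """
    for idx, frame in enumerate(frames):
        score = max((cosine(frame, p) for p in proposals), default=0.0)
        if score >= threshold:
            yield idx, score


# Toy data: two proposal embeddings derived (hypothetically) from one query,
# and four frame embeddings arriving from the stream.
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [
    [0.0, 0.0, 1.0],  # unrelated to both proposals
    [0.9, 0.1, 0.0],  # close to the first proposal
    [0.5, 0.5, 0.7],  # weak match only
    [0.0, 1.0, 0.0],  # identical to the second proposal
]
hits = list(match_stream(proposals, frames))
print([idx for idx, _ in hits])
```

In this sketch the expensive step (turning the question into proposals) happens outside the loop, so per-frame work stays proportional to the number of proposals rather than to the cost of a full model call.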