By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.
VideoAtlas is a system for efficient long-video understanding: it represents a video as a hierarchical grid that can be zoomed into recursively, rather than converting the video to text.
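To make the logarithmic-scaling claim concrete, here is a minimal sketch of recursive zooming over a hierarchical grid. It is not VideoAtlas's actual implementation. The function names, the 64-cell grid (8 x 8), the 1-second stopping span, and the `score` callback that stands in for whatever picks the most relevant cell are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def split_into_cells(interval: Interval, cells: int) -> List[Interval]:
    """Divide an interval into `cells` equal sub-intervals (one grid level)."""
    start, end = interval
    step = (end - start) / cells
    return [(start + i * step, start + (i + 1) * step) for i in range(cells)]


def zoom_path(
    duration_sec: float,
    cells: int,
    min_span_sec: float,
    score: Callable[[Interval], float],
) -> List[Interval]:
    """Zoom into the highest-scoring cell until its span is <= min_span_sec.

    Returns every interval visited, starting with the whole video.
    Each step shrinks the span by a factor of `cells`, so the number of
    steps grows with log(duration), not with duration.
    """
    interval: Interval = (0.0, duration_sec)
    path = [interval]
    while interval[1] - interval[0] > min_span_sec:
        candidates = split_into_cells(interval, cells)
        interval = max(candidates, key=score)  # hypothetical relevance choice
        path.append(interval)
    return path


def make_oracle(target_sec: float) -> Callable[[Interval], float]:
    """Toy scorer (assumption): 1.0 for the cell containing the target time."""
    return lambda iv: 1.0 if iv[0] <= target_sec < iv[1] else 0.0


if __name__ == "__main__":
    for hours in (1, 10):
        path = zoom_path(hours * 3600.0, cells=64, min_span_sec=1.0,
                         score=make_oracle(1234.0))
        print(f"{hours}-hour video: {len(path) - 1} zoom steps")
```

Under these assumptions, a 1-hour video takes two zoom steps and a 10-hour video takes three: a tenfold increase in duration adds a single step. A real system would replace the toy oracle with a model that inspects each grid level, but the depth of the search would still be bounded by the same logarithm.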