VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain et al. | March 18, 2026 | arXiv

Key Takeaway

By treating video as a navigable hierarchical structure instead of converting it to text, you can process 10-hour videos with minimal accuracy loss while using compute that scales logarithmically with duration.

Summary

VideoAtlas is a system for understanding long videos efficiently by representing them as a hierarchical grid that can be zoomed into recursively, rather than converting video to text.
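
To see why recursive zooming over a grid could give logarithmic scaling, here is a minimal sketch. It is not the VideoAtlas implementation. The function name `zoom_to_moment`, the parameters `grid_cells` and `min_span_s`, and the choice of 8 cells per level are all hypothetical. The "choose a cell" step is a stand-in that already knows the target timestamp, where a real system would use a model's judgment over the grid.

```python
def zoom_to_moment(duration_s, target_s, grid_cells=8, min_span_s=1.0):
    """Narrow [0, duration_s) to a span of at most min_span_s containing target_s.

    Illustrative only: each level splits the current span into grid_cells
    equal cells, inspects one representative frame per cell, and zooms into
    the chosen cell.
    """
    start, end = 0.0, float(duration_s)
    depth = 0
    frames_viewed = 0
    while end - start > min_span_s:
        cell = (end - start) / grid_cells
        # Stand-in for the model's choice: the cell that contains the target.
        index = min(int((target_s - start) / cell), grid_cells - 1)
        frames_viewed += grid_cells  # one frame inspected per cell at this level
        start, end = start + index * cell, start + (index + 1) * cell
        depth += 1
    return start, end, depth, frames_viewed


one_hour = zoom_to_moment(3_600, target_s=1_234.5)
ten_hours = zoom_to_moment(36_000, target_s=12_345.6)
print(one_hour[2], one_hour[3])    # zoom levels and frames viewed for 1 hour
print(ten_hours[2], ten_hours[3])  # zoom levels and frames viewed for 10 hours
```

Under these assumptions, a 10x longer video needs only two additional zoom levels (6 instead of 4), since the number of levels grows with the logarithm of duration in base `grid_cells`.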

efficiency multimodal agents

Key Terms

hierarchical-reasoning long-context-handling markov-decision-process multimodal-input inference-time-compute