Test-time training, in which model parameters are updated on the fly during inference, enables better spatial reasoning over video by letting the model continuously organize and retain 3D spatial information rather than relying on a fixed context window.
This paper introduces Spatial-TTT, a system that enables models to build and maintain 3D spatial understanding from continuous video streams by updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.
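To make the core idea concrete, below is a minimal sketch of test-time training on a streaming video, under assumptions of my own: a frozen backbone encodes each frame, and a small adapter module is updated online with a self-supervised masked-reconstruction loss standing in for whatever spatial prediction objective Spatial-TTT actually uses. The names (`FastWeightAdapter`, `run_stream`), the loss, and the hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class FastWeightAdapter(nn.Module):
    """Hypothetical lightweight module whose weights are adapted at test time."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.proj(x)

def test_time_update(adapter, frame_features, lr=1e-3, steps=1):
    """One test-time training step on the current frame's features.

    Uses a masked-reconstruction loss as an assumed stand-in for the
    paper's spatial prediction mechanism.
    """
    opt = torch.optim.SGD(adapter.parameters(), lr=lr)
    for _ in range(steps):
        # Mask half of the feature entries and ask the adapter to fill them in.
        mask = (torch.rand_like(frame_features) > 0.5).float()
        pred = adapter(frame_features * mask)
        loss = ((pred - frame_features) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapter

def run_stream(backbone, adapter, frames):
    """Streaming loop: the backbone stays frozen; only the adapter is updated,
    so spatial state accumulates in its weights rather than in a growing
    context window."""
    outputs = []
    for frame in frames:
        with torch.no_grad():
            feats = backbone(frame)            # per-frame features, shape (N, dim)
        adapter = test_time_update(adapter, feats.detach())
        with torch.no_grad():
            outputs.append(adapter(feats))     # spatially adapted features for downstream use
    return outputs
```

The design point this sketch illustrates is that memory of the scene lives in the adapter's weights, refreshed by a few gradient steps per frame, rather than in an ever-longer token context.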