Test-time training, in which model parameters are updated on the fly during inference, enables better spatial reasoning over video by letting the model continuously organize and retain 3D spatial information rather than relying on a fixed context window.
This paper introduces Spatial-TTT, a system that enables models to build and maintain 3D spatial understanding from continuous video streams by updating their internal parameters during inference. It combines efficient video processing with a spatial prediction mechanism and specialized training data to maintain accurate spatial understanding over long videos.
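To make the core idea concrete, below is a minimal sketch of test-time training on a streaming video, under assumptions of my own: a frozen backbone encodes each frame, and a small adapter module is updated online with a self-supervised masked-reconstruction loss standing in for whatever spatial prediction objective Spatial-TTT actually uses. The names (`FastWeightAdapter`, `run_stream`), the loss, and the hyperparameters are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class FastWeightAdapter(nn.Module):
    """Hypothetical lightweight module whose weights are adapted at test time."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.proj(x)

def test_time_update(adapter, frame_features, lr=1e-3, steps=1):
    """One test-time training step on the current frame's features.

    Uses a masked-reconstruction loss as an assumed stand-in for the
    paper's spatial prediction mechanism.
    """
    opt = torch.optim.SGD(adapter.parameters(), lr=lr)
    for _ in range(steps):
        # Mask half of the feature entries and ask the adapter to fill them in.
        mask = (torch.rand_like(frame_features) > 0.5).float()
        pred = adapter(frame_features * mask)
        loss = ((pred - frame_features) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapter

def run_stream(backbone, adapter, frames):
    """Streaming loop: the backbone stays frozen; only the adapter is updated,
    so spatial state accumulates in its weights rather than in a growing
    context window."""
    outputs = []
    for frame in frames:
        with torch.no_grad():
            feats = backbone(frame)            # per-frame features, shape (N, dim)
        adapter = test_time_update(adapter, feats.detach())
        with torch.no_grad():
            outputs.append(adapter(feats))     # spatially adapted features for downstream use
    return outputs
```

The design point this sketch illustrates is that memory of the scene lives in the adapter's weights, refreshed by a few gradient steps per frame, rather than in an ever-longer token context.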