This paper reveals how video diffusion models actually reason: not frame by frame, but through iterative refinement across denoising steps, exploring multiple candidate solutions early and converging on an answer later. Because different random seeds explore different candidates, ensembling outputs across seeds exploits this mechanism directly.
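The seed-ensembling idea can be sketched as a majority vote over per-seed outputs. This is a toy illustration, not the paper's implementation: `sample_answer` is a hypothetical stand-in for running the full denoising loop with a given seed and decoding an answer from the generated video.

```python
from collections import Counter

def sample_answer(seed: int) -> str:
    # Hypothetical stand-in for a video diffusion sampler: in practice this
    # would run the denoising loop seeded with `seed` and decode an answer
    # from the generated video. Toy rule: some seeds land on the wrong
    # candidate ("B"), most converge to the right one ("A").
    return "B" if seed % 4 == 0 else "A"

def ensemble_answer(seeds) -> str:
    # Majority vote across seeds: each seed explores a different candidate
    # solution early in denoising, so aggregating independent runs is more
    # reliable than trusting any single run.
    votes = Counter(sample_answer(s) for s in seeds)
    return votes.most_common(1)[0][0]

print(ensemble_answer(range(16)))  # 12 seeds vote "A", 4 vote "B" -> "A"
```

With a real model, the per-seed answer would come from whatever task-specific decoding the paper uses (e.g., reading the final frame), while the voting logic stays the same.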