Mode Seeking meets Mean Seeking for Fast Long Video Generation

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang et al. | February 27, 2026 | arXiv

Key Takeaway

Decouple learning long-term coherence from local quality to generate minute-scale videos without needing massive amounts of long-form training data.

Summary

This paper addresses a key problem in video generation: producing minute-scale videos that are both sharp and coherent. It trains two separate components: one learns long-term narrative structure from scarce long videos, while the other transfers local visual quality from abundant short videos. As a result, the model can generate minute-long videos that look crisp and stay consistent throughout.
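The terms "mode seeking" and "mean seeking" in the title have a standard meaning in statistics. Minimizing the forward KL divergence KL(p || q) tends to spread q over all of p (mean seeking), while minimizing the reverse KL divergence KL(q || p) tends to lock q onto a single mode (mode seeking). The toy sketch below illustrates only that general distinction, by fitting one Gaussian to a bimodal target with a grid search. It is not the paper's method, and how the paper assigns each objective to its two components is not stated in this summary. The target distribution, the grid ranges, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical toy target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
MODES, MODE_STD = (-3.0, 3.0), 0.5


def log_normal(x, mu, sigma):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def log_target(x):
    """Log-density of the bimodal target, computed with a stable log-sum-exp."""
    a, b = (log_normal(x, m, MODE_STD) + math.log(0.5) for m in MODES)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


# Midpoint integration grid on [-8, 8].
DX = 0.04
XS = [-8.0 + DX * (i + 0.5) for i in range(400)]
LOG_P = [log_target(x) for x in XS]


def kl_pair(mu, sigma):
    """Return (forward KL(p||q), reverse KL(q||p)) for q = N(mu, sigma^2)."""
    fwd = rev = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal(x, mu, sigma)
        fwd += math.exp(lp) * (lp - lq) * DX  # expectation under the target p
        rev += math.exp(lq) * (lq - lp) * DX  # expectation under the model q
    return fwd, rev


def fit():
    """Grid-search (mu, sigma) separately for each divergence."""
    candidates = [(-4.0 + 0.25 * i, 0.25 * j) for i in range(33) for j in range(1, 17)]
    scores = {c: kl_pair(*c) for c in candidates}
    mean_seeking = min(candidates, key=lambda c: scores[c][0])
    mode_seeking = min(candidates, key=lambda c: scores[c][1])
    return mean_seeking, mode_seeking


mean_fit, mode_fit = fit()
print("forward KL (mean seeking) fit, (mu, sigma):", mean_fit)
print("reverse KL (mode seeking) fit, (mu, sigma):", mode_fit)
```

In this toy setting, the forward-KL fit should sit between the two modes with a wide spread, covering everything but matching neither mode sharply, while the reverse-KL fit should collapse onto one mode with a narrow spread. The two modes are symmetric, so which one the reverse-KL fit picks is arbitrary.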

training efficiency architecture

Key Terms

diffusion-transformer flow-matching reverse-kl-divergence knowledge-distillation