Decouple learning long-term coherence from local quality to generate minute-scale videos without needing massive amounts of long-form training data.
This paper solves a key problem in video generation: making long videos (minutes) that are both sharp and coherent. The trick is training two separate components—one learns long-term story structure from rare long videos, while another copies local quality from abundant short videos. This lets the model generate minute-long videos that look crisp and stay consistent throughout.