Pairing dense video prediction models with sparse, semantically rich vision-language reasoning improves long-horizon forecasting: VLMs provide the 'what' and 'why', while dense models provide the 'how'.
This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.
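To make the dual-pathway idea concrete, here is a minimal sketch of how a dense JEPA-style latent predictor and a sparse VLM semantic embedding could be fused to predict the next frame latent. The module names, dimensions, and the concatenation-plus-MLP fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: a hypothetical dense-pathway predictor fused with a precomputed
# VLM embedding. All names and design choices here are assumptions for
# illustration, not the method described in the paper.
import torch
import torch.nn as nn


class DensePredictor(nn.Module):
    """Dense pathway: predicts the next frame latent from recent frame latents."""

    def __init__(self, latent_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, frame_latents: torch.Tensor) -> torch.Tensor:
        # frame_latents: (batch, time, latent_dim); mean-pool over time for simplicity.
        return self.net(frame_latents.mean(dim=1))


class SemanticFusion(nn.Module):
    """Fuses the dense motion prediction with a sparse VLM semantic embedding."""

    def __init__(self, latent_dim: int, vlm_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(latent_dim + vlm_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, dense_pred: torch.Tensor, vlm_embedding: torch.Tensor) -> torch.Tensor:
        # Concatenate the two pathways and project back to the frame-latent space.
        return self.fuse(torch.cat([dense_pred, vlm_embedding], dim=-1))


if __name__ == "__main__":
    latent_dim, vlm_dim = 256, 768
    dense = DensePredictor(latent_dim)
    fusion = SemanticFusion(latent_dim, vlm_dim)

    frame_latents = torch.randn(2, 8, latent_dim)  # 8 past frame latents, batch of 2
    vlm_embedding = torch.randn(2, vlm_dim)        # e.g. an embedding of a goal/caption

    next_latent = fusion(dense(frame_latents), vlm_embedding)
    print(next_latent.shape)  # torch.Size([2, 256])
```

In this sketch the VLM pathway enters only as a single conditioning vector per clip (the 'what' and 'why'), while the dense predictor carries the frame-to-frame dynamics (the 'how'); the actual paper may fuse the pathways differently.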