Pairing dense video prediction models with sparse, semantically rich vision-language reasoning improves long-horizon forecasting: VLMs provide the 'what' and 'why', while dense models provide the 'how'.
This paper combines two approaches to video prediction: dense frame-by-frame modeling (JEPA) for capturing fine-grained motion, and vision-language models (VLMs) for long-horizon semantic understanding. By using both pathways together, the system predicts future video frames better than either approach alone, especially for complex hand manipulation tasks.
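To make the dual-pathway idea concrete, here is a minimal sketch of how a dense JEPA-style latent predictor and a sparse VLM semantic embedding could be fused to predict the next frame latent. The module names, dimensions, and the concatenation-plus-MLP fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch only: a hypothetical dense-pathway predictor fused with a precomputed
# VLM embedding. All names and design choices here are assumptions for
# illustration, not the method described in the paper.
import torch
import torch.nn as nn


class DensePredictor(nn.Module):
    """Dense pathway: predicts the next frame latent from recent frame latents."""

    def __init__(self, latent_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, frame_latents: torch.Tensor) -> torch.Tensor:
        # frame_latents: (batch, time, latent_dim); mean-pool over time for simplicity.
        return self.net(frame_latents.mean(dim=1))


class SemanticFusion(nn.Module):
    """Fuses the dense motion prediction with a sparse VLM semantic embedding."""

    def __init__(self, latent_dim: int, vlm_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(latent_dim + vlm_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, dense_pred: torch.Tensor, vlm_embedding: torch.Tensor) -> torch.Tensor:
        # Concatenate the two pathways and project back to the frame-latent space.
        return self.fuse(torch.cat([dense_pred, vlm_embedding], dim=-1))


if __name__ == "__main__":
    latent_dim, vlm_dim = 256, 768
    dense = DensePredictor(latent_dim)
    fusion = SemanticFusion(latent_dim, vlm_dim)

    frame_latents = torch.randn(2, 8, latent_dim)  # 8 past frame latents, batch of 2
    vlm_embedding = torch.randn(2, vlm_dim)        # e.g. an embedding of a goal/caption

    next_latent = fusion(dense(frame_latents), vlm_embedding)
    print(next_latent.shape)  # torch.Size([2, 256])
```

In this sketch the VLM pathway enters only as a single conditioning vector per clip (the 'what' and 'why'), while the dense predictor carries the frame-to-frame dynamics (the 'how'); the actual paper may fuse the pathways differently.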