Treating motion as a continuous, first-class modality rather than discretizing it lets a single model handle motion-text-image tasks end to end, improving performance on cross-modal tasks such as describing motion or editing poses from text.
UniMotion is the first unified model that both understands and generates human motion, text, and images in one system. Instead of converting motion into discrete tokens, which discards fine-grained information, it treats motion as a continuous stream, much like video, and feeds it through a shared language-model backbone with alignment techniques that tie the motion representation to the model's visual and text understanding.
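As a rough illustration of the continuous-motion idea, a motion frame can be linearly projected into the backbone's embedding space and interleaved with text-token embeddings, so the model never quantizes motion into a discrete codebook. This is a minimal sketch with invented names and dimensions (`D_MOTION`, `D_MODEL`, `embed_motion`); UniMotion's actual adapter design may differ.

```python
import numpy as np

# Illustrative sketch only: names and dimensions are assumptions,
# not UniMotion's actual API.
rng = np.random.default_rng(0)

D_MOTION = 66   # e.g. 22 joints x 3D coordinates per frame (assumed)
D_MODEL = 512   # shared language-model embedding width (assumed)

# A learned linear adapter would map each raw, continuous motion frame
# into the backbone's embedding space instead of tokenizing it.
W_adapter = rng.normal(0, 0.02, size=(D_MOTION, D_MODEL))

def embed_motion(frames: np.ndarray) -> np.ndarray:
    """Project continuous motion frames (T, D_MOTION) to (T, D_MODEL)."""
    return frames @ W_adapter

# Toy inputs: 8 motion frames and 5 text-token embeddings.
motion = rng.normal(size=(8, D_MOTION))
text_embeds = rng.normal(size=(5, D_MODEL))

# The backbone sees one interleaved sequence of continuous vectors,
# so no information is lost to a discrete motion vocabulary.
sequence = np.concatenate([text_embeds, embed_motion(motion)], axis=0)
print(sequence.shape)  # (13, 512)
```

The design point this sketch captures is that discretization (a motion VQ codebook) is replaced by a continuous projection, so the language model consumes motion the same way it consumes any other embedded sequence.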