Treating motion as a continuous, first-class modality rather than discretizing it lets a single model handle motion-text-image tasks end to end, improving performance on cross-modal tasks such as describing motion or editing poses from text.
UniMotion is the first unified model that both understands and generates human motion, text, and images in one system. Instead of converting motion into discrete tokens, which discards fine-grained information, it treats motion as a continuous stream, much like video, and feeds it through a shared language-model backbone with alignment techniques that tie the motion representation to the model's visual and text understanding.
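As a rough illustration of the continuous-motion idea, a motion frame can be linearly projected into the backbone's embedding space and interleaved with text-token embeddings, so the model never quantizes motion into a discrete codebook. This is a minimal sketch with invented names and dimensions (`D_MOTION`, `D_MODEL`, `embed_motion`); UniMotion's actual adapter design may differ.

```python
import numpy as np

# Illustrative sketch only: names and dimensions are assumptions,
# not UniMotion's actual API.
rng = np.random.default_rng(0)

D_MOTION = 66   # e.g. 22 joints x 3D coordinates per frame (assumed)
D_MODEL = 512   # shared language-model embedding width (assumed)

# A learned linear adapter would map each raw, continuous motion frame
# into the backbone's embedding space instead of tokenizing it.
W_adapter = rng.normal(0, 0.02, size=(D_MOTION, D_MODEL))

def embed_motion(frames: np.ndarray) -> np.ndarray:
    """Project continuous motion frames (T, D_MOTION) to (T, D_MODEL)."""
    return frames @ W_adapter

# Toy inputs: 8 motion frames and 5 text-token embeddings.
motion = rng.normal(size=(8, D_MOTION))
text_embeds = rng.normal(size=(5, D_MODEL))

# The backbone sees one interleaved sequence of continuous vectors,
# so no information is lost to a discrete motion vocabulary.
sequence = np.concatenate([text_embeds, embed_motion(motion)], axis=0)
print(sequence.shape)  # (13, 512)
```

The design point this sketch captures is that discretization (a motion VQ codebook) is replaced by a continuous projection, so the language model consumes motion the same way it consumes any other embedded sequence.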