End-to-End Training for Unified Tokenization and Latent Denoising

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman et al. | March 23, 2026 | arXiv

Key Takeaway

An image tokenizer and an image generator can be trained together from scratch as a single model with shared weights, which simplifies the pipeline and reduces training complexity while maintaining quality.

Summary

This paper proposes UNITE, an end-to-end method that trains tokenization and latent diffusion together in a single training stage, making image generation models more efficient to train.
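
As a rough illustration of what "one stage, shared weights" can mean, the toy sketch below optimizes a reconstruction loss and a latent-denoising loss together through a single weight matrix. Everything in it is an assumption made for illustration and is not taken from the paper: the linear tied-weight model, the names joint_loss, numerical_grad and train, the loss weighting lam, the noise level sigma, and the use of finite-difference gradients in place of backpropagation.

```python
import numpy as np

def joint_loss(W, x, eps, sigma=0.1, lam=0.5):
    """Toy joint objective: both terms depend on the same weights W (hypothetical setup)."""
    z = x @ W                        # "tokenizer" / encoder: data -> latent
    x_hat = z @ W.T                  # decoder with tied weights: latent -> data
    recon = np.mean((x_hat - x) ** 2)
    z_noisy = z + sigma * eps        # corrupt the latent with Gaussian noise
    z_pred = (z_noisy @ W.T) @ W     # "denoiser": reuse decoder, then encoder
    denoise = np.mean((z_pred - z) ** 2)  # note: no stop-gradient on the target z
    return recon + lam * denoise

def numerical_grad(f, W, h=1e-5):
    """Central finite differences; adequate only because W is tiny here."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (f(W + step) - f(W - step)) / (2 * h)
    return grad

def train(steps=300, lr=0.02, seed=0):
    """Single training loop: one gradient step updates the shared weights for both losses."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(16, 4))     # stand-in "images": 16 samples, 4 dims
    eps = rng.normal(size=(16, 2))   # noise fixed once so the toy is deterministic
    W = 0.1 * rng.normal(size=(4, 2))
    f = lambda W_: joint_loss(W_, x, eps)
    initial = f(W)
    for _ in range(steps):
        W = W - lr * numerical_grad(f, W)
    return initial, f(W)
```

Calling train() returns the joint loss before and after training. In a real system the noise would be resampled every step and the model would be a deep network, so this only shows the shape of a single-stage objective, not the method itself.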

architecture training efficiency

Key Terms

latent-diffusion-models tokenization autoencoder weight-sharing latent-space