You can train tokenization and image generation together from scratch using a single model with shared weights, simplifying the pipeline and reducing training complexity while maintaining quality.
This paper proposes UNITE, a new way to train image generation models more efficiently by combining tokenization and diffusion in a single training stage.