Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen | May 26, 2026 | arXiv

Key Takeaway

You can guide diffusion models to generate specific images by using learned representations as conditioning signals, avoiding the need for expensive annotated datasets while maintaining smooth, interpretable control.

Summary

This paper shows how to control image generation in diffusion models by conditioning them on representations from self-supervised models instead of requiring text or semantic annotations. The approach discovers interpretable directions in the representation space that let you smoothly control what gets generated.
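To make the idea of "directions in representation space" concrete, here is a minimal sketch and not the paper's implementation. It assumes the representations are plain vectors and uses PCA as a stand-in for finding candidate directions, since this summary does not say how the paper discovers them. The random array `reps` and the helper names `principal_directions` and `traverse` are hypothetical, and the diffusion model itself is left out.

```python
import numpy as np

def principal_directions(reps, k):
    """Return the top-k unit-norm PCA directions of an (n, d) array.

    Assumption: PCA is only an illustrative way to pick directions;
    the paper may discover its directions differently.
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def traverse(rep, direction, alphas):
    """Shift one representation along a direction by each step size in alphas."""
    return np.stack([rep + a * direction for a in alphas])

rng = np.random.default_rng(0)
# Placeholder data standing in for self-supervised encoder outputs.
reps = rng.normal(size=(256, 32))

directions = principal_directions(reps, k=4)
alphas = np.linspace(-3.0, 3.0, 7)

# Each row is a candidate conditioning vector; a representation-conditioned
# diffusion model (not shown) would generate one image per row.
conditions = traverse(reps[0], directions[0], alphas)
```

Because each direction has unit norm, stepping `alphas` evenly moves the conditioning vector evenly along that direction, which is the property that would make the resulting change in the generated image smooth rather than abrupt.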

architecture multimodal efficiency

Key Terms

diffusion-models, self-supervised-pretraining, conditioning, disentanglement, semantic-representation