You can steer diffusion models toward specific images by conditioning them on learned representations, which avoids the need for expensive annotated datasets while keeping control smooth and interpretable.
This paper shows how to control image generation in diffusion models by conditioning them on representations from self-supervised models rather than on text or semantic annotations. The approach discovers interpretable directions in the representation space; moving a representation along one of these directions changes the generated image smoothly.
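To make the idea of "directions in the representation space" concrete, here is a minimal sketch. It is not the paper's method: the summary does not say how directions are discovered, so the sketch assumes plain PCA as a stand-in, uses random vectors in place of real self-supervised encoder outputs, and the function names `principal_directions` and `traverse` are hypothetical. It shows only how a representation could be shifted along a direction to produce a sequence of conditioning vectors.

```python
import numpy as np


def principal_directions(reps, k=2):
    """Return the top-k principal directions (unit vectors) of a set of representations.

    Assumption: PCA is used here purely for illustration; the paper may
    discover its directions differently.
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def traverse(rep, direction, alphas):
    """Shift one representation along a direction by each step size in alphas."""
    return np.stack([rep + a * direction for a in alphas])


rng = np.random.default_rng(0)
# Stand-in for encoder outputs: 200 samples, 16 dimensions,
# with most of the variance placed on axis 0.
reps = rng.normal(size=(200, 16)) * np.array([5.0] + [1.0] * 15)

directions = principal_directions(reps, k=2)
conds = traverse(reps[0], directions[0], alphas=np.linspace(-3.0, 3.0, 7))
# Each row of `conds` would be passed to the diffusion model as its
# conditioning input, one generated image per row.
```

In a real pipeline, `reps` would come from a pretrained self-supervised encoder applied to unlabeled images, and each row of `conds` would replace the text or class embedding that a conventionally conditioned diffusion model expects.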