You can increase diversity in generated images by applying repulsion forces in the transformer's attention channels during generation, without expensive optimization and without introducing visual artifacts.
This paper tackles the problem of text-to-image diffusion models producing visually similar outputs for the same prompt. The authors propose applying 'repulsion' in the attention mechanism during image generation, pushing each sample's intermediate features away from the others in the batch so outputs diversify while quality and semantic accuracy are preserved.
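To make the idea concrete, here is a minimal, hypothetical sketch of what a "repulsion" step on attention features could look like. It is not the paper's implementation: the function name, the choice of an RBF kernel, and the `strength` and `bandwidth` parameters are all assumptions for illustration. Each sample's flattened attention features are nudged along the direction that decreases its kernel similarity to the other samples in the batch, so near-duplicate generations are pushed apart.

```python
import numpy as np

def repel_attention_features(feats, strength=0.1, bandwidth=1.0):
    """Hypothetical repulsion sketch (not the paper's exact method).

    feats: (batch, dim) array of flattened attention features, one row
    per image being generated. Each row is nudged along the negative
    gradient of its summed RBF-kernel similarity to the other rows,
    which increases pairwise distances between similar samples.
    """
    diffs = feats[:, None, :] - feats[None, :, :]   # (B, B, D) pairwise x_i - x_j
    sq = (diffs ** 2).sum(-1)                       # (B, B) squared distances
    k = np.exp(-sq / (2 * bandwidth ** 2))          # RBF similarity in [0, 1]
    np.fill_diagonal(k, 0.0)                        # a sample does not repel itself
    # d k_ij / d x_i = -k_ij * (x_i - x_j) / h^2, so stepping along
    # +k_ij * (x_i - x_j) / h^2 reduces similarity (repulsion).
    push = (k[..., None] * diffs).sum(axis=1) / bandwidth ** 2
    return feats + strength * push

# Two nearby samples are pushed further apart; distant ones barely move.
feats = np.array([[0.0, 0.0], [1.0, 0.0]])
out = repel_attention_features(feats, strength=0.5)
```

In a diffusion pipeline this kind of step would run inside selected attention layers at each denoising step, which is why it avoids a separate, expensive optimization loop over the final images.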