Diffusion models solve complex reasoning tasks more accurately when the language encoder reasons iteratively and updates its guidance throughout generation, rather than encoding the instruction once at the start.
This paper improves how diffusion models solve complex reasoning tasks by having the language-model encoder reason step by step. Instead of encoding the instruction once, the system iteratively refines the language model's internal reasoning and feeds the updated guidance progressively into the image generation process, reaching 92% accuracy on spatial reasoning tasks such as mazes and puzzles.
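The contrast between encoding once and refining guidance during generation can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: `encode`, `refine`, `denoise_step`, and `generate` are hypothetical stand-ins, the "encoder" is a seeded random vector rather than a language model, and the update rules are arbitrary placeholders chosen so the code runs with the standard library alone.

```python
import math
import random


def encode(instruction, dim=4):
    """Hypothetical stand-in for a language encoder: text -> fixed-size vector."""
    # Seed from character codes so the result is deterministic across runs.
    rng = random.Random(sum(ord(c) for c in instruction))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def refine(guidance):
    """Hypothetical 'thinking' step: one more pass over the encoder state.

    The tanh update is an arbitrary placeholder, not the paper's rule.
    """
    return [math.tanh(1.5 * g) for g in guidance]


def denoise_step(sample, guidance, rate=0.2):
    """Toy denoiser: move the sample a fraction of the way toward the guidance."""
    return [x + rate * (g - x) for x, g in zip(sample, guidance)]


def generate(instruction, steps=10, refine_every=0, seed=0):
    """Run the toy denoising loop.

    refine_every=0 encodes the instruction once at the start.
    refine_every=k refines the guidance every k denoising steps.
    Returns the final sample and the number of guidance updates performed.
    """
    rng = random.Random(seed)
    guidance = encode(instruction)
    sample = [rng.gauss(0.0, 1.0) for _ in guidance]  # start from noise
    updates = 0
    for t in range(steps):
        if refine_every and t > 0 and t % refine_every == 0:
            guidance = refine(guidance)  # encoder "thinks" again
            updates += 1
        sample = denoise_step(sample, guidance)  # condition on current guidance
    return sample, updates


if __name__ == "__main__":
    _, once = generate("solve the maze", refine_every=0)
    _, iterative = generate("solve the maze", refine_every=2)
    print(once, iterative)  # 0 4
```

In this sketch the only difference between the two modes is whether the conditioning vector is frozen before the loop or updated inside it. How the refinement is actually computed and scheduled in the paper is not specified here.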