EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei et al. | March 12, 2026 | arXiv

Key Takeaway

Diffusion models solve complex reasoning tasks better when the language encoder reasons iteratively and updates its guidance throughout generation, rather than encoding the instruction once at the start.

Summary

This paper improves how diffusion models solve complex reasoning tasks by making the language model encoder reason step by step. Instead of encoding the instruction once, the system iteratively refines the encoder's internal reasoning and feeds it progressively into the image generation process, reaching 92% accuracy on spatial reasoning tasks such as mazes and puzzles.
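The contrast between encoding once and refining guidance during generation can be sketched as a toy loop. This is a minimal illustration of the control flow only, not the paper's implementation: `encode`, `refine`, `denoise_step`, and `generate` are hypothetical stand-ins, the vectors are plain lists of floats, and the numbers are built so that refined guidance moves closer to the instruction by construction.

```python
from typing import List, Tuple

Vector = List[float]


def encode(instruction: Vector) -> Vector:
    """Hypothetical one-shot encoding: a coarse first pass over the instruction."""
    return [0.5 * t for t in instruction]


def refine(guidance: Vector, instruction: Vector, rate: float = 0.5) -> Vector:
    """Hypothetical 'thinking' step: move the guidance part of the way
    toward the instruction embedding."""
    return [g + rate * (t - g) for g, t in zip(guidance, instruction)]


def denoise_step(sample: Vector, guidance: Vector, step_size: float = 0.3) -> Vector:
    """Toy denoiser: nudge the sample toward the current guidance."""
    return [x + step_size * (g - x) for x, g in zip(sample, guidance)]


def generate(
    instruction: Vector,
    sample: Vector,
    num_steps: int,
    iterative: bool = True,
) -> Tuple[Vector, int]:
    """Run a toy generation loop and return (final sample, encoder calls).

    iterative=False: guidance is computed once and reused at every step.
    iterative=True:  guidance is refined before each denoising step.
    """
    guidance = encode(instruction)
    encoder_calls = 1
    for _ in range(num_steps):
        if iterative:
            guidance = refine(guidance, instruction)
            encoder_calls += 1
        sample = denoise_step(sample, guidance)
    return sample, encoder_calls


if __name__ == "__main__":
    target = [1.0, -1.0]
    start = [0.0, 0.0]
    print("encode once:", generate(target, start, num_steps=4, iterative=False))
    print("iterative:  ", generate(target, start, num_steps=4, iterative=True))
```

In this sketch the only structural difference between the two modes is where the guidance update sits: outside the denoising loop for the encode-once baseline, inside it for the iterative variant. The extra encoder calls are the cost of that choice.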

reasoning, multimodal, architecture

Key Terms

chain-of-thought, diffusion-models, vision-language-model, iterative-refinement, spatial-reasoning