Training data structure, not model architecture, is why parallel language models revert to sequential generation: fix the training data to unlock ...
Diffusion language models promise faster parallel text generation, but in practice they often generate tokens one at a time, like traditional autoregressive models. This paper argues the cause is the training data rather than the architecture: sequential training data pushes models toward sequential generation.