Training data structure, not model architecture, is why parallel language models revert to sequential generation: fix the training data to unlock ...
Diffusion language models promise faster parallel text generation, but in practice they often generate tokens one at a time, like traditional autoregressive models. This paper argues the cause is the training data rather than the architecture: sequential training data pushes models toward sequential generation.