Diffusion language models can be trained more effectively by embedding a simulated denoising trajectory into training via a memory mechanism, and that same mechanism can be reused at inference time to improve long-context retrieval.
This paper addresses a key train/inference mismatch in diffusion language models: they're trained one way (predicting masked tokens at a single corruption level) but used another (multi-step iterative denoising). MemDLM closes this gap by simulating the denoising process during training with a memory mechanism that learns from each sample's trajectory, leading to faster training and better long-context performance.
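To make the mismatch concrete, here is a toy sketch (not MemDLM's actual code; `predict_tokens` is a hypothetical stand-in for the model): a training step sees one randomly masked version of a sequence and predicts once, while inference starts fully masked and commits predictions over several denoising steps.

```python
import random

MASK = -1  # sentinel for a masked token position

def predict_tokens(seq):
    # Stand-in for the model: proposes a token for every masked position.
    return [random.randrange(10) if t == MASK else t for t in seq]

def training_step(seq, mask_rate=0.5):
    # Training view: one corruption level, one masked-prediction pass.
    corrupted = [MASK if random.random() < mask_rate else t for t in seq]
    return predict_tokens(corrupted)

def iterative_denoise(length, steps=4):
    # Inference view: start fully masked, unmask a fraction each step.
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        proposal = predict_tokens(seq)
        # Commit enough positions so everything is filled by the last step.
        k = max(1, len(masked) // (steps - step))
        for i in random.sample(masked, k):
            seq[i] = proposal[i]
    return seq

out = iterative_denoise(8)
assert MASK not in out  # all positions filled after the denoising loop
```

The point of the sketch is that `training_step` never experiences the partially denoised intermediate states that `iterative_denoise` produces; simulating that trajectory during training is the gap the paper's memory mechanism targets.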