Object removal in video can be treated as a physics simulation problem: first identify what changes in the scene when an object is removed, then use a diffusion model guided by causal reasoning to generate a realistic result.
VOID removes objects from videos while keeping the physics realistic, for example by correcting how other objects move or collide once the removed object is gone. It uses a vision-language model to identify the regions affected by the removal and a diffusion model to generate physically plausible outcomes. It is trained on synthetic data in which physical interactions are carefully controlled.
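To make the two-stage idea concrete, here is a minimal sketch of how the output of the first stage might be handed to the second. It assumes the removed object and the affected regions are each given as a per-frame binary mask, and merges them into one label map that a generator could condition on. The names `label_frame`, `regenerate_fraction`, and the labels `KEEP`, `REMOVED`, `AFFECTED` are hypothetical and chosen for illustration; they are not taken from VOID, and the actual mask representation may differ.

```python
from typing import List

# One frame as rows of 0/1 values (hypothetical representation).
Mask = List[List[int]]

# Hypothetical labels for the merged map.
KEEP, REMOVED, AFFECTED = 0, 1, 2


def label_frame(object_mask: Mask, affected_mask: Mask) -> Mask:
    """Merge two binary masks into a single label map for one frame.

    Pixels of the removed object take priority over affected-region pixels.
    """
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same height")
    labels: Mask = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("masks must have the same width")
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj:
                row.append(REMOVED)
            elif aff:
                row.append(AFFECTED)
            else:
                row.append(KEEP)
        labels.append(row)
    return labels


def regenerate_fraction(labels: Mask) -> float:
    """Fraction of pixels in the frame that would need to be regenerated."""
    total = sum(len(row) for row in labels)
    changed = sum(1 for row in labels for value in row if value != KEEP)
    return changed / total if total else 0.0


# Toy 2x3 frame: the object covers one pixel, and its removal
# affects one neighbouring pixel (e.g. something it was supporting).
object_mask = [[1, 0, 0], [0, 0, 0]]
affected_mask = [[1, 1, 0], [0, 0, 0]]
labels = label_frame(object_mask, affected_mask)
print(labels)  # [[1, 2, 0], [0, 0, 0]]
```

Keeping the removed object and the affected region as separate labels, rather than one merged binary mask, lets a downstream generator distinguish "fill in the background here" from "re-simulate what happens here"; whether VOID makes this distinction in the same way is an assumption of this sketch.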