VOID: Video Object and Interaction Deletion

Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan et al. | April 2, 2026 | arXiv

Key Takeaway

Object removal in video can be improved by treating it as a physics simulation problem: identify what would change in the scene once the object is gone, then use a diffusion model guided by that causal reasoning to generate a realistic result.

Summary

VOID removes objects from videos while keeping the rest of the scene physically plausible, for example by correcting how other objects move or collide once the removed object is no longer there. It uses a vision-language model to identify the regions affected by the removal and a diffusion model to generate physically plausible outcomes. Training uses synthetic data in which physical interactions are carefully controlled.
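To make the idea of "affected regions" concrete, here is a minimal sketch of how a per-frame edit mask might be assembled before generation. It is an illustration under assumptions, not the authors' implementation: the function names (`blank_mask`, `union_masks`, `coverage`) and the toy masks are hypothetical, and neither the vision-language model that would produce the affected-region mask nor the diffusion model that would fill it in is modeled here.

```python
from typing import List

# A video mask indexed as mask[frame][row][col]; True marks pixels to regenerate.
Mask = List[List[List[bool]]]


def blank_mask(frames: int, height: int, width: int) -> Mask:
    """Create an all-False mask for a video of the given size."""
    return [[[False] * width for _ in range(height)] for _ in range(frames)]


def union_masks(object_mask: Mask, affected_mask: Mask) -> Mask:
    """Combine the removed object's mask with the interaction-affected mask.

    Assumption: both masks have identical shape. The result marks every pixel
    that a generative model would be asked to re-synthesize.
    """
    return [
        [
            [o or a for o, a in zip(obj_row, aff_row)]
            for obj_row, aff_row in zip(obj_frame, aff_frame)
        ]
        for obj_frame, aff_frame in zip(object_mask, affected_mask)
    ]


def coverage(mask: Mask) -> List[float]:
    """Fraction of masked pixels in each frame."""
    return [
        sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))
        for frame in mask
    ]


# Toy example: a 2-frame, 4x4 video.
# The removed object occupies the top-left 2x2 block in both frames.
object_mask = blank_mask(2, 4, 4)
for f in range(2):
    for y in range(2):
        for x in range(2):
            object_mask[f][y][x] = True

# Hypothetical affected region: in frame 1 only, the bottom-right 2x2 block
# changes because another object would have moved differently.
affected_mask = blank_mask(2, 4, 4)
for y in range(2, 4):
    for x in range(2, 4):
        affected_mask[1][y][x] = True

edit_mask = union_masks(object_mask, affected_mask)
frame_coverage = coverage(edit_mask)
```

In this toy case the edit region grows in the second frame, which reflects the point of the summary: removing an object can require regenerating more than the pixels the object itself occupied.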

multimodal, applications, reasoning

Key Terms

video-object-removal, counterfactual-generation, diffusion-model, vision-language-model, physical-plausibility