VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo et al. | March 24, 2026 | arXiv

Key Takeaway

Adding tactile (touch) sensing to video-based robot learning models significantly improves performance on tasks requiring precise force control and contact awareness, without needing separate tactile pretraining.

Summary

This paper introduces VTAM, a robot learning system that combines video and touch (tactile) sensing to better understand and perform complex physical tasks.

multimodal applications

Key Terms

multimodal-fusion, tactile-perception, world-model, modality-transfer, contact-rich-manipulation