SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics: identifying who is speaking, judging when to interrupt, and generating natural interruptions. Evaluating 12 leading models shows that models which accurately perceive audio-visual information often fail to produce contextually appropriate conversational responses; understanding what is happening in a conversation does not automatically translate into responding appropriately in real dialogue, so perception and interaction are separate skills that need independent evaluation.