SocialOmni is a benchmark that tests how well audio-visual AI models handle natural conversation dynamics: identifying who is speaking, judging when to interrupt, and generating natural interruptions. Evaluating 12 leading models shows that models which accurately perceive audio-visual information often fail to produce contextually appropriate conversational responses; understanding what is happening in a conversation does not automatically translate into responding appropriately in real dialogue, so perception and interaction are separate skills that need independent evaluation.