A model designed to understand and reason about both visual content (images) and natural language text together.
Quality of vision, audio, and image understanding (distinct from modality support)