An AI model that processes and understands spoken audio directly, without first converting speech to text.
Quality of vision, audio, and image understanding (distinct from whether a modality is supported at all)