A conversational interaction in which the model understands and responds to inputs that combine text and images in a natural back-and-forth exchange.
Adhering to complex, structured, or constrained instructions
Quality of vision, audio, and image understanding (distinct from modality support)