A neural network design that learns to match images and text by training them to have similar representations, enabling tasks like image search and visual understanding.
Quality of vision, audio, and image understanding (distinct from modality support)