A training approach that teaches a model to relate audio to text descriptions by learning from large unlabeled datasets.
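The entry above does not name a training objective. One common way to learn audio-text correspondence is a symmetric contrastive loss over paired embeddings, and the sketch below assumes that choice. The function names, the temperature value, and the toy vectors standing in for encoder outputs are all hypothetical and used only for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Assumption: audio_emb[i] and text_emb[i] come from the same audio-text pair.
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text direction: each audio clip should pick its own caption.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each caption should pick its own audio clip.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

# Toy stand-ins for encoder outputs (hypothetical data).
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = symmetric_contrastive_loss(audio, matched_text)
shuffled_loss = symmetric_contrastive_loss(audio, shuffled_text)
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large. That gap is the signal such a training setup would use to pull matching audio and text together.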
Quality of the model's vision, audio, and image understanding, as distinct from whether it supports those modalities.