A neural network trained to convert raw audio waveforms into vector representations (embeddings) that preserve information about speech content and speaker identity.
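As a rough illustration of the input/output shape of such a network, the sketch below maps a raw waveform to one vector per frame and then to a single pooled vector. Everything in it is an assumption made for illustration: the names `frame_signal` and `ToyAudioEncoder` are hypothetical, the weights are random and untrained, and a single linear layer with `tanh` stands in for a real learned model, so the outputs do not actually preserve speech content or speaker identity.

```python
import numpy as np


def frame_signal(waveform, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames, one frame per row."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return waveform[idx]


class ToyAudioEncoder:
    """Hypothetical stand-in for a trained audio encoder (random weights)."""

    def __init__(self, frame_len=400, hop=160, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.frame_len = frame_len
        self.hop = hop
        # In a real encoder these parameters would be learned from data.
        self.W = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)
        self.b = np.zeros(dim)

    def encode(self, waveform):
        """Return one embedding per frame, shape (n_frames, dim)."""
        frames = frame_signal(waveform, self.frame_len, self.hop)
        return np.tanh(frames @ self.W + self.b)

    def pool(self, waveform):
        """Average the frame embeddings into one utterance-level vector."""
        return self.encode(waveform).mean(axis=0)


# One second of a 220 Hz tone at 16 kHz as placeholder "raw audio".
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 220 * t).astype(np.float32)

encoder = ToyAudioEncoder()
frame_vectors = encoder.encode(waveform)    # shape (98, 64)
utterance_vector = encoder.pool(waveform)   # shape (64,)
```

With 25 ms frames and a 10 ms hop at 16 kHz (the assumed defaults above), one second of audio yields 98 frame vectors. Per-frame vectors are the kind of output typically used for content-level tasks, while a pooled vector is the kind typically used for utterance-level or speaker-level tasks.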
Quality of vision (including image) and audio understanding, as distinct from modality support.