The task of automatically generating natural-language descriptions of images, that is, converting visual information into written text.
Quality of vision, audio, and image understanding (distinct from modality support, i.e., whether a modality is accepted as input at all).