A design pattern that connects a vision encoder to a language model, enabling the language model to understand and describe images.
Quality of vision, audio, and image understanding (distinct from modality support)