A lightweight connector module that bridges a frozen image encoder and a language model, mapping the encoder's visual features into the embedding space the language model takes as input.
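As a rough illustration of the idea, the sketch below treats the connector as a single linear projection from the encoder's feature size to the language model's embedding size. This is an assumption made for the example: the text above does not specify the connector's design, and real connectors may be MLPs or attention-based modules. The names `make_projector` and `project`, the dimensions 8 and 16, and the stand-in patch features are all hypothetical.

```python
import random


def make_projector(vision_dim, lm_dim, seed=0):
    """Create the weights and bias of one linear layer (the only trainable part here)."""
    rng = random.Random(seed)
    scale = vision_dim ** -0.5  # keep projected values at a similar magnitude
    weights = [[rng.gauss(0.0, scale) for _ in range(lm_dim)]
               for _ in range(vision_dim)]
    bias = [0.0] * lm_dim
    return weights, bias


def project(patch_features, weights, bias):
    """Map each patch vector (length vision_dim) to a vector of length lm_dim."""
    return [
        [sum(x * w_row[j] for x, w_row in zip(patch, weights)) + bias[j]
         for j in range(len(bias))]
        for patch in patch_features
    ]


# Stand-in for the frozen encoder's output: 4 image patches, 8 features each.
patch_features = [[float(i + j) for j in range(8)] for i in range(4)]

weights, bias = make_projector(vision_dim=8, lm_dim=16)
visual_tokens = project(patch_features, weights, bias)

# Each patch is now a vector of the language model's embedding size.
print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

In this sketch the encoder output is left unchanged and only `weights` and `bias` would be trained, which is what keeps the connector lightweight compared with the two models it joins.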
Quality of vision, audio, and image understanding (distinct from modality support)