Querying Transformer

architecture

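A neural network component that bridges an image encoder (typically kept frozen) and a language model. A small set of learned query tokens cross-attends to the encoder's image features, extracts the visual information most relevant to text, and outputs a fixed number of vectors in a representation the language model can consume.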
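
To make the bridging idea concrete, here is a minimal sketch in plain Python. It assumes a single cross-attention layer with one head and random weights standing in for trained parameters. The class `QueryBridge` and all of its parameter names are hypothetical and chosen for illustration only. Published designs of this kind (for example the Q-Former in BLIP-2) stack several transformer layers, use multi-head attention, and let the queries also attend to each other.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random values standing in for trained weights or encoder outputs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class QueryBridge:
    """Hypothetical one-layer, one-head stand-in for a querying transformer."""

    def __init__(self, num_queries, d_image, d_model, d_text, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        # Learned query tokens: their count fixes the number of output vectors.
        self.queries = random_matrix(num_queries, d_model, rng)
        self.w_k = random_matrix(d_image, d_model, rng)   # image features -> keys
        self.w_v = random_matrix(d_image, d_model, rng)   # image features -> values
        self.w_out = random_matrix(d_model, d_text, rng)  # projection to text width

    def attention_weights(self, image_features):
        """Return one softmax distribution over image patches per query."""
        keys = matmul(image_features, self.w_k)
        keys_t = [list(col) for col in zip(*keys)]
        scores = matmul(self.queries, keys_t)
        scale = math.sqrt(self.d_model)
        return [softmax([s / scale for s in row]) for row in scores]

    def __call__(self, image_features):
        """Map (num_patches x d_image) features to (num_queries x d_text) vectors."""
        weights = self.attention_weights(image_features)
        values = matmul(image_features, self.w_v)
        pooled = matmul(weights, values)
        return matmul(pooled, self.w_out)

# Two "images" with different patch counts yield the same number of output vectors.
rng = random.Random(1)
bridge = QueryBridge(num_queries=4, d_image=6, d_model=8, d_text=5)
out_small = bridge(random_matrix(9, 6, rng))
out_large = bridge(random_matrix(16, 6, rng))
print(len(out_small), len(out_small[0]))
print(len(out_large), len(out_large[0]))
```

The point of the sketch is the shape contract: however many patch features the image encoder emits, the language model receives `num_queries` vectors of width `d_text`, which keeps the visual prefix short and constant in length.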

Related Capabilities

Multimodal

Quality of vision, audio, and image understanding (distinct from modality support)
