A model trained on image-text pairs to map both images and text into a shared vector space, so that an image and a matching description end up with similar representations.
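
To illustrate what a shared representation makes possible, here is a minimal sketch of matching an image to a caption by comparing vectors. The embeddings are hand-written, hypothetical values standing in for the outputs of an image encoder and a text encoder. The names `cosine_similarity` and `best_caption` are also assumptions made for this example and do not belong to any particular library.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_vec, caption_vecs):
    """Pick the caption whose embedding is most similar to the image embedding."""
    return max(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
    )


# Hypothetical embeddings: a real model would produce these from its encoders,
# typically with hundreds of dimensions rather than three.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}

match = best_caption(image_embedding, text_embeddings)
print(match)  # a photo of a dog
```

Because images and text live in the same space, the same comparison works in either direction: a text vector can be used to rank images just as an image vector is used here to rank captions.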