You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.
This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus a large pool of unpaired images and text, applying optimal transport to learn how the two models' embedding spaces relate to each other.
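To make the optimal-transport idea concrete, here is a minimal, self-contained sketch of the core mechanism: entropy-regularized optimal transport (Sinkhorn iterations) matching unpaired "image" embeddings to "text" embeddings by cost alone. This is not SOTAlign's actual training objective — the data, dimensions, and regularization value below are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Returns a soft transport plan whose (i, j) entry says how much
    mass flows from row item i to column item j under `cost`.
    """
    n, m = cost.shape
    a = np.ones(n) / n            # uniform marginal over rows
    b = np.ones(m) / m            # uniform marginal over columns
    K = np.exp(-cost / reg)       # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)           # rescale rows toward marginal a
        v = b / (K.T @ u)         # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
# Toy "image" embeddings, and "text" embeddings that are a shuffled,
# noisy copy of them -- a stand-in for unpaired modalities whose
# true correspondence is hidden.
img = rng.normal(size=(5, 4))
perm = rng.permutation(5)
txt = img[perm] + 0.01 * rng.normal(size=(5, 4))

# Squared Euclidean cost between every image/text pair.
cost = ((img[:, None, :] - txt[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(cost)

# The largest entry in each row of the plan recovers the hidden pairing.
recovered = plan.argmax(axis=1)
```

In this toy setup `recovered` reproduces the inverse of the hidden shuffle `perm`, i.e. the transport plan rediscovers which text goes with which image without ever seeing the pairing — the same intuition the paper exploits to stretch a small paired set with abundant unpaired data.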