You can align vision and language models with 10-100x less paired training data by leveraging unpaired images and text separately.
This paper shows how to align vision and language models using far fewer paired examples than current methods require. Instead of needing millions of image-text pairs, SOTAlign uses a small set of paired data plus a large pool of unpaired images and text, applying optimal transport to learn how the two models' embedding spaces relate to each other.
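To make the optimal-transport idea concrete, here is a minimal, self-contained sketch of the core mechanism: entropy-regularized optimal transport (Sinkhorn iterations) matching unpaired "image" embeddings to "text" embeddings by cost alone. This is not SOTAlign's actual training objective — the data, dimensions, and regularization value below are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    Returns a soft transport plan whose (i, j) entry says how much
    mass flows from row item i to column item j under `cost`.
    """
    n, m = cost.shape
    a = np.ones(n) / n            # uniform marginal over rows
    b = np.ones(m) / m            # uniform marginal over columns
    K = np.exp(-cost / reg)       # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)           # rescale rows toward marginal a
        v = b / (K.T @ u)         # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
# Toy "image" embeddings, and "text" embeddings that are a shuffled,
# noisy copy of them -- a stand-in for unpaired modalities whose
# true correspondence is hidden.
img = rng.normal(size=(5, 4))
perm = rng.permutation(5)
txt = img[perm] + 0.01 * rng.normal(size=(5, 4))

# Squared Euclidean cost between every image/text pair.
cost = ((img[:, None, :] - txt[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(cost)

# The largest entry in each row of the plan recovers the hidden pairing.
recovered = plan.argmax(axis=1)
```

In this toy setup `recovered` reproduces the inverse of the hidden shuffle `perm`, i.e. the transport plan rediscovers which text goes with which image without ever seeing the pairing — the same intuition the paper exploits to stretch a small paired set with abundant unpaired data.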