You can teach vision-language models to understand compositional meaning by aligning text and images at the concept level while preserving fine-grained visual information, without custom training data and without hurting general performance.
The paper improves how these models distinguish combinations of concepts (e.g., "red car" vs. "blue car") without sacrificing their zero-shot ability to recognize new objects.
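To make the "red car" vs. "blue car" failure mode concrete, a minimal sketch of how compositional understanding is typically probed in CLIP-style models: the image is scored against a correct caption and an attribute-swapped hard negative, and the model "passes" if the correct caption scores higher. The embeddings below are hand-made stand-ins, not outputs of any real encoder, and none of the names come from the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the usual CLIP-style matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for encoder outputs.
# In a real model these would come from the image and text encoders.
img_red_car = np.array([1.0, 0.2, 0.0])   # image of a red car
cap_red_car = np.array([0.9, 0.3, 0.1])   # correct caption: "a red car"
cap_blue_car = np.array([0.2, 1.0, 0.5])  # hard negative: "a blue car"

scores = {
    "a red car": cosine(img_red_car, cap_red_car),
    "a blue car": cosine(img_red_car, cap_blue_car),
}

# The model handles this composition if the correct caption wins.
best = max(scores, key=scores.get)
print(best)  # → a red car
```

A model with weak compositional understanding often assigns near-identical scores to both captions, because its pooled representation discards which attribute binds to which object; concept-level alignment is aimed at keeping that binding information in the embedding.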