You can teach vision-language models to understand compositional meaning by aligning text and images at the concept level while preserving fine-grained visual information, without custom training data and without hurting general performance.
The paper improves how these models distinguish combinations of concepts (e.g., "red car" vs. "blue car") without sacrificing their zero-shot ability to recognize new objects.
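To make the "red car" vs. "blue car" failure mode concrete, a minimal sketch of how compositional understanding is typically probed in CLIP-style models: the image is scored against a correct caption and an attribute-swapped hard negative, and the model "passes" if the correct caption scores higher. The embeddings below are hand-made stand-ins, not outputs of any real encoder, and none of the names come from the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the usual CLIP-style matching score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings standing in for encoder outputs.
# In a real model these would come from the image and text encoders.
img_red_car = np.array([1.0, 0.2, 0.0])   # image of a red car
cap_red_car = np.array([0.9, 0.3, 0.1])   # correct caption: "a red car"
cap_blue_car = np.array([0.2, 1.0, 0.5])  # hard negative: "a blue car"

scores = {
    "a red car": cosine(img_red_car, cap_red_car),
    "a blue car": cosine(img_red_car, cap_blue_car),
}

# The model handles this composition if the correct caption wins.
best = max(scores, key=scores.get)
print(best)  # → a red car
```

A model with weak compositional understanding often assigns near-identical scores to both captions, because its pooled representation discards which attribute binds to which object; concept-level alignment is aimed at keeping that binding information in the embedding.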