State space models are a viable and more parameter-efficient alternative to vision transformers as the visual encoder in vision-language models, challenging the assumption that transformer backbones are necessary for this role.
This paper tests whether state space models (SSMs) can replace vision transformers as the visual backbone in vision-language models. The researchers find that SSM-based vision encoders match or outperform transformer-based encoders on VQA and visual grounding tasks, while using fewer parameters. They also identify instability issues in some backbones and propose fixes to improve robustness.
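To make the architectural swap concrete, below is a minimal sketch (not the paper's code) of a LLaVA-style vision-language model in which the vision backbone is an interchangeable module: the same wiring works whether `vision_encoder` is a ViT or an SSM-based encoder, since both produce a sequence of visual tokens. All class and variable names here (`VisionLanguageModel`, `DummyEncoder`, `vision_dim`, `text_dim`) are hypothetical illustrations, not identifiers from the paper.

```python
# Hypothetical sketch: a VLM whose vision backbone (ViT or SSM) is swappable.
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    """Wires a vision encoder to a language model via a linear projector."""

    def __init__(self, vision_encoder: nn.Module, vision_dim: int,
                 language_model: nn.Module, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder               # ViT or SSM backbone
        self.projector = nn.Linear(vision_dim, text_dim)   # visual -> text space
        self.language_model = language_model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into a sequence of visual tokens: (B, N, vision_dim)
        visual_tokens = self.vision_encoder(images)
        # Project visual tokens into the language model's embedding space
        visual_embeds = self.projector(visual_tokens)
        # Prepend visual tokens to the text embeddings and run the language model
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.language_model(inputs)


if __name__ == "__main__":
    # Stand-in encoder just to show the interface; a real setup would plug in
    # a pretrained ViT or an SSM-based vision backbone here instead.
    class DummyEncoder(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.patchify = nn.Conv2d(3, dim, kernel_size=56, stride=56)

        def forward(self, images):
            feats = self.patchify(images)              # (B, dim, H', W')
            return feats.flatten(2).transpose(1, 2)    # (B, N, dim)

    vlm = VisionLanguageModel(
        vision_encoder=DummyEncoder(dim=768),
        vision_dim=768,
        language_model=nn.Identity(),   # stand-in for an actual LLM
        text_dim=4096,
    )
    images = torch.randn(2, 3, 224, 224)
    text_embeds = torch.randn(2, 8, 4096)
    print(vlm(images, text_embeds).shape)   # (2, 16 + 8, 4096)
```

The design point this illustrates is that only the token dimension and count of the visual backbone matter to the rest of the pipeline, which is what makes a like-for-like comparison between transformer and SSM encoders possible.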