Vision-language models' safety decisions are easily manipulated by semantic cues—they rely on learned associations rather than grounded reasoning about actual danger, which is a critical vulnerability for real-world deployment.
This paper reveals that vision-language models make safety decisions based on surface-level visual and textual cues rather than genuine understanding of dangerous situations. Researchers created a benchmark and steering framework showing that simple changes to how a scene is described or presented can flip safety judgments, exposing a vulnerability in how these models assess risk.