Multimodal AI systems need safety defenses that cover attacks across all input modalities jointly; defending text alone or audio alone is not enough.
This paper shows that spoken language models (which process both speech and text) can be attacked more effectively by perturbing both modalities simultaneously rather than just one. The researchers developed JAMA, a method that jointly optimizes adversarial text and audio perturbations to bypass safety guardrails, achieving attack success rates 1.5x to 10x higher than single-modality attacks.
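To make the joint-optimization idea concrete, here is a minimal toy sketch. It is not the paper's JAMA algorithm: the differentiable "refusal score" below is an invented surrogate (a simple sigmoid over random weights standing in for a safety filter), and the attack is plain projected gradient descent on both modalities at once under per-modality L∞ budgets. A real attack would use the target model's gradients or query-based estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented surrogate for a safety filter: high score = request refused.
# w_text / w_audio are stand-ins for model parameters (an assumption).
w_text = rng.normal(size=16)
w_audio = rng.normal(size=32)

def refusal_score(text_emb, audio):
    z = w_text @ text_emb + w_audio @ audio
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid in (0, 1)

# A fixed "harmful request": a text embedding plus an audio waveform.
text_emb = rng.normal(size=16)
audio = rng.normal(size=32)

# Joint attack: one perturbation per modality, each with its own budget.
delta_t = np.zeros(16)
delta_a = np.zeros(32)
eps_t, eps_a, lr = 0.5, 0.05, 0.1

for _ in range(100):
    s = refusal_score(text_emb + delta_t, audio + delta_a)
    grad_s = s * (1.0 - s)  # d(sigmoid)/dz, shared by both modalities
    # Descend the refusal score in BOTH modalities simultaneously,
    # projecting each perturbation back into its L-infinity ball.
    delta_t = np.clip(delta_t - lr * grad_s * w_text, -eps_t, eps_t)
    delta_a = np.clip(delta_a - lr * grad_s * w_audio, -eps_a, eps_a)

before = refusal_score(text_emb, audio)
after = refusal_score(text_emb + delta_t, audio + delta_a)
print(f"refusal score: {before:.3f} -> {after:.3f}")
```

The key structural point it illustrates is that both perturbations are updated inside one loop against one shared objective, so the optimizer can trade budget between modalities, rather than attacking text or audio in isolation.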