Multimodal AI systems need safety defenses that cover attacks across all input modalities jointly; defending text alone or audio alone is not enough.
This paper shows that spoken language models (which process both speech and text) can be attacked more effectively by perturbing both modalities simultaneously rather than just one. The researchers developed JAMA, a method that jointly optimizes adversarial text and audio perturbations to bypass safety guardrails, achieving attack success rates 1.5x to 10x higher than single-modality attacks.
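To make the joint-optimization idea concrete, here is a minimal toy sketch. It is not the paper's JAMA algorithm: the differentiable "refusal score" below is an invented surrogate (a simple sigmoid over random weights standing in for a safety filter), and the attack is plain projected gradient descent on both modalities at once under per-modality L∞ budgets. A real attack would use the target model's gradients or query-based estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented surrogate for a safety filter: high score = request refused.
# w_text / w_audio are stand-ins for model parameters (an assumption).
w_text = rng.normal(size=16)
w_audio = rng.normal(size=32)

def refusal_score(text_emb, audio):
    z = w_text @ text_emb + w_audio @ audio
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid in (0, 1)

# A fixed "harmful request": a text embedding plus an audio waveform.
text_emb = rng.normal(size=16)
audio = rng.normal(size=32)

# Joint attack: one perturbation per modality, each with its own budget.
delta_t = np.zeros(16)
delta_a = np.zeros(32)
eps_t, eps_a, lr = 0.5, 0.05, 0.1

for _ in range(100):
    s = refusal_score(text_emb + delta_t, audio + delta_a)
    grad_s = s * (1.0 - s)  # d(sigmoid)/dz, shared by both modalities
    # Descend the refusal score in BOTH modalities simultaneously,
    # projecting each perturbation back into its L-infinity ball.
    delta_t = np.clip(delta_t - lr * grad_s * w_text, -eps_t, eps_t)
    delta_a = np.clip(delta_a - lr * grad_s * w_audio, -eps_a, eps_a)

before = refusal_score(text_emb, audio)
after = refusal_score(text_emb + delta_t, audio + delta_a)
print(f"refusal score: {before:.3f} -> {after:.3f}")
```

The key structural point it illustrates is that both perturbations are updated inside one loop against one shared objective, so the optimizer can trade budget between modalities, rather than attacking text or audio in isolation.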