An LLM's text-only auditory knowledge is a strong predictor of how well it will perform on audio tasks, so you can screen candidate LLMs by testing their text-only audio knowledge before building an audio-language model on top of them.
This paper investigates how much LLMs actually know about sound and audio from text-only training, and whether that knowledge predicts how well they perform on audio tasks. The researchers tested different LLMs in three ways: probing their auditory knowledge directly, having them reason over text descriptions of audio, and fine-tuning them into full audio-language models.
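One way to check a "predicts" claim of this kind is to rank-correlate each LLM's text-only score with the score of the audio-language model built from it. The sketch below is an illustration only, not the paper's method: the choice of Spearman correlation is an assumption, the model names are hypothetical, and the numbers are made-up placeholders rather than reported results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores (NOT from the paper); model names are hypothetical.
text_only_scores = {"llm_a": 0.41, "llm_b": 0.55, "llm_c": 0.62, "llm_d": 0.70}
audio_task_scores = {"llm_a": 0.38, "llm_b": 0.52, "llm_c": 0.49, "llm_d": 0.66}

names = sorted(text_only_scores)
rho = spearman(
    [text_only_scores[n] for n in names],
    [audio_task_scores[n] for n in names],
)
print(f"Spearman rho = {rho:.2f}")
```

A rank correlation is used here because it only asks whether the ordering of models is preserved, which is what matters when choosing a base LLM; with only a handful of models, any such coefficient should be read cautiously.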