How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang et al. | March 19, 2026 | arXiv

Key Takeaway

An LLM's text-only auditory knowledge is a strong predictor of how well it will perform in audio tasks, so you can screen candidate LLM backbones with text-only tests of auditory knowledge before building an audio-language model on top of them.

Summary

This paper investigates how much knowledge about sound and audio LLMs acquire from text-only training, and whether that knowledge predicts how well they perform on audio tasks. The researchers tested different LLMs in three ways: probing their auditory knowledge directly, having them reason over text descriptions of audio, and fine-tuning them into full audio-language models.
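One way to check a claim like "text-only auditory knowledge predicts audio-task performance" is to rank-correlate the two sets of scores across LLM backbones. The sketch below is only an illustration under stated assumptions: the model names and all scores are invented placeholders rather than results from the paper, and Spearman correlation is one reasonable choice of statistic, not necessarily the one the authors used.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores, invented for illustration only.
# text_probe: accuracy on a text-only auditory-knowledge probe.
# audio_task: accuracy after fine-tuning the same LLM into an audio-language model.
text_probe = {"llm_a": 0.41, "llm_b": 0.55, "llm_c": 0.62, "llm_d": 0.70}
audio_task = {"llm_a": 0.38, "llm_b": 0.52, "llm_c": 0.49, "llm_d": 0.66}

names = sorted(text_probe)
rho = spearman([text_probe[n] for n in names], [audio_task[n] for n in names])
print(f"Spearman rho = {rho:.2f}")
```

A rho close to 1 would mean the text-only probe orders the backbones almost the same way the audio tasks do, which is the condition under which the probe is useful for screening.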

evaluation multimodal training

Key Terms

large-audio-language-model audio-encoder audio-captioner auditory-knowledge