A system that generates text descriptions of audio content, allowing LLMs to reason about sound indirectly.
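
The flow described above can be sketched in a few lines: audio goes in, text descriptions come out, and those descriptions are placed in a prompt for a language model. This is a minimal illustration under stated assumptions, not an actual implementation. The names `Caption`, `caption_audio`, and `build_prompt` are hypothetical, and the loudness-based labeller is only a toy stand-in for a real audio captioning model.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Caption:
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds
    text: str       # text description of the window


def caption_audio(samples: List[float], sample_rate: int) -> List[Caption]:
    """Toy stand-in for an audio captioning model (hypothetical).

    Labels each one-second window by its RMS level. A real system would
    emit richer descriptions, but the interface is the same: audio in,
    text out.
    """
    captions = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        if rms < 0.01:
            text = "silence"
        elif rms < 0.3:
            text = "quiet sound"
        else:
            text = "loud sound"
        end = start + len(window)
        captions.append(Caption(start / sample_rate, end / sample_rate, text))
    return captions


def build_prompt(captions: List[Caption], question: str) -> str:
    """Format the captions as plain text that an LLM can read."""
    lines = [f"[{c.start_s:.1f}s-{c.end_s:.1f}s] {c.text}" for c in captions]
    return "Audio description:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


# One second of silence followed by one second of a loud 440 Hz tone.
rate = 8000
samples = [0.0] * rate + [
    0.8 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)
]
prompt = build_prompt(caption_audio(samples, rate), "When does the sound start?")
print(prompt)
```

The language model never receives the waveform. It sees only the timestamped descriptions in `prompt`, so its reasoning about the sound is limited by what the captioning step chose to put into words.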