The ability of a model to directly understand different types of input (like images or audio) without converting them to text first.