An AI model that can process and interpret multiple types of input data together, such as text, images, and video.