A task where an AI model reads a question and an image, then generates an answer based on what it understands from the image.