Task in which an AI agent navigates physical spaces by following natural-language instructions while processing visual input.
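The loop this definition implies (read an instruction, look at the scene, choose an action) can be sketched with a toy example. Everything here is an illustrative assumption and not part of any real benchmark or library: a small character grid stands in for the physical space, the cell directly ahead stands in for visual input, and keyword spotting stands in for a learned language model. The names `GridWorld`, `parse_instruction`, and `follow` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical toy setup: a grid stands in for the physical space and the
# cell directly ahead of the agent stands in for camera input.
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # north, east, south, west as (dx, dy)


@dataclass
class GridWorld:
    grid: list        # rows of strings; '#' is a wall, '.' is free space
    x: int            # column of the agent
    y: int            # row of the agent
    heading: int = 0  # index into HEADINGS (0 = north)

    def observe(self) -> str:
        """Return the 'visual' input: the cell directly ahead of the agent."""
        dx, dy = HEADINGS[self.heading]
        nx, ny = self.x + dx, self.y + dy
        if 0 <= ny < len(self.grid) and 0 <= nx < len(self.grid[ny]):
            return self.grid[ny][nx]
        return "#"  # treat anything outside the map as a wall

    def step(self, action: str) -> None:
        """Apply one action; a forward move into a wall is ignored."""
        if action == "left":
            self.heading = (self.heading - 1) % 4
        elif action == "right":
            self.heading = (self.heading + 1) % 4
        elif action == "forward" and self.observe() != "#":
            dx, dy = HEADINGS[self.heading]
            self.x += dx
            self.y += dy


def parse_instruction(instruction: str) -> list:
    """Keyword spotting, an assumed stand-in for a learned language encoder."""
    words = instruction.lower().replace(",", " ").replace(".", " ").split()
    return [w for w in words if w in ("forward", "left", "right")]


def follow(env: GridWorld, instruction: str) -> tuple:
    """Execute the instruction step by step, checking the view before moving."""
    for action in parse_instruction(instruction):
        # Combine language (the requested action) with vision (what is ahead).
        if action == "forward" and env.observe() == "#":
            continue  # the view shows a wall, so skip the blocked move
        env.step(action)
    return env.x, env.y


if __name__ == "__main__":
    world = GridWorld(grid=["...", ".#.", "..."], x=0, y=2)
    print(follow(world, "Go forward, then forward, turn right and move forward."))
    # prints (1, 0)
```

A real agent would replace the keyword parser and the one-cell view with learned language and image encoders, but the control flow of observing, grounding the instruction, and acting stays the same.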