A pre-trained model that jointly processes and understands both visual and textual information in a unified representation.