Combining multimodal inputs (such as text and images) in the early layers of a model, so that the modalities are processed jointly, rather than encoding each one separately and merging the results afterward.
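
A minimal sketch of the idea, in plain Python with no ML library. It assumes the common setup in which text tokens and image patches are each mapped to vectors of the same width, concatenated into one sequence, and passed through layers shared by both modalities. The names `embed_text`, `embed_patches`, `shared_layer`, `fuse_early`, `TEXT_TABLE`, and `PATCH_WEIGHTS` are hypothetical, and `shared_layer` is only a toy stand-in (mean mixing) for a real attention or feed-forward layer.

```python
def embed_text(token_ids, table):
    """Look up one embedding vector per text token."""
    return [list(table[t]) for t in token_ids]


def embed_patches(patches, weights):
    """Linearly project each flattened image patch to the embedding width.

    `weights` holds one row per output dimension.
    """
    return [
        [sum(p * w for p, w in zip(patch, row)) for row in weights]
        for patch in patches
    ]


def shared_layer(sequence):
    """Toy shared layer: blend every position with the sequence mean.

    Because the mean runs over text and image positions alike, each
    output vector depends on both modalities.
    """
    width = len(sequence[0])
    mean = [sum(vec[j] for vec in sequence) / len(sequence) for j in range(width)]
    return [[0.5 * vec[j] + 0.5 * mean[j] for j in range(width)] for vec in sequence]


def fuse_early(token_ids, patches, table, weights):
    """Concatenate both modalities into one sequence before any deep layer."""
    fused = embed_text(token_ids, table) + embed_patches(patches, weights)
    return shared_layer(fused)


# Hypothetical toy data: 2 text tokens, 3 image patches, embedding width 4.
TEXT_TABLE = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0]}
PATCH_WEIGHTS = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
patches = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]

out = fuse_early([0, 1], patches, TEXT_TABLE, PATCH_WEIGHTS)
print(len(out), len(out[0]))
```

The output has one vector per input position (text tokens followed by image patches), and the vector at a text position already changes when the image changes. In the contrasting design, each modality would pass through its own encoder first and the two results would be combined only at the end.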