When building federated systems over multi-modal data, you can map each modality into a shared compressed representation space via learnable, modality-specific projections. This reduces both communication overhead and the requirement that every device use an identical model architecture.
This paper presents CoMFed, a federated learning system that lets multiple devices train together on different data modalities (e.g., video and audio) without sharing raw data. It uses compressed representations and alignment techniques to handle heterogeneity in both data types and model structures across devices, while keeping communication costs low.
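The core idea of aligning modalities in a shared compressed space can be illustrated with a minimal sketch. Note that everything below is an assumption for illustration: the dimensions, the random initialization, and the cosine-similarity alignment signal are not taken from the paper, which may use a different projection architecture and training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions: one client holds video features,
# another holds audio features; both project into a shared 64-dim space.
D_VIDEO, D_AUDIO, D_SHARED = 512, 128, 64

# Learnable projection matrices (randomly initialized here; in training
# these would be optimized, e.g. with a contrastive alignment loss).
W_video = rng.normal(0, 1.0 / np.sqrt(D_VIDEO), (D_VIDEO, D_SHARED))
W_audio = rng.normal(0, 1.0 / np.sqrt(D_AUDIO), (D_AUDIO, D_SHARED))

def project(x, W):
    """Project modality-specific features into the shared space and
    L2-normalize so alignment can be measured by cosine similarity."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# A batch of per-device features (stand-ins for real encoder outputs).
video_feats = rng.normal(size=(4, D_VIDEO))
audio_feats = rng.normal(size=(4, D_AUDIO))

z_v = project(video_feats, W_video)   # shape (4, 64)
z_a = project(audio_feats, W_audio)   # shape (4, 64)

# Only the compact 64-dim shared embeddings (and projection updates)
# would cross the network, never the raw 512-/128-dim inputs.
alignment = np.sum(z_v * z_a, axis=-1)  # per-pair cosine similarity
```

The communication saving comes from the bottleneck: devices exchange `D_SHARED`-dimensional vectors rather than raw inputs, and each device keeps its own encoder, so architectures only need to agree on the shared dimension.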