LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu et al.|March 20, 2026arXiv

Key Takeaway

To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.

Summary

LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.

multimodal architecture data

Key Terms

diffusion-models text-to-video-generation self-attention cross-attention intra-group-consistency