To generate videos with multiple people where each person's appearance stays consistent with their attributes, you need both better training data that captures identity-attribute relationships and model attention mechanisms designed to enforce those relationships.
LumosX improves personalized video generation by explicitly linking identities to their attributes. It uses a data pipeline with multimodal AI to extract subject relationships, then applies specialized attention mechanisms in diffusion models to ensure faces stay consistent with their assigned attributes across video frames.