By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.
Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.