Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi et al.|April 2, 2026arXiv

Key Takeaway

By unifying 2D and 3D generation in one model and leveraging plentiful 2D data as a structural constraint, you can train better 3D generators with limited 3D assets—no separate 2D-to-3D conversion pipeline needed.

Summary

Omni123 is a 3D foundation model that generates both 2D images and 3D objects from text by treating them as sequences of tokens. It uses abundant 2D image data as a guide to improve 3D generation, avoiding the need for scarce aligned text-image-3D datasets. The model cycles through different modalities (text→image→3D→image) to ensure consistency across all forms.

multimodal architecture data

Key Terms

autoregressive-model discrete-tokens cross-modal-consistency geometric-consistency 3d-scene-reconstruction