DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

1Shanghai Innovation Institute, 2Fudan University, 3University of Science and Technology of China, 4Shanghai Jiao Tong University, 5Zhejiang University, 6Westlake University, 7Nanjing University, 8University of Southern California
* Core Contributors    Project Leaders

DeepGen Overview


Method Overview.
DeepGen 1.0 adopts a unified VLM–DiT framework with dual-branch visual encoding: a ViT provides semantic features, while a VAE extracts latent representations for the DiT. Multimodal conditioning tokens from the VLM and reference-image VAE latents are concatenated with the target noise tokens into a single DiT sequence, enabling joint self-attention over conditioning and generation. Stacked Channel Bridging (SCB) fuses VLM and DiT features, and positional encodings distinguish reference tokens from target tokens. Icons denote frozen/trainable modules across the pre-training, SFT, and RL stages.
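
To make the sequence layout concrete, below is a minimal PyTorch sketch of the unified sequence construction described above. It assumes an SCB realized as channel-wise concatenation followed by a linear projection, and uses learned token-type embeddings as a stand-in for the positional encodings that separate reference and target tokens. The module names, dimensions, and fusion form are illustrative assumptions, not the released DeepGen implementation.

import torch
import torch.nn as nn

class StackedChannelBridge(nn.Module):
    # Hypothetical SCB: stack VLM and DiT features along the channel axis,
    # then project back to the DiT width. Assumes both inputs share the
    # same sequence length.
    def __init__(self, vlm_dim: int, dit_dim: int):
        super().__init__()
        self.proj = nn.Linear(vlm_dim + dit_dim, dit_dim)

    def forward(self, vlm_feats: torch.Tensor, dit_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([vlm_feats, dit_feats], dim=-1))

class JointAttentionBlock(nn.Module):
    # One DiT block attending jointly over VLM conditioning tokens,
    # reference-image latents, and noisy target tokens, all packed
    # into a single sequence.
    def __init__(self, dit_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dit_dim)
        self.attn = nn.MultiheadAttention(dit_dim, n_heads, batch_first=True)
        # Token-type embeddings (0=VLM condition, 1=reference, 2=target)
        # stand in for the reference/target positional encodings.
        self.token_type = nn.Embedding(3, dit_dim)

    def forward(self, cond, ref_latents, noisy_target):
        seq = torch.cat([cond, ref_latents, noisy_target], dim=1)
        type_ids = torch.cat([
            torch.zeros(cond.size(1), dtype=torch.long, device=seq.device),
            torch.ones(ref_latents.size(1), dtype=torch.long, device=seq.device),
            torch.full((noisy_target.size(1),), 2, dtype=torch.long, device=seq.device),
        ]).unsqueeze(0).expand(seq.size(0), -1)
        seq = seq + self.token_type(type_ids)
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h)  # joint self-attention over the full sequence
        seq = seq + attn_out
        # Only the target positions carry the denoising prediction.
        return seq[:, -noisy_target.size(1):]

# Toy shapes only; real token counts depend on resolution and the VLM.
B, vlm_dim, dit_dim = 2, 1024, 512
bridge = StackedChannelBridge(vlm_dim, dit_dim)
block = JointAttentionBlock(dit_dim)
cond = bridge(torch.randn(B, 77, vlm_dim), torch.randn(B, 77, dit_dim))
ref = torch.randn(B, 256, dit_dim)     # reference-image VAE latents, patchified
target = torch.randn(B, 256, dit_dim)  # noisy target tokens
out = block(cond, ref, target)         # (B, 256, dit_dim)

Packing conditioning and generation tokens into one sequence lets a stock self-attention layer serve both text-to-image generation and reference-based editing without dedicated cross-attention modules.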


Training Data and Evaluation Overview.
Overview of our training data, which covers broad omni-capabilities, and of our comprehensive evaluation across benchmarks.


This figure presents a comparison on image generation and editing benchmarks. Bubble size is proportional to model parameter count. Dashed outer rings indicate models with unreported parameter counts. Higher scores correspond to better performance.

Quantitative Results


This figure presents quantitative results across general image generation and editing benchmarks. Top-1/2/3 results within each column (excluding closed-source models) are marked with gold, silver, and bronze icons.


This figure presents quantitative results of reasoning-based text-to-image generation involving world knowledge on the WISE benchmark. "*" denotes generation with textual CoT reasoning.


This figure presents quantitative results of reasoning-based text-to-image generation on the T2I-CoREBench benchmark, evaluated with Qwen3-VL-32B-Thinking. "*" denotes generation with textual CoT reasoning.


This figure presents quantitative results of reasoning-based editing involving world knowledge on the RISE and UniREditBench benchmarks. "*" denotes generation with textual CoT reasoning.


This figure presents quantitative results of text rendering on the CVTG-2K benchmark.

BibTeX