| Dimension | Seedream 3.0 | Seedream 4.0 | Key Advance |
|---|---|---|---|
| Backbone | MM-DiT (joint text-image attention, fixed-architecture) | Scalable DiT with structured block-sparse attention and 3D RoPE | Architecture complexity decoupled from task count; $O(n^2) \to O(n_\text{out}^2 + n_\text{out} n_\text{ref})$ |
| VAE | 16-ch, $f=8$ (spatial compression); standard reconstruction loss | High-compression-ratio VAE (increased $f$ or adaptive spatial pooling); fewer latent tokens per image | Sequence length $n \propto 1/f^2$; faster attention at fixed quality |
| Training Objective | Flow matching $\mathcal{L}_\text{FM}$ + REPA (representation alignment, $\lambda{=}0.5$, DINOv2-L target) | $\mathcal{L}_\text{FM}$ + adversarial distillation ($\mathcal{L}_\text{adv}$); REPA replaced by adversarial feature supervision | Distillation internalizes multi-step trajectory; single-pass quality without score accumulation |
| Time Sampling | Logit-normal $t \sim \sigma(\mathcal{N}(\mu_t, \sigma_t^2))$ + importance weighting | Adaptive Denoising Paradigm (ADP) → Adaptive Denoising Model (ADM); non-uniform $t$-budget per denoising stage | Compute concentrated at high-uncertainty timesteps; NFE reduced $\sim$20 → 4–8 |
| Data Strategy | Defect-aware filtering + dual-axis TF-IDF concept balancing; bilingual corpus | Knowledge-centric curation: difficulty-rated sub-pipeline for formula/chart/diagram/UI; VLM-recaptioned at scale | Covers knowledge-dense visual domains systematically excluded from aesthetic filtering |
| Post-training Scope | T2I only: CT → SFT → RLHF → PE (four sequential stages) | Joint T2I + editing + multi-image: same CT→SFT→RLHF→PE stages applied over $\mathcal{T}$ | Multi-task reward prevents task forgetting; unified RLHF over $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c})$ |
| Reward Model | VLM-as-judge, scaled 1B → >20B parameters (GenRM); aesthetic + alignment scores | Continued VLM reward scaling + task-consistency reward $R_\text{consist}$; physical-plausibility judge | Scaling law for reward quality confirmed; consistency reward formalizes reference-conditioned ($N{>}0$) tasks |
| PE Module | Rule-based prompt expansion heuristics | End-to-end VLM (Seed1.5-VL) with AdaCoT: adaptive chain-of-thought budget $B(\mathbf{c}) \propto \text{complexity}(\mathbf{c})$ | Semantic reasoning depth scales with prompt complexity; implicit constraint inference before generation |
| Conditioning | Text via dual encoder; no visual conditioning in base model | Text + $N$ reference image tokens; task embedding $\mathbf{e}_\tau$ via AdaLN; structured attention mask $\mathbf{A}$ | Task routing without parameter duplication; zero ControlNet overhead |
| Positional Encoding | 2D RoPE $(r, c)$ per image | 3D RoPE $(i_\text{img}, r, c)$; image-index axis separates reference from output tokens | Multi-image sequences positionally coherent without learned index embeddings |
| Acceleration | Consistent noise expectation + importance timestep sampling | ADM distillation + 4/8-bit quantization (W4A8) + speculative decoding for PE VLM | Hardware-software co-design; quantization applies to both DiT and VLM jointly |
| Inference Speed | $\sim$3.0 s @ 1K resolution (no PE overhead) | $\sim$1.4 s @ 2K resolution (no PE overhead) | $\sim$8.5× effective efficiency gain ($4\times$ pixel count, $\sim$0.47× latency) |
| Max Resolution | 2K native (2048 px long edge) | 4K native (4096 px long edge) | $4\times$ pixel count at comparable latency |
| Task Scope | T2I only ($N{=}0, M{=}1$) | T2I + editing + style/ID/IP + control signals + multi-image composition + storyboard ($N{\geq}0, M{\geq}1$) | $\lvert\mathcal{T}\rvert: 2 \to 10{+}$ from single $\theta$ |
| KV Caching | None | Reference tokens $(\mathbf{K}^{(x)}_i, \mathbf{V}^{(x)}_i)$ cached across all $T$ denoising steps | $O(T \cdot n_\text{ref}) \to O(n_\text{ref})$ attention compute for fixed references |
| Evaluation Suite | Bench-377 + text rendering metrics; human A/B vs. SDXL/FLUX/MJ | MagicBench 4.0 + DreamEval (VQA-based, GPT-4o judge); multi-task consistency metrics | Richer, reproducible, task-decomposed evaluation; reduces VLM-judge circularity |
| Catastrophic Forgetting | N/A (single task) | Task embeddings + phased curriculum (T2I → T2I+edit → full $\mathcal{T}$) + task-floor sampling | Gradient interference bounded by soft routing; base T2I quality preserved |
Seedream 3.0 to 4.0: Architectural and Algorithmic Foundations of Unified Multimodal Image Generation
We characterize the mathematical and systems-level advances that distinguish Seedream 4.0 from its predecessor, Seedream 3.0, through the lens of generative modeling theory and large-scale systems design. Seedream 3.0 defines a specialist model over $p(\mathbf{y} \mid \mathbf{c}_\text{text}; \theta_{3.0})$, optimized via flow matching with logit-normal time sampling — $t \sim \sigma(\mathcal{N}(\mu_t, \sigma_t^2))$, $\sigma(u) = (1+e^{-u})^{-1}$ — and a representation alignment auxiliary loss $\mathcal{L}_\text{REPA} = \lambda \cdot \lVert f_\theta(\mathbf{z}_t) - \text{sg}[\text{DINOv2}(\mathbf{x})] \rVert^2$ that aligns intermediate DiT features with a frozen vision encoder. Text-to-image generation and image editing are handled by disjoint model families, precluding joint optimization and imposing combinatorial maintenance overhead as the task set grows. Seedream 4.0 replaces this fragmented paradigm with a single unified conditional distribution $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c}; \theta_{4.0})$, where $\mathbf{x}_{1:N}$ are $N \geq 0$ reference images (visual controls, style exemplars, editing sources, or geometric signal maps), $\mathbf{c} = (\mathbf{c}_\text{text}, \hat{\mathbf{c}}_\text{VLM})$ is a prompt augmented by an adaptive chain-of-thought expansion with compute budget $B(\mathbf{c}) \propto \text{complexity}(\mathbf{c})$, and $\mathbf{y}_{1:M}$ are $M \geq 1$ output images. This single parameterization spans a task space $|\mathcal{T}| \geq 10$ — including text-to-image, single-image editing, multi-reference composition, storyboard generation, style/identity transfer, and native geometric control — without per-task parameter duplication.
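The logit-normal time sampler is simple to realize in code. A minimal sketch follows; the default $\mu_t, \sigma_t$ values are illustrative placeholders, not the values Seedream actually trains with:

```python
import numpy as np

def sample_logit_normal_t(n, mu=0.0, sigma=1.0, rng=None):
    """Draw flow-matching timesteps t = sigmoid(u) with u ~ N(mu, sigma^2).

    Probability mass concentrates around sigmoid(mu), so mu and sigma
    control which part of the noise schedule receives the most training
    signal. Defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(mu, sigma, size=n)
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid, values strictly in (0, 1)
```

Histogramming a large batch of samples shows the characteristic bell over $(0, 1)$ centered at $\sigma(\mu_t)$, with the tails near $t=0$ and $t=1$ sampled rarely.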
The architectural machinery enabling this unification is a structured extension of the Diffusion Transformer. Reference and output image latents are jointly patchified and concatenated with text tokens into a single sequence, $\mathbf{S} = [\mathbf{P}^{(x)}_{1:N};\, \mathbf{P}^{(y)}_{1:M};\, \mathbf{e}_\text{text}]$, processed by a single DiT equipped with a block-sparse attention mask $\mathbf{A}$ enforcing the constraint that reference tokens are read-only (output tokens attend to references, but references cannot attend to outputs, preventing denoising-trajectory corruption of the conditioning context). Positional encoding is extended from 2D RoPE to a factored 3D RoPE over axes $(i_\text{img}, r, c)$, enabling coherent spatial encoding across arbitrarily many images without learned index embeddings. Task identity is injected via a task embedding $\mathbf{e}_\tau$ in AdaLN at each DiT block — $\text{AdaLN}(\mathbf{h}, \mathbf{e}_\tau) = \boldsymbol{\gamma}(\mathbf{e}_\tau) \odot \frac{\mathbf{h} - \mu}{\sigma} + \boldsymbol{\beta}(\mathbf{e}_\tau)$ — providing soft task routing that prevents catastrophic interference across tasks whose required feature extraction strategies are diametrically opposed (e.g., style transfer, which discards semantics, vs. identity preservation, which retains them). Reference-token KV states are cached across all $T$ denoising steps, reducing reference-attention cost from $O(T \cdot n_\text{ref})$ to $O(n_\text{ref})$ per layer.
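The read-only-reference constraint on $\mathbf{A}$ can be made concrete with a small sketch. The sequence layout and function name here are illustrative assumptions, not the production implementation:

```python
import numpy as np

def build_attention_mask(n_ref, n_out, n_text):
    """Boolean mask A[i, j] = True means token i may attend to token j.

    Assumed (illustrative) sequence layout:
    [reference tokens | output tokens | text tokens].
    References are read-only: output tokens attend to everything, but
    reference tokens never attend to the noisy output tokens, so the
    conditioning context is not corrupted by the denoising trajectory.
    """
    n = n_ref + n_out + n_text
    A = np.ones((n, n), dtype=bool)
    ref = slice(0, n_ref)
    out = slice(n_ref, n_ref + n_out)
    A[ref, out] = False  # the only blocked direction: ref -> out
    return A
```

In practice such a mask is applied additively (blocked entries set to $-\infty$ before the softmax), and the reference rows/columns are the ones whose KV states can be computed once and cached across all $T$ denoising steps.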
Training advances operate at three levels. At the data level, any image in the training corpus can generate training tuples for multiple tasks $\tau$ via deterministic preprocessing (Canny extraction, depth estimation, identity masking), multiplying effective data scale by $|\mathcal{T}|$ without new collection. A knowledge-centric curation sub-pipeline — difficulty-rated across formula, chart, diagram, and UI domains — systematically covers the visual knowledge space excluded by aesthetic filtering alone, a category where competing T2I systems exhibit systematic failure modes. At the objective level, the REPA auxiliary loss is replaced by adversarial distillation $\mathcal{L}_\text{adv}$, which internalizes the multi-step score trajectory into the model's single-pass output, reducing inference NFE from $\sim$20–30 (3.0) to 4–8 (4.0). The reward model in post-training is extended with a task-consistency term $R_\text{consist}(\mathbf{y}, \mathbf{x}_{1:N}) = \sum_i [\lambda_\text{DINO} \cdot \text{DINO}(\mathbf{y}, \mathbf{x}_i) + \lambda_\text{ctrl} \cdot F_1(\hat{\mathcal{E}}(\mathbf{y}), \mathcal{E}(\mathbf{x}_i))]$ and a VLM physical-plausibility judge, an early instance of process-style reward modeling applied to image generation. A phased training curriculum — T2I only, then T2I + editing, then full $\mathcal{T}$, with task-floor sampling $p(\tau) \geq \epsilon$ — bounds gradient interference and preserves base generation quality.
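To make the task-consistency term concrete, here is a hedged sketch: cosine similarity stands in for the DINO feature score and an F1 overlap on binary edge maps stands in for the control-fidelity term $F_1(\hat{\mathcal{E}}(\mathbf{y}), \mathcal{E}(\mathbf{x}_i))$. The weights and extractors are placeholders, not the production reward model:

```python
import numpy as np

def consistency_reward(y_feat, ref_feats, y_edges, ref_edges,
                       lam_dino=0.5, lam_ctrl=0.5):
    """Sketch of R_consist: feature similarity plus control-map overlap,
    summed over references. Inputs are precomputed global feature
    vectors and boolean edge maps; extractors are assumed external.
    """
    def cos(a, b):  # cosine similarity between feature vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def f1(pred, gt):  # F1 overlap between boolean control maps
        tp = np.logical_and(pred, gt).sum()
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(gt.sum(), 1)
        return 2 * prec * rec / max(prec + rec, 1e-8)

    return sum(lam_dino * cos(y_feat, rf) + lam_ctrl * f1(y_edges, re)
               for rf, re in zip(ref_feats, ref_edges))
```

With an output identical to its single reference (perfect feature match and edge overlap), the reward reaches its per-reference maximum $\lambda_\text{DINO} + \lambda_\text{ctrl}$.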
At the inference systems level, Seedream 4.0 realizes a holistic hardware-software co-design not present in 3.0. The Adaptive Denoising Model (ADM) allocates the NFE budget non-uniformly across timesteps, concentrating evaluations at high-uncertainty regions of the denoising trajectory (mid-$t$, where $\nabla_\mathbf{z} \log p_t$ has highest curvature) and coarsening steps near $t \in \{0, 1\}$, where the velocity field is approximately linear. Combined with 4/8-bit quantization (W4A8) applied jointly to the DiT backbone and the Seed1.5-VL prompt expansion module, and speculative decoding for the autoregressive VLM component, the system achieves $\sim$1.4 s per image at 2K resolution — compared to $\sim$3.0 s at 1K in 3.0 — an effective efficiency gain of roughly $8.5\times$ once the $4\times$ pixel-count increase is accounted for. Native 4K generation becomes tractable via the high-compression-ratio VAE and block-sparse attention, whereas 3.0's architectural assumptions made 4K generation prohibitively expensive. The evaluation suite is correspondingly updated: MagicBench 4.0 and DreamEval introduce VQA-based, task-decomposed metrics with GPT-4o judging, replacing the Bench-377 and text-rendering metrics of 3.0 with a richer protocol that measures the full $|\mathcal{T}|$ task space and reduces judge-model circularity through rubric-constrained scoring.
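One way to visualize non-uniform NFE allocation is to warp a uniform step grid so that evaluation points cluster at mid-trajectory $t$. The power-law warp below is purely illustrative — the actual ADM allocation rule is not specified here:

```python
import numpy as np

def allocate_timesteps(nfe, p=2.0):
    """Place `nfe` evaluation points in (0, 1), densest near t = 0.5.

    A uniform grid u is warped through t = 0.5 + 0.5*sign(s)*|s|**p with
    s = 2u - 1: for p > 1 the spacing shrinks at mid-trajectory (where
    the score field curves most) and grows near t = 0 and t = 1 (where
    the velocity field is roughly linear), mimicking the kind of budget
    concentration ADM performs at high-uncertainty timesteps.
    """
    u = (np.arange(nfe) + 0.5) / nfe   # uniform midpoints in (0, 1)
    s = 2.0 * u - 1.0                  # recentred to (-1, 1)
    return 0.5 + 0.5 * np.sign(s) * np.abs(s) ** p
```

With `nfe=8` and `p=2.0`, the two central steps land about 0.016 apart while the outermost pair are about 0.19 apart: roughly an order of magnitude denser sampling at mid-trajectory for the same total budget.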
These advances collectively demonstrate that a single DiT, conditioned via structured token sequences, attention masking, and task embeddings, can learn the combinatorially large distribution $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c})$ without architectural fragmentation or catastrophic forgetting — the representational burden is absorbed by conditioning structure rather than parameter multiplication. Two open research questions follow directly. First, the structured attention mask $\mathbf{A}$ is currently hand-designed per task type: can $\mathbf{A}$ itself be learned end-to-end, perhaps via differentiable sparse attention (Correia et al., 2019) or latent graph structure inference, such that the model discovers optimal inter-image information routing without a human-specified task taxonomy? A learned $\mathbf{A}$ would generalize to novel task types at inference time without retraining, but requires differentiating through a discrete combinatorial object. Second, the AdaCoT prompt expansion module allocates reasoning budget $B(\mathbf{c})$ proportional to prompt complexity, but the mapping $\text{complexity}(\mathbf{c})$ is itself a learned heuristic: what is the optimal policy for reasoning-budget allocation in a joint image generation + VLM system, and does it correspond to an interpretable quantity such as the conditional entropy $\mathcal{H}(\mathbf{y} \mid \mathbf{c})$ or the expected KL divergence between the expanded and unexpanded conditional distributions, $\mathbb{E}[D_\text{KL}(p(\mathbf{y} \mid \hat{\mathbf{c}}) \,\|\, p(\mathbf{y} \mid \mathbf{c}_\text{raw}))]$? Answering this would ground adaptive reasoning allocation in information-theoretic foundations and potentially enable compute-optimal test-time scaling for generative vision models.