| Dimension | Seedream 3.0 | Seedream 4.0 | Key Advance |
|---|---|---|---|
| Backbone | MM-DiT (joint text-image attention, fixed-architecture) | Scalable DiT with structured block-sparse attention and 3D RoPE | Architecture complexity decoupled from task count; $O(n^2) \to O(n_\text{out}^2 + n_\text{out} n_\text{ref})$ |
| VAE | 16-ch, $f=8$ (spatial compression); standard reconstruction loss | High-compression-ratio VAE (increased $f$ or adaptive spatial pooling); fewer latent tokens per image | Sequence length $n \propto 1/f^2$; faster attention at fixed quality |
| Training Objective | Flow matching $\mathcal{L}_\text{FM}$ + REPA (representation alignment, $\lambda{=}0.5$, DINOv2-L target) | $\mathcal{L}_\text{FM}$ + adversarial distillation ($\mathcal{L}_\text{adv}$); REPA replaced by adversarial feature supervision | Distillation internalizes multi-step trajectory; single-pass quality without score accumulation |
| Time Sampling | Logit-normal $t \sim \sigma(\mathcal{N}(\mu_t, \sigma_t^2))$ + importance weighting | Adaptive Denoising Paradigm (ADP) → Adaptive Denoising Model (ADM); non-uniform $t$-budget per denoising stage | Compute concentrated at high-uncertainty timesteps; NFE reduced $\sim$20 → 4–8 |
| Data Strategy | Defect-aware filtering + dual-axis TF-IDF concept balancing; bilingual corpus | Knowledge-centric curation: difficulty-rated sub-pipeline for formula/chart/diagram/UI; VLM-recaptioned at scale | Covers knowledge-dense visual domains systematically excluded from aesthetic filtering |
| Post-training Scope | T2I only: CT → SFT → RLHF → PE (four sequential stages) | Joint T2I + editing + multi-image: same CT→SFT→RLHF→PE stages applied over $\mathcal{T}$ | Multi-task reward prevents task forgetting; unified RLHF over $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c})$ |
| Reward Model | VLM-as-judge, scaled 1B → >20B parameters (GenRM); aesthetic + alignment scores | Continued VLM reward scaling + task-consistency reward $R_\text{consist}$; physical-plausibility judge | Scaling law for reward quality confirmed; consistency reward formalizes reference-conditioned ($N{>}0$) tasks |
| PE Module | Rule-based prompt expansion heuristics | End-to-end VLM (Seed1.5-VL) with AdaCoT: adaptive chain-of-thought budget $B(\mathbf{c}) \propto \text{complexity}(\mathbf{c})$ | Semantic reasoning depth scales with prompt complexity; implicit constraint inference before generation |
| Conditioning | Text via dual encoder; no visual conditioning in base model | Text + $N$ reference image tokens; task embedding $\mathbf{e}_\tau$ via AdaLN; structured attention mask $\mathbf{A}$ | Task routing without parameter duplication; zero ControlNet overhead |
| Positional Encoding | 2D RoPE $(r, c)$ per image | 3D RoPE $(i_\text{img}, r, c)$; image-index axis separates reference from output tokens | Multi-image sequences positionally coherent without learned index embeddings |
| Acceleration | Consistent noise expectation + importance timestep sampling | ADM distillation + 4/8-bit quantization (W4A8) + speculative decoding for PE VLM | Hardware-software co-design; quantization applies to both DiT and VLM jointly |
| Inference Speed | $\sim$3.0 s @ 1K resolution (no PE overhead) | $\sim$1.4 s @ 2K resolution (no PE overhead) | $\sim$8.5× effective efficiency gain ($4\times$ pixel count, $\sim$0.47× latency) |
| Max Resolution | 2K native (2048 px long edge) | 4K native (4096 px long edge) | $4\times$ pixel count at comparable latency |
| Task Scope | T2I only ($N{=}0, M{=}1$) | T2I + editing + style/ID/IP + control signals + multi-image composition + storyboard ($N{\geq}0, M{\geq}1$) | $\lvert\mathcal{T}\rvert: 2 \to 10{+}$ from single $\theta$ |
| KV Caching | None | Reference tokens $(\mathbf{K}^{(x)}_i, \mathbf{V}^{(x)}_i)$ cached across all $T$ denoising steps | $O(T \cdot n_\text{ref}) \to O(n_\text{ref})$ attention compute for fixed references |
| Evaluation Suite | Bench-377 + text rendering metrics; human A/B vs. SDXL/FLUX/MJ | MagicBench 4.0 + DreamEval (VQA-based, GPT-4o judge); multi-task consistency metrics | Richer, reproducible, task-decomposed evaluation; reduces VLM-judge circularity |
| Catastrophic Forgetting | N/A (single task) | Task embeddings + phased curriculum (T2I → T2I+edit → full $\mathcal{T}$) + task-floor sampling | Gradient interference bounded by soft routing; base T2I quality preserved |
Seedream 3.0 to 4.0: Architectural and Algorithmic Foundations of Unified Multimodal Image Generation
We characterize the mathematical and systems-level advances that distinguish Seedream 4.0 from its predecessor, Seedream 3.0, through the lens of generative modeling theory and large-scale systems design. Seedream 3.0 defines a specialist model over $p(\mathbf{y} \mid \mathbf{c}_\text{text}; \theta_{3.0})$, optimized via flow matching with logit-normal time sampling — $t \sim \sigma(\mathcal{N}(\mu_t, \sigma_t^2))$, $\sigma(u) = (1+e^{-u})^{-1}$ — and a representation alignment auxiliary loss $\mathcal{L}_\text{REPA} = \lambda \cdot \lVert f_\theta(\mathbf{z}_t) - \text{sg}[\text{DINOv2}(\mathbf{x})] \rVert^2$ that aligns intermediate DiT features with a frozen vision encoder. Text-to-image generation and image editing are handled by disjoint model families, precluding joint optimization and imposing combinatorial maintenance overhead as the task set grows. Seedream 4.0 replaces this fragmented paradigm with a single unified conditional distribution $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c}; \theta_{4.0})$, where $\mathbf{x}_{1:N}$ are $N \geq 0$ reference images (visual controls, style exemplars, editing sources, or geometric signal maps), $\mathbf{c} = (\mathbf{c}_\text{text}, \hat{\mathbf{c}}_\text{VLM})$ is a prompt augmented by an adaptive chain-of-thought expansion with compute budget $B(\mathbf{c}) \propto \text{complexity}(\mathbf{c})$, and $\mathbf{y}_{1:M}$ are $M \geq 1$ output images. This single parameterization spans a task space $|\mathcal{T}| \geq 10$ — including text-to-image, single-image editing, multi-reference composition, storyboard generation, style/identity transfer, and native geometric control — without per-task parameter duplication.
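The logit-normal time sampler is simple to realize in code. A minimal sketch follows; the default $\mu_t, \sigma_t$ values are illustrative placeholders, not the values Seedream actually trains with:

```python
import numpy as np

def sample_logit_normal_t(n, mu=0.0, sigma=1.0, rng=None):
    """Draw flow-matching timesteps t = sigmoid(u) with u ~ N(mu, sigma^2).

    Probability mass concentrates around sigmoid(mu), so mu and sigma
    control which part of the noise schedule receives the most training
    signal. Defaults here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(mu, sigma, size=n)
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid, values strictly in (0, 1)
```

Histogramming a large batch of samples shows the characteristic bell over $(0, 1)$ centered at $\sigma(\mu_t)$, with the tails near $t=0$ and $t=1$ sampled rarely.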
The architectural machinery enabling this unification is a structured extension of the Diffusion Transformer. Reference and output image latents are jointly patchified and concatenated with text tokens into a single sequence, $\mathbf{S} = [\mathbf{P}^{(x)}_{1:N};\, \mathbf{P}^{(y)}_{1:M};\, \mathbf{e}_\text{text}]$, processed by a single DiT equipped with a block-sparse attention mask $\mathbf{A}$ enforcing the constraint that reference tokens are read-only (output tokens attend to references, but references cannot attend to outputs, preventing denoising-trajectory corruption of the conditioning context). Positional encoding is extended from 2D RoPE to a factored 3D RoPE over axes $(i_\text{img}, r, c)$, enabling coherent spatial encoding across arbitrarily many images without learned index embeddings. Task identity is injected via a task embedding $\mathbf{e}_\tau$ in AdaLN at each DiT block — $\text{AdaLN}(\mathbf{h}, \mathbf{e}_\tau) = \boldsymbol{\gamma}(\mathbf{e}_\tau) \odot \frac{\mathbf{h} - \mu}{\sigma} + \boldsymbol{\beta}(\mathbf{e}_\tau)$ — providing soft task routing that prevents catastrophic interference across tasks whose required feature extraction strategies are diametrically opposed (e.g., style transfer, which discards semantics, vs. identity preservation, which retains them). Reference-token KV states are cached across all $T$ denoising steps, reducing reference-attention cost from $O(T \cdot n_\text{ref})$ to $O(n_\text{ref})$ per layer.
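The read-only-reference constraint on $\mathbf{A}$ can be made concrete with a small sketch. The sequence layout and function name here are illustrative assumptions, not the production implementation:

```python
import numpy as np

def build_attention_mask(n_ref, n_out, n_text):
    """Boolean mask A[i, j] = True means token i may attend to token j.

    Assumed (illustrative) sequence layout:
    [reference tokens | output tokens | text tokens].
    References are read-only: output tokens attend to everything, but
    reference tokens never attend to the noisy output tokens, so the
    conditioning context is not corrupted by the denoising trajectory.
    """
    n = n_ref + n_out + n_text
    A = np.ones((n, n), dtype=bool)
    ref = slice(0, n_ref)
    out = slice(n_ref, n_ref + n_out)
    A[ref, out] = False  # the only blocked direction: ref -> out
    return A
```

In practice such a mask is applied additively (blocked entries set to $-\infty$ before the softmax), and the reference rows/columns are the ones whose KV states can be computed once and cached across all $T$ denoising steps.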
Training advances operate at three levels. At the data level, any image in the training corpus can generate training tuples for multiple tasks $\tau$ via deterministic preprocessing (Canny extraction, depth estimation, identity masking), multiplying effective data scale by $|\mathcal{T}|$ without new collection. A knowledge-centric curation sub-pipeline — difficulty-rated across formula, chart, diagram, and UI domains — systematically covers the visual knowledge space excluded by aesthetic filtering alone, a category where competing T2I systems exhibit systematic failure modes. At the objective level, the REPA auxiliary loss is replaced by adversarial distillation $\mathcal{L}_\text{adv}$, which internalizes the multi-step score trajectory into the model's single-pass output, reducing inference NFE from $\sim$20–30 (3.0) to 4–8 (4.0). The reward model in post-training is extended with a task-consistency term $R_\text{consist}(\mathbf{y}, \mathbf{x}_{1:N}) = \sum_i [\lambda_\text{DINO} \cdot \text{DINO}(\mathbf{y}, \mathbf{x}_i) + \lambda_\text{ctrl} \cdot F_1(\hat{\mathcal{E}}(\mathbf{y}), \mathcal{E}(\mathbf{x}_i))]$ and a VLM physical-plausibility judge, an early instance of process-style reward modeling applied to image generation. A phased training curriculum — T2I only, then T2I + editing, then full $\mathcal{T}$, with task-floor sampling $p(\tau) \geq \epsilon$ — bounds gradient interference and preserves base generation quality.
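To make the task-consistency term concrete, here is a hedged sketch: cosine similarity stands in for the DINO feature score and an F1 overlap on binary edge maps stands in for the control-fidelity term $F_1(\hat{\mathcal{E}}(\mathbf{y}), \mathcal{E}(\mathbf{x}_i))$. The weights and extractors are placeholders, not the production reward model:

```python
import numpy as np

def consistency_reward(y_feat, ref_feats, y_edges, ref_edges,
                       lam_dino=0.5, lam_ctrl=0.5):
    """Sketch of R_consist: feature similarity plus control-map overlap,
    summed over references. Inputs are precomputed global feature
    vectors and boolean edge maps; extractors are assumed external.
    """
    def cos(a, b):  # cosine similarity between feature vectors
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def f1(pred, gt):  # F1 overlap between boolean control maps
        tp = np.logical_and(pred, gt).sum()
        prec = tp / max(pred.sum(), 1)
        rec = tp / max(gt.sum(), 1)
        return 2 * prec * rec / max(prec + rec, 1e-8)

    return sum(lam_dino * cos(y_feat, rf) + lam_ctrl * f1(y_edges, re)
               for rf, re in zip(ref_feats, ref_edges))
```

With an output identical to its single reference (perfect feature match and edge overlap), the reward reaches its per-reference maximum $\lambda_\text{DINO} + \lambda_\text{ctrl}$.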
At the inference systems level, Seedream 4.0 realizes a holistic hardware-software co-design not present in 3.0. The Adaptive Denoising Model (ADM) allocates the NFE budget non-uniformly across timesteps, concentrating evaluations at high-uncertainty regions of the denoising trajectory (mid-$t$, where $\nabla_\mathbf{z} \log p_t$ has highest curvature) and coarsening steps near $t \in \{0, 1\}$, where the velocity field is approximately linear. Combined with 4/8-bit quantization (W4A8) applied jointly to the DiT backbone and the Seed1.5-VL prompt expansion module, and speculative decoding for the autoregressive VLM component, the system achieves $\sim$1.4 s per image at 2K resolution — compared to $\sim$3.0 s at 1K in 3.0 — an effective efficiency gain of roughly $8.5\times$ once the $4\times$ pixel-count increase is accounted for. Native 4K generation becomes tractable via the high-compression-ratio VAE and block-sparse attention, whereas 3.0's architectural assumptions made 4K generation prohibitively expensive. The evaluation suite is correspondingly updated: MagicBench 4.0 and DreamEval introduce VQA-based, task-decomposed metrics with GPT-4o judging, replacing the Bench-377 and text-rendering metrics of 3.0 with a richer protocol that measures the full $|\mathcal{T}|$ task space and reduces judge-model circularity through rubric-constrained scoring.
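One way to visualize non-uniform NFE allocation is to warp a uniform step grid so that evaluation points cluster at mid-trajectory $t$. The power-law warp below is purely illustrative — the actual ADM allocation rule is not specified here:

```python
import numpy as np

def allocate_timesteps(nfe, p=2.0):
    """Place `nfe` evaluation points in (0, 1), densest near t = 0.5.

    A uniform grid u is warped through t = 0.5 + 0.5*sign(s)*|s|**p with
    s = 2u - 1: for p > 1 the spacing shrinks at mid-trajectory (where
    the score field curves most) and grows near t = 0 and t = 1 (where
    the velocity field is roughly linear), mimicking the kind of budget
    concentration ADM performs at high-uncertainty timesteps.
    """
    u = (np.arange(nfe) + 0.5) / nfe   # uniform midpoints in (0, 1)
    s = 2.0 * u - 1.0                  # recentred to (-1, 1)
    return 0.5 + 0.5 * np.sign(s) * np.abs(s) ** p
```

With `nfe=8` and `p=2.0`, the two central steps land about 0.016 apart while the outermost pair are about 0.19 apart: roughly an order of magnitude denser sampling at mid-trajectory for the same total budget.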
These advances collectively demonstrate that a single DiT, conditioned via structured token sequences, attention masking, and task embeddings, can learn the combinatorially large distribution $p(\mathbf{y}_{1:M} \mid \mathbf{x}_{1:N}, \mathbf{c})$ without architectural fragmentation or catastrophic forgetting — the representational burden is absorbed by conditioning structure rather than parameter multiplication. Two open research questions follow directly. First, the structured attention mask $\mathbf{A}$ is currently hand-designed per task type: can $\mathbf{A}$ itself be learned end-to-end, perhaps via differentiable sparse attention (Correia et al., 2019) or latent graph structure inference, such that the model discovers optimal inter-image information routing without a human-specified task taxonomy? A learned $\mathbf{A}$ would generalize to novel task types at inference time without retraining, but requires differentiating through a discrete combinatorial object. Second, the AdaCoT prompt expansion module allocates reasoning budget $B(\mathbf{c})$ proportional to prompt complexity, but the mapping $\text{complexity}(\mathbf{c})$ is itself a learned heuristic: what is the optimal policy for reasoning-budget allocation in a joint image generation + VLM system, and does it correspond to an interpretable quantity such as the conditional entropy $\mathcal{H}(\mathbf{y} \mid \mathbf{c})$ or the expected KL divergence between the expanded and unexpanded conditional distributions, $\mathbb{E}[D_\text{KL}(p(\mathbf{y} \mid \hat{\mathbf{c}}) \,\|\, p(\mathbf{y} \mid \mathbf{c}_\text{raw}))]$? Answering this would ground adaptive reasoning allocation in information-theoretic foundations and potentially enable compute-optimal test-time scaling for generative vision models.