Diffusion models deliver high-quality image synthesis but remain expensive. Token sparsity methods reduce costs by processing only a subset of tokens (masking or routing), improving training throughput and often strengthening the conditional model. However, in practice these models struggle at inference because they respond poorly to Classifier-free Guidance (CFG), limiting achievable fidelity and slowing adoption.
We propose Sparse Guidance (SG): instead of using conditional dropout as the guidance signal, SG uses token-level sparsity to create a controllable capacity gap between two conditional predictions. This preserves the variance of the conditional prediction (reducing CFG-style collapse) while improving fidelity, and it lets us trade compute for quality at test time.
Token-sparse training can yield strong conditionals, but CFG relies on an unconditional branch that behaves differently from the conditional prediction. Under token sparsity, this relationship often breaks: CFG gains shrink, and sometimes samples become unstable or collapse.
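For reference, standard CFG combines the conditional prediction with an unconditional one obtained via conditional dropout:

$$D^{\text{CFG}}_{\theta}(c, \omega) = \omega\, D_{\theta}(x_t, t, c) + (1-\omega)\, D_{\theta}(x_t, t, \varnothing),$$

where $\varnothing$ denotes the null condition. It is this $\varnothing$ branch whose behavior shifts under token-sparse training, which is why the CFG gap degrades.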
We study Sparse Guidance on two common token-sparsity mechanisms. Masking permanently drops a fraction of tokens (optionally replacing them with a learned “mask” embedding). Routing skips computation for selected tokens in a subset of layers and later reinserts them unchanged, preserving instance-specific information.
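To make the distinction concrete, here is a minimal sketch (not the authors' code) of the two mechanisms on a DiT-style token sequence of shape (batch, tokens, dim); the function names and the random token selection are illustrative assumptions.

```python
import torch

def apply_masking(tokens: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Masking: permanently drop a fraction of tokens; only the kept tokens
    are processed downstream (the dropped ones could instead be replaced by
    a learned mask embedding)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - sparsity)))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

def routed_block(tokens: torch.Tensor, block, sparsity: float) -> torch.Tensor:
    """Routing: skip computation for a fraction of tokens in this block and
    reinsert them unchanged afterwards, preserving instance-specific info."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - sparsity)))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    out = tokens.clone()  # skipped tokens pass through this block unchanged
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), block(kept))
    return out
```

Here `block` stands for any per-token transformer block, e.g. a `torch.nn.TransformerEncoderLayer` constructed with `batch_first=True`.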
Sparse Guidance revisits sparsity as a test-time control signal. We evaluate the same network under two different sparsity rates: a low-sparsity strong branch and a high-sparsity weak branch. Both predictions are conditional, and the guidance signal comes purely from the capacity gap induced by sparsity.
$$D^{\text{strong}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{strong}}), \quad
D^{\text{weak}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{weak}}), \quad
0 \le \gamma_{\text{strong}} < \gamma_{\text{weak}} < 1.$$
$$D^{\text{SG}}_{\theta}(c, \gamma_{\text{strong}}, \gamma_{\text{weak}}, \omega)
= \omega\, D^{\text{strong}}_{\theta}(c) + (1-\omega)\, D^{\text{weak}}_{\theta}(c).$$
Intuitively: increasing sparsity lowers effective capacity and softens the conditional distribution; decreasing sparsity yields a sharper, higher-capacity predictor. SG uses the weak branch to steer the strong one, while staying closer to the conditional prediction than CFG typically does.
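A minimal sketch of this combination at one sampling step, assuming (hypothetically) a denoiser `D` that exposes the token-sparsity rate as a keyword argument; the signature and default values are illustrative, not the authors' API.

```python
import torch

def sparse_guidance_step(D, x_t, t, c, gamma_strong=0.1, gamma_weak=0.7, omega=2.0):
    """Combine two conditional predictions of the same network, run at two
    sparsity rates, into one guided prediction:
        D_SG = omega * D_strong + (1 - omega) * D_weak
    With omega > 1 this extrapolates away from the weak (high-sparsity) branch."""
    d_strong = D(x_t, t, c, gamma=gamma_strong)  # low sparsity: higher-capacity branch
    d_weak   = D(x_t, t, c, gamma=gamma_weak)    # high sparsity: softer, weaker branch
    return omega * d_strong + (1.0 - omega) * d_weak
```

Both calls use the same weights and the same condition `c`; only the sparsity rate differs, so the guidance signal is exactly the capacity gap described above.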
SG provides a simple knob for trading quality against speed. In our ImageNet-256 experiments, a quality-oriented configuration (SGFID) reaches 1.58 FID, while a fast configuration (SGFLOPS) delivers large compute savings at competitive fidelity.
SG is well-behaved across a wide range of guidance scales $\omega$ when the sparsity pair $(\gamma_{\text{strong}}, \gamma_{\text{weak}})$ is tuned accordingly. At larger $\omega$, SG consistently tolerates higher total sparsity, enabling higher throughput at matched quality.
We apply SG to a 2.5B text-to-image Diffusion Transformer trained with routing sparsity. SG improves human preference (HPSv3) across categories and increases throughput (0.32 → 0.49 images/s on an H200 GPU).
@misc{krause2026guidingtokensparsediffusionmodels,
  title={Guiding Token-Sparse Diffusion Models},
  author={Felix Krause and Stefan Andreas Baumann and Johannes Schusterbauer and Olga Grebenkova and Ming Gui and Vincent Tao Hu and Björn Ommer},
  year={2026},
  eprint={2601.01608},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.01608},
}