Guiding Token-Sparse Diffusion Models

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

CompVis @ LMU Munich
Munich Center for Machine Learning (MCML)
Sparse Guidance overview and headline results.

TL;DR: Token-sparse diffusion models (masking / routing) train fast and can be strong conditionals — but Classifier-free Guidance (CFG) often breaks at inference. We introduce Sparse Guidance (SG), a finetune-free guidance rule that uses token sparsity as the guidance signal: combine a strong low-sparsity conditional prediction with a weak high-sparsity conditional prediction. On ImageNet-256, SG reaches 1.58 FID with 25% fewer FLOPs and enables up to 58% FLOP savings at matched baseline quality. SG also scales to a 2.5B text-to-image model, improving human preference and throughput.



Overview

Diffusion models deliver high-quality image synthesis but remain expensive. Token sparsity methods reduce costs by processing only a subset of tokens (masking or routing), improving training throughput and often strengthening the conditional model. However, in practice these models struggle at inference because they respond poorly to Classifier-free Guidance (CFG), limiting achievable fidelity and slowing adoption.

We propose Sparse Guidance (SG): instead of using conditional dropout as the guidance signal, SG uses token-level sparsity to create a controllable capacity gap between two conditional predictions. This preserves the variance of the conditional prediction (reducing CFG-style collapse) while improving fidelity — and it lets us trade compute for quality at test time.

Motivation: When CFG Fails

Token-sparse training can yield strong conditionals, but CFG relies on an unconditional branch that behaves differently from the conditional prediction. Under token sparsity, this relationship often breaks: CFG gains shrink, and sometimes samples become unstable or collapse.

CFG provides limited benefits for token-sparse diffusion models; Sparse Guidance (SG) restores strong guidance gains.

Token Sparsity: Masking vs Routing

We study Sparse Guidance on two common token-sparsity mechanisms. Masking permanently drops a fraction of tokens (optionally replacing them with a learned “mask” embedding). Routing skips computation for selected tokens in a subset of layers and later reinserts them unchanged, preserving instance-specific information.

Masking drops tokens; routing skips compute and reinserts tokens later.
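
For concreteness, the following is a minimal PyTorch sketch of the two mechanisms on a batch of transformer tokens. It is our own illustration rather than the paper's implementation; `mask_embed` (a learned mask token), `block` (any transformer block), and the function names are stand-ins.

```python
# Illustrative sketch of masking vs. routing on tokens of shape (B, N, D);
# not the paper's code.
import torch

def mask_tokens(x: torch.Tensor, gamma: float, mask_embed: torch.Tensor) -> torch.Tensor:
    """Masking: permanently drop a fraction gamma of tokens by replacing
    them with a learned 'mask' embedding; their content is lost."""
    B, N, _ = x.shape
    n_drop = int(gamma * N)
    drop_idx = torch.rand(B, N).argsort(dim=1)[:, :n_drop]       # random tokens per sample
    x = x.clone()
    x[torch.arange(B).unsqueeze(1), drop_idx] = mask_embed       # overwrite dropped tokens
    return x

def route_tokens(x: torch.Tensor, gamma: float, block) -> torch.Tensor:
    """Routing: skip the block's computation for a fraction gamma of tokens
    and reinsert the skipped tokens unchanged, preserving their content."""
    B, N, _ = x.shape
    n_keep = N - int(gamma * N)
    keep_idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]       # tokens that get processed
    kept = x[torch.arange(B).unsqueeze(1), keep_idx]             # gather processed subset
    out = x.clone()
    out[torch.arange(B).unsqueeze(1), keep_idx] = block(kept)    # reinsert after the block
    return out
```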

Method: Sparse Guidance

Sparse Guidance revisits sparsity as a test-time control signal. We evaluate the same network under two different sparsity rates: a low-sparsity strong branch and a high-sparsity weak branch. Both predictions are conditional, and the guidance signal comes purely from the capacity gap induced by sparsity.

$$D^{\text{strong}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{strong}}), \quad D^{\text{weak}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{weak}}), \quad 0 \le \gamma_{\text{strong}} < \gamma_{\text{weak}} < 1.$$
$$D^{\text{SG}}_{\theta}(c, \gamma_{\text{strong}}, \gamma_{\text{weak}}, \omega) = \omega\, D^{\text{strong}}_{\theta}(c) + (1-\omega)\, D^{\text{weak}}_{\theta}(c).$$

Intuitively: increasing sparsity lowers effective capacity and softens the conditional distribution; decreasing sparsity yields a sharper, higher-capacity predictor. SG uses the weak branch to steer the strong one, while staying closer to the conditional prediction than CFG typically does.
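
In code, one guided prediction requires just two conditional forward passes and the weighted combination from the equation above. A minimal sketch, assuming the denoiser `D_theta` exposes its token-sparsity rate as a `gamma` argument (that interface and the default values are our assumptions, not the released API):

```python
def sparse_guidance(D_theta, x_t, t, c, omega=1.5, gamma_strong=0.4, gamma_weak=0.7):
    """Sparse Guidance: steer a low-sparsity 'strong' conditional prediction
    with a high-sparsity 'weak' one from the same network (same condition c)."""
    d_strong = D_theta(x_t, t, c, gamma=gamma_strong)   # low sparsity, high capacity
    d_weak = D_theta(x_t, t, c, gamma=gamma_weak)       # high sparsity, low capacity
    # omega * strong + (1 - omega) * weak  ==  weak + omega * (strong - weak)
    return omega * d_strong + (1.0 - omega) * d_weak
```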

Naively applying inference-time sparsity (without SG) degrades quality as sparsity increases.

Results

ImageNet-256: Quality–Compute Trade-off

SG provides a simple knob for trading quality against compute. In our ImageNet-256 experiments, the quality-oriented configuration (SGFID) reaches 1.58 FID, while the fast configuration (SGFLOPs) achieves large compute savings while remaining competitive in fidelity.

Left: SG improves both fidelity (FID) and inference compute compared to alternative guidance methods. Right: state-of-the-art comparison on ImageNet-256 (FID / sFID / IS / Precision / Recall).

Robust Hyperparameters & Sparsity Scheduling

SG is well-behaved across a wide range of guidance scales ω when we tune the sparsity pair (γstrong, γweak). Larger ω consistently tolerates higher total sparsity, enabling higher throughput at matched quality.

FID heatmaps over (γstrong, γweak) for ω ∈ {1.3, 1.5, 1.7, 1.9}.

Practical Usage

  1. Pick a clear capacity gap: start with γstrong < γweak (e.g., 0.4–0.6 vs 0.6–0.8).
  2. Increase ω and sparsity together for speed: larger ω supports higher (γstrong, γweak); see the configuration sketch after this list.
  3. Prefer routing when available: it preserves instance information and is less sensitive to hyperparameters.
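
Putting these recommendations together, here is a hypothetical two-preset helper built on the `sparse_guidance` sketch from the method section; the specific (ω, γstrong, γweak) values merely echo the ranges listed above and are not tuned settings from the paper.

```python
def sg_denoise_step(D_theta, x_t, t, c, fast: bool = False):
    """One guided denoising step with Sparse Guidance (illustrative presets only).

    fast=False favors quality (low sparsity, small omega);
    fast=True trades some fidelity for fewer FLOPs by raising both
    sparsity rates together with the guidance scale omega.
    """
    if fast:
        omega, gamma_strong, gamma_weak = 1.9, 0.6, 0.8   # higher sparsity, larger omega
    else:
        omega, gamma_strong, gamma_weak = 1.3, 0.4, 0.6   # lower sparsity, smaller omega
    return sparse_guidance(D_theta, x_t, t, c, omega, gamma_strong, gamma_weak)
```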

Text-to-Image Results

We apply SG to a 2.5B text-to-image Diffusion Transformer trained with routing sparsity. SG improves human preference (HPSv3) across categories and increases throughput (0.32 → 0.49 images/s on an H200 GPU).

HPSv3 scores for TR-DiT-2.5B: SG improves over CFG across categories and increases throughput.
Selected examples: SG keeps more of the conditional structure while staying faithful to the prompt.

BibTeX

@misc{krause2026guidingtokensparsediffusionmodels,
      title={Guiding Token-Sparse Diffusion Models}, 
      author={Felix Krause and Stefan Andreas Baumann and Johannes Schusterbauer and Olga Grebenkova and Ming Gui and Vincent Tao Hu and Björn Ommer},
      year={2026},
      eprint={2601.01608},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.01608}, 
}