Diffusion models deliver high-quality image synthesis but remain expensive. Token sparsity methods reduce costs by processing only a subset of tokens (masking or routing), improving training throughput and often strengthening the conditional model. However, in practice these models struggle at inference because they respond poorly to Classifier-free Guidance (CFG), limiting achievable fidelity and slowing adoption.
We propose Sparse Guidance (SG): instead of using conditional dropout as the guidance signal, SG uses token-level sparsity to create a controllable capacity gap between two conditional predictions. This preserves the variance of the conditional prediction (reducing CFG-style collapse) while improving fidelity, and it lets us trade compute for quality at test time.
Token-sparse training can yield strong conditionals, but CFG relies on an unconditional branch that behaves differently from the conditional prediction. Under token sparsity, this relationship often breaks: CFG gains shrink, and sometimes samples become unstable or collapse.
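For reference, standard CFG combines the conditional prediction with an unconditional one obtained via conditional dropout:

$$D^{\text{CFG}}_{\theta}(c, \omega) = \omega\, D_{\theta}(x_t, t, c) + (1-\omega)\, D_{\theta}(x_t, t, \varnothing),$$

where $\varnothing$ denotes the null condition. It is this $\varnothing$ branch whose behavior shifts under token-sparse training, which is why the CFG gap degrades.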
We study Sparse Guidance on two common token-sparsity mechanisms. Masking permanently drops a fraction of tokens (optionally replacing them with a learned “mask” embedding). Routing skips computation for selected tokens in a subset of layers and later reinserts them unchanged, preserving instance-specific information.
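To make the distinction concrete, here is a minimal sketch (not the authors' code) of the two mechanisms on a DiT-style token sequence of shape (batch, tokens, dim); the function names and the random token selection are illustrative assumptions.

```python
import torch

def apply_masking(tokens: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Masking: permanently drop a fraction of tokens; only the kept tokens
    are processed downstream (the dropped ones could instead be replaced by
    a learned mask embedding)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - sparsity)))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

def routed_block(tokens: torch.Tensor, block, sparsity: float) -> torch.Tensor:
    """Routing: skip computation for a fraction of tokens in this block and
    reinsert them unchanged afterwards, preserving instance-specific info."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - sparsity)))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    out = tokens.clone()  # skipped tokens pass through this block unchanged
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), block(kept))
    return out
```

Here `block` stands for any per-token transformer block, e.g. a `torch.nn.TransformerEncoderLayer` constructed with `batch_first=True`.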
Sparse Guidance revisits sparsity as a test-time control signal. We evaluate the same network under two different sparsity rates: a low-sparsity strong branch and a high-sparsity weak branch. Both predictions are conditional, and the guidance signal comes purely from the capacity gap induced by sparsity.
$$D^{\text{strong}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{strong}}), \quad
D^{\text{weak}}_{\theta}(c) := D_{\theta}(x_t, t, c; \gamma_{\text{weak}}), \quad
0 \le \gamma_{\text{strong}} < \gamma_{\text{weak}} < 1.$$
$$D^{\text{SG}}_{\theta}(c, \gamma_{\text{strong}}, \gamma_{\text{weak}}, \omega)
= \omega\, D^{\text{strong}}_{\theta}(c) + (1-\omega)\, D^{\text{weak}}_{\theta}(c).$$
Intuitively: increasing sparsity lowers effective capacity and softens the conditional distribution; decreasing sparsity yields a sharper, higher-capacity predictor. SG uses the weak branch to steer the strong one, while staying closer to the conditional prediction than CFG typically does.
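A minimal sketch of this combination at one sampling step, assuming (hypothetically) a denoiser `D` that exposes the token-sparsity rate as a keyword argument; the signature and default values are illustrative, not the authors' API.

```python
import torch

def sparse_guidance_step(D, x_t, t, c, gamma_strong=0.1, gamma_weak=0.7, omega=2.0):
    """Combine two conditional predictions of the same network, run at two
    sparsity rates, into one guided prediction:
        D_SG = omega * D_strong + (1 - omega) * D_weak
    With omega > 1 this extrapolates away from the weak (high-sparsity) branch."""
    d_strong = D(x_t, t, c, gamma=gamma_strong)  # low sparsity: higher-capacity branch
    d_weak   = D(x_t, t, c, gamma=gamma_weak)    # high sparsity: softer, weaker branch
    return omega * d_strong + (1.0 - omega) * d_weak
```

Both calls use the same weights and the same condition `c`; only the sparsity rate differs, so the guidance signal is exactly the capacity gap described above.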
SG provides a simple knob for trading quality against speed. In our ImageNet-256 experiments, a quality-oriented configuration (SGFID) reaches 1.58 FID, while a fast configuration (SGFLOPS) delivers large compute savings at competitive fidelity.
SG is well-behaved across a wide range of guidance scales $\omega$ when the sparsity pair $(\gamma_{\text{strong}}, \gamma_{\text{weak}})$ is tuned accordingly. At larger $\omega$, SG consistently tolerates higher total sparsity, enabling higher throughput at matched quality.
We apply SG to a 2.5B text-to-image Diffusion Transformer trained with routing sparsity. SG improves human preference (HPSv3) across categories and increases throughput (0.32 → 0.49 images/s on an H200 GPU).
@misc{krause2026guidingtokensparsediffusionmodels,
  title={Guiding Token-Sparse Diffusion Models},
  author={Felix Krause and Stefan Andreas Baumann and Johannes Schusterbauer and Olga Grebenkova and Ming Gui and Vincent Tao Hu and Björn Ommer},
  year={2026},
  eprint={2601.01608},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.01608},
}