Qualitative Results

Diffusion models have emerged as the mainstream approach for visual generation, but they typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference, and personalization were quickly adopted by the community, yet training these models in the first place remains very costly. While several recent approaches, including masking, distillation, and architectural modifications, have been proposed to improve training efficiency, each comes with a tradeoff: enhanced performance at the expense of increased computational cost, or vice versa. In contrast, this work aims to improve both training efficiency and generative performance at the same time through routes that act as a transport mechanism, carrying randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to common transformer-based models; it also applies to state-space models, and it does so without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark for class-conditional synthesis. Both benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to DiT's best benchmark performance at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, improving upon DiT without architectural changes.
Residual-based architectures, including transformers, can interpret the outputs of preceding layers. We demonstrate this property in the figure above, which shows the cosine similarity between the outputs of all layers of a trained network. This characteristic can be leveraged either to cache layer outputs from previous timesteps for improved inference or to route tokens from one layer to deeper layers, as proposed in TREAD.
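To make the measurement concrete, the following is a minimal sketch of how such a layer-similarity map could be computed in PyTorch, assuming a model that is callable as model(x), exposes its transformer blocks as model.blocks, and whose blocks return plain tensors of shape (batch, tokens, dim); the function name and attribute access are illustrative, not taken from the TREAD codebase.

import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_similarity(model, x):
    # Collect every block output with forward hooks during a single pass.
    outputs = []
    hooks = [blk.register_forward_hook(lambda m, inp, out: outputs.append(out))
             for blk in model.blocks]
    model(x)
    for h in hooks:
        h.remove()

    # Flatten each block output and compare all pairs via cosine similarity.
    flat = [o.flatten(1) for o in outputs]          # each (batch, tokens*dim)
    L = len(flat)
    sim = torch.zeros(L, L)
    for i in range(L):
        for j in range(L):
            sim[i, j] = F.cosine_similarity(flat[i], flat[j], dim=1).mean()
    return sim                                      # (L, L) layer-similarity matrix

A high similarity between early and late blocks is exactly the property that makes it viable to let a subset of tokens skip the intermediate blocks.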
During each training step, a randomly chosen subset of tokens is “teleported” from an early layer i directly to a deeper layer j (a route i → j). Tokens that take this shortcut skip the self-attention and feed-forward modules of all layers between i and j, yielding lower FLOPs and higher training throughput. Because the routed tokens re-enter the network at layer j, no information is lost, and early blocks receive a deep-supervision signal when their activations are compared against late-stage objectives. In practice, routes that begin just after the start (i ≈ 2), span most of the depth (j ≈ B - 4, where B is the number of blocks), and route 50% of the tokens strike the best balance between speed-up and sample quality.
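As a rough illustration of this routing, here is a hypothetical PyTorch sketch (class and argument names are our own, not the official implementation): a random subset of tokens is set aside at layer i, only the remaining tokens pass through the intermediate blocks, and the routed tokens are re-inserted at their original positions before layer j. Routing is active only in training mode; at inference every token traverses every block.

import torch
import torch.nn as nn

class RoutedTransformer(nn.Module):
    # Hypothetical wrapper applying a single route i -> j during training.
    def __init__(self, blocks: nn.ModuleList, start_layer: int = 2,
                 end_layer: int = -4, route_ratio: float = 0.5):
        super().__init__()
        self.blocks = blocks
        self.start_layer = start_layer
        # Negative indices count from the end of the stack (e.g. j = B - 4).
        self.end_layer = end_layer % len(blocks)
        self.route_ratio = route_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                     # (batch, num_tokens, dim)
        routed, keep_idx, route_idx = None, None, None

        for layer_idx, block in enumerate(self.blocks):
            if self.training and layer_idx == self.start_layer:
                # Randomly split the tokens into a processed set and a routed set.
                perm = torch.randperm(N, device=x.device)
                n_routed = int(N * self.route_ratio)
                route_idx, keep_idx = perm[:n_routed], perm[n_routed:]
                routed = x[:, route_idx]      # saved as-is; skips blocks i..j-1
                x = x[:, keep_idx]            # only these tokens are processed

            if self.training and layer_idx == self.end_layer and routed is not None:
                # Re-insert the routed tokens at their original positions before block j.
                merged = torch.empty(B, N, D, device=x.device, dtype=x.dtype)
                merged[:, keep_idx] = x
                merged[:, route_idx] = routed
                x, routed = merged, None

            x = block(x)
        return x

Because the split and merge only remove and restore token positions, the wrapper adds no parameters, which is consistent with the claim that TREAD requires no architectural modifications.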
We report FID scores across different architectures trained for 400K iterations on ImageNet-256. Our method consistently improves FID across all backbone sizes.
On ImageNet-256 (class-conditional), DiT-XL/2 + TREAD reaches an unguided FID of 3.93 after 400K updates, which is 14x faster than the baseline and 37x faster than DiT's published best checkpoint at 7M steps. Finetuning without routing then achieves a final guided FID of 2.09. The same routing idea also transfers to state-space models (e.g., RWKV), to 512² resolution, and to the text-to-image benchmark MS-COCO, and we find that it stacks with representation-distillation methods such as REPA.
Furthermore, we test scaling behavior with increasing layer count (from a DiT-B/2 with 12 layers to a DiT-XL/2 with 28 layers). We find that absolute placement (i = 2, j = B - 4) works best for all tested models, while relative scaling of the route endpoints shows little performance gain. From this we conclude that route placement is crucial but is straightforward to set with the guidelines above, as illustrated in the small sketch below.
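For concreteness, a tiny illustrative helper (our own naming) shows what the absolute placement rule i = 2, j = B - 4 yields for the tested depths:

def absolute_route(num_blocks: int) -> tuple[int, int]:
    # Absolute placement: fixed offsets from both ends of the block stack.
    return 2, num_blocks - 4

for name, depth in [("DiT-B/2", 12), ("DiT-XL/2", 28)]:
    i, j = absolute_route(depth)
    print(f"{name}: route layer {i} -> layer {j} (of {depth} blocks)")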
@article{krause2025tread,
  title={TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training},
  author={Krause, Felix and Phan, Timy and Gui, Ming and Baumann, Stefan Andreas and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  journal={arXiv preprint arXiv:2501.04765},
  year={2025}
}