TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

ICCV 2025

Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, Björn Ommer

CompVis @ LMU Munich
Munich Center for Machine Learning (MCML)
We propose TREAD, a new method to increase the efficiency of diffusion training by improving both iteration speed and performance at the same time. For this, we use uni-directional token transportation to modulate the information flow in the network.

TL;DR: We propose TREAD, a new method to increase the efficiency of diffusion training by improving both iteration speed and performance at the same time. For this, we use uni-directional token transportation to modulate the information flow in the network; TREAD requires no change to the architecture itself and no additional pretrained models. Using TREAD, we achieve a 37x speedup in training time and show significantly lower FID compared to the unmodified baseline as training continues. Further, we show that TREAD can be applied to a variety of token-based architectures such as state-space models.



Overview

Diffusion models have emerged as the mainstream approach for visual generation. However, these models typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference, and personalization were quickly adopted by the community. Yet training these models in the first place remains very costly. While several recent approaches - including masking, distillation, and architectural modifications - have been proposed to improve training efficiency, each of these methods comes with a tradeoff: they achieve enhanced performance at the expense of increased computational cost, or vice versa. In contrast, this work aims to improve training efficiency and generative performance at the same time through routes that act as a transport mechanism for randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to common transformer-based models: it can also be applied to state-space models, and it achieves this without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT, and 37x compared to the best benchmark performance of DiT at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, improving upon DiT without architectural changes.

Motivation

TREAD: Routing scheme
Consecutive layers produce highly similar outputs. The effect of the routing mechanism is evident in the cosine similarities between layers: for a route 2→8, layer 2 exhibits high similarity with the routed layers. We interpret this as an adaptation of layer 2 to the route 2→8.

Residual-based architectures, including transformers, can interpret outputs from preceding layers. We demonstrate this property (above), where the cosine similarity between the outputs of all layers in a trained network is shown. This characteristic can be leveraged either to cache layer outputs from previous timesteps for improved inference or to route tokens from one layer to deeper layers, as proposed in TREAD.
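For illustration, below is a minimal sketch of how such a layer-wise cosine-similarity matrix can be computed; it is not the authors' code, and the names `blocks` (the list of residual blocks of a trained token-based model) and `x` (a batch of token embeddings) are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_similarity(blocks, x):
    # Collect the output of every block for the same input batch.
    outputs = []
    h = x
    for block in blocks:
        h = block(h)
        outputs.append(h.flatten(1))  # (batch, tokens * channels)

    # Pairwise cosine similarity between layer outputs, averaged over the batch.
    sims = torch.zeros(len(outputs), len(outputs))
    for a, out_a in enumerate(outputs):
        for b, out_b in enumerate(outputs):
            sims[a, b] = F.cosine_similarity(out_a, out_b, dim=-1).mean()
    return sims  # (num_layers, num_layers) similarity matrix
```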

Method

TREAD: Routing scheme
During training we modify the forward pass with a route from layer $i$ to layer $j$.

During each training step, a randomly chosen subset of tokens is “teleported” from an early layer i directly to a deeper layer j (a route i → j). Tokens that take the shortcut skip the self-attention and feed-forward modules in all layers between i and j, yielding lower FLOPs and higher training throughput. Because the routed tokens re-enter the network at layer j, no information is lost, and early blocks receive a deep-supervision signal when their activations are compared against late-stage objectives. In practice, routes that begin just after the start (i ≈ 2), span most of the depth (j ≈ B-4, where B is the number of blocks), and involve 50% of the tokens strike the best balance between speed-up and sample quality.
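The sketch below illustrates such a routed forward pass under our own assumptions: names like `routed_forward`, `blocks`, `route_start`, `route_end`, and `keep_ratio` are illustrative, and details such as how routed tokens are merged back may differ from the released implementation.

```python
import torch

def routed_forward(blocks, x, route_start=2, route_end=None, keep_ratio=0.5):
    # x: (batch, tokens, channels); blocks: list of residual blocks.
    B, N, D = x.shape
    route_end = len(blocks) - 4 if route_end is None else route_end

    # Early layers see all tokens.
    for block in blocks[:route_start]:
        x = block(x)

    # Randomly split tokens: "kept" tokens stay in the trunk,
    # the rest take the route and skip the blocks in between.
    ids = torch.rand(B, N, device=x.device).argsort(dim=1)
    n_keep = int(N * keep_ratio)
    keep_ids, route_ids = ids[:, :n_keep], ids[:, n_keep:]
    kept = torch.gather(x, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
    routed = torch.gather(x, 1, route_ids.unsqueeze(-1).expand(-1, -1, D))

    # Only the kept tokens pass through the blocks inside the route.
    for block in blocks[route_start:route_end]:
        kept = block(kept)

    # Re-insert the routed tokens at layer j, restoring the original order.
    merged = torch.empty_like(x)
    merged.scatter_(1, keep_ids.unsqueeze(-1).expand(-1, -1, D), kept)
    merged.scatter_(1, route_ids.unsqueeze(-1).expand(-1, -1, D), routed)

    # The remaining layers again operate on the full token sequence.
    for block in blocks[route_end:]:
        merged = block(merged)
    return merged
```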

Results

Quantitative Results

We report FID scores across different architectures trained for 400K iterations on ImageNet-256. Our method consistently improves FID across all backbone sizes.

FID comparison tables. Left: all model configurations trained for 400K iterations. Right: additional settings and ablations.

On ImageNet-256 (class-conditional), DiT-XL/2 + TREAD reaches an unguided FID of 3.93 after 400K updates, which is 14x faster than the baseline and 37x faster than DiT's published best checkpoint at 7M steps. Finetuning without routing achieves a final guided FID of 2.09. Furthermore, the same routing idea transfers to state-space models (e.g., RWKV), to 512² resolution, and to the text-to-image benchmark MS-COCO. Additionally, we find it stacks with representation distillation methods like REPA.

Qualitative Results

Examples of a DiT-XL/2 + TREAD. For more (uncurated) examples, please refer to the appendix in our paper.

Analysis

Route Placement

TREAD: Inference routing scheme
FID for different route placements in DiT-B/2 + TREAD.

Guidelines for route placement:

  1. Long routes improve convergence and training efficiency.
  2. Ending the route before the final layers is essential.
  3. Not starting the route immediately at the first layer brings further improvements.

Furthermore, we test scaling behavior with increasing depth (from a DiT-B/2 with 12 layers to a DiT-XL/2 with 28 layers). We find that absolute scaling (i=2, j=B-4) works best for all tested models, while relative scaling shows little performance gain. From this we conclude that route placement is crucial, but it can be handled with the guidelines above.
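To make the placement rule concrete, here is a toy helper under our own naming assumptions (`route_placement`, `num_blocks`); it simply applies i = 2 and j = B - 4 for a network with B blocks.

```python
# Toy helper (an illustrative assumption, not code from the paper) that applies
# the absolute placement rule i = 2, j = B - 4 for a network with B blocks.
def route_placement(num_blocks: int) -> tuple[int, int]:
    i, j = 2, num_blocks - 4
    assert 0 < i < j < num_blocks, "route must fit inside the network"
    return i, j

print(route_placement(12))  # DiT-B/2  -> (2, 8)
print(route_placement(28))  # DiT-XL/2 -> (2, 24)
```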

BibTeX

@article{krause2025tread,
  title={TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training},
  author={Krause, Felix and Phan, Timy and Gui, Ming and Baumann, Stefan Andreas and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  journal={arXiv preprint arXiv:2501.04765},
  year={2025}
}