Qualitative Results

Diffusion models have emerged as the mainstream approach for visual generation, but they typically suffer from sample inefficiency and high training costs. Consequently, methods for efficient finetuning, inference, and personalization were quickly adopted by the community, yet training these models in the first place remains very costly. While several recent approaches, including masking, distillation, and architectural modifications, have been proposed to improve training efficiency, each comes with a tradeoff: enhanced performance at the expense of increased computational cost, or vice versa. In contrast, this work aims to improve both training efficiency and generative performance at the same time through routes that act as a transport mechanism, carrying randomly selected tokens from early layers to deeper layers of the model. Our method is not limited to common transformer-based models; it also applies to state-space models, and it does so without architectural modifications or additional parameters. Finally, we show that TREAD reduces computational cost and simultaneously boosts model performance on the standard ImageNet-256 benchmark for class-conditional synthesis. Both benefits multiply to a convergence speedup of 14x at 400K training iterations compared to DiT and 37x compared to DiT's best benchmark performance at 7M training iterations. Furthermore, we achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting, improving upon DiT without architectural changes.
Residual-based architectures, including transformers, can interpret the outputs of preceding layers. We demonstrate this property in the figure above, which shows the cosine similarity between the outputs of all layers of a trained network. This characteristic can be leveraged either to cache layer outputs from previous timesteps for improved inference or to route tokens from one layer to deeper layers, as proposed in TREAD.
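To make the measurement concrete, the following is a minimal sketch of how such a layer-similarity map could be computed in PyTorch, assuming a model that is callable as model(x), exposes its transformer blocks as model.blocks, and whose blocks return plain tensors of shape (batch, tokens, dim); the function name and attribute access are illustrative, not taken from the TREAD codebase.

import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_similarity(model, x):
    # Collect every block output with forward hooks during a single pass.
    outputs = []
    hooks = [blk.register_forward_hook(lambda m, inp, out: outputs.append(out))
             for blk in model.blocks]
    model(x)
    for h in hooks:
        h.remove()

    # Flatten each block output and compare all pairs via cosine similarity.
    flat = [o.flatten(1) for o in outputs]          # each (batch, tokens*dim)
    L = len(flat)
    sim = torch.zeros(L, L)
    for i in range(L):
        for j in range(L):
            sim[i, j] = F.cosine_similarity(flat[i], flat[j], dim=1).mean()
    return sim                                      # (L, L) layer-similarity matrix

A high similarity between early and late blocks is exactly the property that makes it viable to let a subset of tokens skip the intermediate blocks.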
During each training step, a randomly chosen subset of tokens is “teleported” from an early layer i directly to a deeper layer j (a route i → j). Tokens that take this shortcut skip the self-attention and feed-forward modules of all layers between i and j, yielding lower FLOPs and higher training throughput. Because the routed tokens re-enter the network at layer j, no information is lost, and early blocks receive a deep-supervision signal when their activations are compared against late-stage objectives. In practice, routes that begin just after the start (i ≈ 2), span most of the depth (j ≈ B - 4, where B is the number of blocks), and route 50% of the tokens strike the best balance between speed-up and sample quality.
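As a rough illustration of this routing, here is a hypothetical PyTorch sketch (class and argument names are our own, not the official implementation): a random subset of tokens is set aside at layer i, only the remaining tokens pass through the intermediate blocks, and the routed tokens are re-inserted at their original positions before layer j. Routing is active only in training mode; at inference every token traverses every block.

import torch
import torch.nn as nn

class RoutedTransformer(nn.Module):
    # Hypothetical wrapper applying a single route i -> j during training.
    def __init__(self, blocks: nn.ModuleList, start_layer: int = 2,
                 end_layer: int = -4, route_ratio: float = 0.5):
        super().__init__()
        self.blocks = blocks
        self.start_layer = start_layer
        # Negative indices count from the end of the stack (e.g. j = B - 4).
        self.end_layer = end_layer % len(blocks)
        self.route_ratio = route_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                     # (batch, num_tokens, dim)
        routed, keep_idx, route_idx = None, None, None

        for layer_idx, block in enumerate(self.blocks):
            if self.training and layer_idx == self.start_layer:
                # Randomly split the tokens into a processed set and a routed set.
                perm = torch.randperm(N, device=x.device)
                n_routed = int(N * self.route_ratio)
                route_idx, keep_idx = perm[:n_routed], perm[n_routed:]
                routed = x[:, route_idx]      # saved as-is; skips blocks i..j-1
                x = x[:, keep_idx]            # only these tokens are processed

            if self.training and layer_idx == self.end_layer and routed is not None:
                # Re-insert the routed tokens at their original positions before block j.
                merged = torch.empty(B, N, D, device=x.device, dtype=x.dtype)
                merged[:, keep_idx] = x
                merged[:, route_idx] = routed
                x, routed = merged, None

            x = block(x)
        return x

Because the split and merge only remove and restore token positions, the wrapper adds no parameters, which is consistent with the claim that TREAD requires no architectural modifications.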
We report FID scores across different architectures trained for 400K iterations on ImageNet-256. Our method consistently improves FID across all backbone sizes.
On ImageNet-256 (class-conditional), DiT-XL/2 + TREAD reaches an unguided FID of 3.93 after 400K updates, which is 14x faster than the baseline and 37x faster than DiT's published best checkpoint at 7M steps. Finetuning without routing then achieves a final guided FID of 2.09. The same routing idea also transfers to state-space models (e.g., RWKV), to 512² resolution, and to the text-to-image benchmark MS-COCO, and we find that it stacks with representation-distillation methods such as REPA.
Furthermore, we test scaling behavior with increasing layer count (from a DiT-B/2 with 12 layers to a DiT-XL/2 with 28 layers). We find that absolute placement (i = 2, j = B - 4) works best for all tested models, while relative scaling of the route endpoints shows little performance gain. From this we conclude that route placement is crucial but is straightforward to set with the guidelines above, as illustrated in the small sketch below.
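For concreteness, a tiny illustrative helper (our own naming) shows what the absolute placement rule i = 2, j = B - 4 yields for the tested depths:

def absolute_route(num_blocks: int) -> tuple[int, int]:
    # Absolute placement: fixed offsets from both ends of the block stack.
    return 2, num_blocks - 4

for name, depth in [("DiT-B/2", 12), ("DiT-XL/2", 28)]:
    i, j = absolute_route(depth)
    print(f"{name}: route layer {i} -> layer {j} (of {depth} blocks)")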
@article{krause2025tread,
  title={TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training},
  author={Krause, Felix and Phan, Timy and Gui, Ming and Baumann, Stefan Andreas and Hu, Vincent Tao and Ommer, Bj{\"o}rn},
  journal={arXiv preprint arXiv:2501.04765},
  year={2025}
}