🧹 CleanDIFT: Diffusion Features without Noise

CompVis @ LMU Munich

TL;DR: Diffusion models learn powerful world representations that have proven valuable for tasks like semantic correspondence detection, depth estimation, semantic segmentation, and classification. However, diffusion models require noisy input images, which destroys information and introduces the noise level as a hyperparameter that needs to be tuned for each task. We propose a novel method to extract noise-free, timestep-independent features by enabling diffusion models to work directly with clean input images. Our approach is efficient, training on a single GPU in just 30 minutes.

Overview

Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Works that use these features generally need to add noise to images before passing them through the model, as the models do not offer the most useful features when given images with little to no noise. We show that this noise has a critical impact on the usefulness of these features and cannot be remedied by ensembling with different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features outperform previous diffusion features by a wide margin across a broad range of extraction setups and downstream tasks, offering better performance than even ensemble-based methods at a fraction of the cost.

Clean Features → Clean Predictions

We evaluate our features on a wide range of downstream tasks: unsupervised zero-shot semantic correspondence, monocular depth estimation, semantic segmentation, and classification. We compare our features against standard diffusion features, methods that combine diffusion features with additional features, and non-diffusion-based approaches.

[Figure: input images alongside the corresponding depth estimation and semantic segmentation predictions]

We compare depth estimation and semantic segmentation using linear probes on standard diffusion features and our CleanDIFT features. Note how the predictions from CleanDIFT features are far less noisy than those from the standard diffusion features. Depth probes are trained on the NYUv2 dataset, segmentation probes on PASCAL VOC. Standard diffusion features use t=100 for semantic segmentation and t=300 for depth estimation.

Zero-shot semantic correspondence matching using DIFT features with standard SD 2.1 (t=261) and our CleanDIFT features. Our clean features show significantly fewer incorrect matches than the standard diffusion features.

How it works

We train our feature extraction model to match the diffusion model's internal representations. The feature extraction model is initialized as a trainable copy of the diffusion model. Crucially, the feature extraction model is given the clean input image, while the diffusion model receives the noisy image and the corresponding timestep as input. Our goal is a single, noise-free feature map from the feature extraction model that consolidates the information spread across the diffusion model's timestep-dependent internal representations. To align our model's representations with these timestep-dependent diffusion features during training, we introduce point-wise, timestep-conditioned feature projection heads; the feature maps predicted by these heads are aligned to the diffusion model's features. At inference time, we usually discard the projection heads and directly use the feature extraction model's internal representations. However, the projection heads can also be used to efficiently obtain feature maps for specific timesteps: the feature extraction model's internal representations are computed once and then passed through the projection heads for different values of t.
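
A minimal PyTorch-style sketch of this training setup, under simplifying assumptions: DiffusionBackbone, extract_features, the toy noise schedule, and the cosine alignment objective are illustrative placeholders, not the exact training code; in the real setup, teacher and student are copies of a pre-trained diffusion U-Net whose intermediate activations are hooked.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionBackbone(nn.Module):
    # Stand-in for a pre-trained diffusion U-Net; a real setup would hook its
    # intermediate blocks to collect feature maps.
    def __init__(self, dim=64, num_timesteps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(num_timesteps, dim)
        self.block = nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def extract_features(self, x, t):
        h = self.block(x) + self.t_embed(t)[:, :, None, None]
        return [h]  # list of internal feature maps (one per hooked layer)

class TimestepProjectionHead(nn.Module):
    # Point-wise (1x1) projection of the student's features, conditioned on the
    # teacher's noising timestep t.
    def __init__(self, dim, num_timesteps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(num_timesteps, dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feats, t):
        return self.proj(feats + self.t_embed(t)[:, :, None, None])

def training_step(teacher, student, heads, alphas_cumprod, x_clean):
    B = x_clean.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x_clean.device)

    # Forward diffusion: the frozen teacher sees the noisy image and its timestep.
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_noisy = a.sqrt() * x_clean + (1 - a).sqrt() * torch.randn_like(x_clean)
    with torch.no_grad():
        teacher_feats = teacher.extract_features(x_noisy, t)

    # The trainable student sees the clean image (fixed t=0 conditioning as a stand-in).
    student_feats = student.extract_features(x_clean, torch.zeros_like(t))

    # Align the projected student features to the teacher's timestep-dependent
    # features, here with a negative cosine-similarity objective (placeholder choice).
    loss = 0.0
    for head, f_s, f_t in zip(heads, student_feats, teacher_feats):
        loss = loss + (1 - F.cosine_similarity(head(f_s, t), f_t, dim=1)).mean()
    return loss

teacher = DiffusionBackbone().eval().requires_grad_(False)
student = copy.deepcopy(teacher).train().requires_grad_(True)
heads = nn.ModuleList([TimestepProjectionHead(64)])
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # toy schedule in place of the real one
loss = training_step(teacher, student, heads, alphas_cumprod, torch.randn(2, 3, 64, 64))
loss.backward()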

Quantitative Comparison

Zero-Shot Semantic Correspondence

Zero-shot unsupervised semantic correspondence matching performance comparison on SPair-71k. Our clean features consistently yield substantial improvements in matching performance. Reported numbers are our reproductions.

We evaluate semantic correspondence matching accuracy for different noise levels. Our feature extractor outperforms the standard noisy diffusion features across all timesteps t. We additionally demonstrate that simply providing the diffusion model with a clean image and a non-zero timestep does not result in improved performance.
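
For reference, a rough sketch of how such zero-shot matching can be set up on top of extracted feature maps; src_feats, tgt_feats, and match_point are illustrative names, and coordinates are assumed to be in feature-map resolution.

import torch
import torch.nn.functional as F

def match_point(src_feats, tgt_feats, src_xy):
    # Take the descriptor at the query location in the source feature map and
    # return the target location with the highest cosine similarity to it.
    C, H, W = tgt_feats.shape
    x, y = src_xy
    query = src_feats[:, y, x]                                                    # (C,) source descriptor
    sims = F.cosine_similarity(query[:, None], tgt_feats.reshape(C, -1), dim=0)  # (H*W,)
    idx = sims.argmax().item()
    return idx % W, idx // W                                                      # predicted (x, y) in the target map

# toy usage with random feature maps
src = torch.randn(1280, 24, 24)
tgt = torch.randn(1280, 24, 24)
print(match_point(src, tgt, (5, 7)))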

Monocular Depth Estimation

We evaluate metric depth prediction on NYUv2 using a linear probe. Our clean features outperform the noisy features by a significant margin. Probes trained on the noisy features can be reused on the clean features, but the resulting performance gain is smaller.
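
A small sketch of what such a linear probe can look like, assuming frozen feature maps of shape (B, C, h, w) and depth targets at image resolution; the feature dimension, resolutions, and loss are placeholder choices rather than the exact probe configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDepthProbe(nn.Module):
    # A single 1x1 convolution regresses per-pixel depth from frozen features;
    # the prediction is upsampled to the ground-truth resolution.
    def __init__(self, feat_dim):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats, out_size):
        return F.interpolate(self.head(feats), size=out_size,
                             mode="bilinear", align_corners=False)

probe = LinearDepthProbe(feat_dim=1280)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
feats = torch.randn(2, 1280, 24, 24)   # placeholder for frozen CleanDIFT features
depth_gt = torch.rand(2, 1, 384, 384)  # placeholder depth maps (e.g. normalized NYUv2)
loss = F.mse_loss(probe(feats, depth_gt.shape[-2:]), depth_gt)
loss.backward()
opt.step()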

Semantic Segmentation

Performance on semantic segmentation on the PASCAL VOC dataset using linear probes. Our clean features outperform the noisy diffusion features even at their best noising timestep t. The semantic segmentation performance of a standard diffusion model depends heavily on the noising timestep used; unlike for semantic correspondence matching, the optimal value appears to be around t=100.

Classification

Classification performance on ImageNet1k, using a kNN classifier with k=10 and cosine similarity as the distance metric. We sweep over different timesteps and feature maps. We find that the feature map with the lowest spatial resolution (feature map #0) yields the highest classification accuracy.
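
A short sketch of this kNN protocol, assuming each image has already been reduced to a single pooled feature vector; function and variable names are illustrative.

import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=10):
    # Cosine-similarity kNN: normalize, take the k most similar training samples
    # per test sample, and return the majority label.
    train = F.normalize(train_feats, dim=1)   # (N, D)
    test = F.normalize(test_feats, dim=1)     # (M, D)
    sims = test @ train.T                     # cosine similarity via dot products of unit vectors
    nn_idx = sims.topk(k, dim=1).indices      # (M, k) nearest-neighbor indices
    return train_labels[nn_idx].mode(dim=1).values

# toy usage: 100 "training" images, 5 "test" images, 10 classes
train_feats, train_labels = torch.randn(100, 1280), torch.randint(0, 10, (100,))
test_feats = torch.randn(5, 1280)
print(knn_classify(train_feats, train_labels, test_feats, k=10))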

BibTeX


        @misc{stracke2024cleandiftdiffusionfeaturesnoise,
          title={CleanDIFT: Diffusion Features without Noise}, 
          author={Nick Stracke and Stefan Andreas Baumann and Kolja Bauer and Frank Fundel and Björn Ommer},
          year={2024},
          eprint={2412.03439},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.03439}, 
        }