We evaluate our features on a wide range of downstream tasks: unsupervised zero-shot semantic correspondence, monocular depth estimation, semantic segmentation, and classification. We compare our features against standard diffusion features, methods that combine diffusion features with additional features, and non-diffusion-based approaches.
[Figure: qualitative comparison panels — Input Image / Depth Estimation and Input Image / Semantic Segmentation]
We compare depth estimation and semantic segmentation results obtained with linear probes on standard diffusion features and on our CleanDIFT features. Note how the CleanDIFT features are far less noisy than the standard diffusion features. Depth probes are trained on the NYUv2 dataset, segmentation probes on PASCAL VOC. Standard diffusion features use t=100 for semantic segmentation and t=300 for depth estimation.
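To make the probing setup concrete, the following is a minimal sketch of a linear probe on frozen features for dense prediction, assuming feature maps of shape (B, C, H, W) have already been extracted from the backbone; the names, feature dimension, and output sizes are placeholders, not the exact configuration used in our experiments.

```python
# Sketch: linear probe (1x1 conv) on frozen features for depth or segmentation.
# Feature extraction is abstracted away; `feats` stands in for extracted features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbe(nn.Module):
    """1x1 convolution mapping C-dimensional features to the task output."""
    def __init__(self, feat_dim: int, out_dim: int):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, out_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
        logits = self.head(feats)
        # Upsample predictions to the label resolution before computing the loss.
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

# Hypothetical usage: segmentation (out_dim = number of classes) and depth (out_dim = 1).
feat_dim, num_classes = 1280, 21               # placeholder values
seg_probe = LinearProbe(feat_dim, num_classes)
depth_probe = LinearProbe(feat_dim, 1)

feats = torch.randn(2, feat_dim, 24, 24)       # stand-in for extracted features
seg_logits = seg_probe(feats, out_size=(384, 384))
depth_pred = depth_probe(feats, out_size=(384, 384))

seg_loss = F.cross_entropy(seg_logits, torch.randint(0, num_classes, (2, 384, 384)))
depth_loss = F.l1_loss(depth_pred, torch.rand(2, 1, 384, 384))
```

Only the probe head is trained; the feature extractor stays frozen, so the probe quality directly reflects how informative (and how noise-free) the underlying features are.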
Zero-shot semantic correspondence matching using DIFT with standard SD 2.1 features (t=261) and with our CleanDIFT features. Our clean features produce significantly fewer incorrect matches than the standard diffusion features.
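For illustration, the correspondence matching itself reduces to a nearest-neighbor search in feature space. The sketch below assumes dense feature maps of shape (C, H, W) have already been extracted for the source and target images; the function name and tensor shapes are illustrative, not the exact implementation.

```python
# Sketch: zero-shot correspondence by cosine-similarity nearest-neighbor matching
# between dense feature maps of a source and a target image.
import torch
import torch.nn.functional as F

def match_point(feat_src: torch.Tensor, feat_tgt: torch.Tensor, y: int, x: int):
    """Return the (y, x) location in the target feature map whose feature is
    most cosine-similar to the source feature at (y, x)."""
    c, h, w = feat_tgt.shape
    query = F.normalize(feat_src[:, y, x], dim=0)        # (C,)
    keys = F.normalize(feat_tgt.reshape(c, -1), dim=0)   # (C, H*W), unit-norm per location
    sim = query @ keys                                   # cosine similarities, (H*W,)
    idx = int(sim.argmax())
    return idx // w, idx % w                             # target (y, x)

# Hypothetical usage with random stand-ins for extracted feature maps.
feat_src = torch.randn(1280, 48, 48)
feat_tgt = torch.randn(1280, 48, 48)
ty, tx = match_point(feat_src, feat_tgt, y=10, x=20)
```

Because matching is purely similarity-based with no task-specific training, the match quality is a direct readout of the semantic consistency of the features themselves.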