CompVis Group @ LMU Munich
* Equal Contribution
This repository contains the official implementation of the paper "CleanDIFT: Diffusion Features without Noise".
We propose CleanDIFT, a novel method to extract noise-free, timestep-independent features by enabling diffusion models to work directly with clean input images. Our approach is efficient: training takes just 30 minutes on a single GPU.
Just clone the repo and install the requirements via `pip install -r requirements.txt`, then you're ready to go.
To train a feature extractor on your own, run `python train.py`. The training script expects your data to be stored in `./data` in the following format: a single-level directory with images named `filename.jpg` and corresponding JSON files `filename.json` that contain the key `caption`, as sketched below.
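As a hypothetical illustration of that layout (the filename `000001` and the caption text are placeholders, not from the repo):

```python
import json
from pathlib import Path

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)

# ./data/000001.jpg is a training image; each image gets a JSON sidecar
# with the same stem that holds its caption under the "caption" key.
with open(data_dir / "000001.json", "w") as f:
    json.dump({"caption": "a photo of a cat sitting on a windowsill"}, f)
```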
For feature extraction, please refer to the notebooks in `notebooks`. They demonstrate how to extract features and use them for semantic correspondence detection and depth prediction; a minimal extraction sketch also follows the checkpoint-loading snippet below.
Our checkpoints are fully compatible with the `diffusers` library. If you already have a pipeline using SD 1.5 or SD 2.1 from `diffusers`, you can simply replace the U-Net state dict:
```python
from diffusers import UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load the base SD 2.1 U-Net, then swap in the CleanDIFT weights.
unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="unet")
ckpt_pth = hf_hub_download(repo_id="CompVis/cleandift", filename="cleandift_sd21_unet.safetensors")
state_dict = load_file(ckpt_pth)
unet.load_state_dict(state_dict, strict=True)
```
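Continuing from the snippet above, here is a minimal sketch of pulling features from the swapped-in U-Net on a clean image. The choice of `mid_block` as the hook point and timestep 0 are illustrative assumptions; refer to the notebooks for the exact layers and settings used in the paper.

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda" if torch.cuda.is_available() else "cpu"
unet = unet.to(device).eval()

# Remaining SD 2.1 components needed for a forward pass.
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder").to(device)

# Capture intermediate activations with a forward hook (mid_block is an assumption).
features = {}
handle = unet.mid_block.register_forward_hook(lambda mod, args, out: features.update(mid=out))

image = torch.rand(1, 3, 512, 512, device=device) * 2 - 1  # placeholder for a real image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor
    ids = tokenizer([""], return_tensors="pt").input_ids.to(device)
    # CleanDIFT consumes clean latents, so no noise is added; timestep 0 is an assumption.
    unet(latents, timestep=0, encoder_hidden_states=text_encoder(ids)[0])

handle.remove()
print(features["mid"].shape)
```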
If you use this codebase or otherwise find our work valuable, please cite our paper:
@misc{stracke2024cleandiftdiffusionfeaturesnoise,
      title={CleanDIFT: Diffusion Features without Noise},
      author={Nick Stracke and Stefan Andreas Baumann and Kolja Bauer and Frank Fundel and Björn Ommer},
      year={2024},
      eprint={2412.03439},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.03439},
}