
How to align the DINO features with 512x512 image latents when the patch number is different #26

JacobKong opened this issue Nov 26, 2024 · 3 comments

@JacobKong commented Nov 26, 2024

Hi there,

Great work! However, I am wondering: if the diffusion model is trained at 512x512 or an even larger image size, how do you align the projected features with the DINO features (224x224 input -> 16x16 patches), since the patch numbers are different?

Should I downsample the projected features to compute the projection loss?

Best regards.
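To make the mismatch concrete, here is a quick token count, assuming the usual SiT/DiT setup of an 8x VAE downsample and a patch size of 2 on the latents (my assumptions, not stated in this thread):

```python
# Token counts at 512x512, assuming an 8x VAE and a patch size of 2 for the
# diffusion transformer (standard SiT/DiT values; assumption, not from this thread).
image_size = 512
latent_size = image_size // 8            # 64x64 latent
sit_tokens = (latent_size // 2) ** 2     # 32 * 32 = 1024 projected tokens

dino_tokens = (224 // 14) ** 2           # 16 * 16 = 256 DINOv2 patch tokens
print(sit_tokens, dino_tokens)           # 1024 vs. 256 -> the mismatch in question
```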

@dengchcs commented

They upsample the images before feeding them to DINOv2. It's mentioned in their OpenReview revision file.
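For illustration, the upsampling could look like this (the 448 target size, chosen as 14 * 32 so that a ViT-/14 backbone emits a 32x32 patch grid, is my inference from the snippets quoted below, not a confirmed detail of the paper):

```python
import torch
import torch.nn.functional as F

images = torch.randn(4, 3, 512, 512)  # dummy batch at the diffusion resolution
# 448 = 14 * 32, so DINOv2's /14 patching yields a 32x32 grid (1024 tokens),
# matching the diffusion transformer's token count at 512x512.
x_for_dino = F.interpolate(images, size=448, mode='bicubic')
```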

@PanXiebit commented

> They upsample the images before feeding them to DINOv2. It's mentioned in their OpenReview revision file.

Can DINOv2 process images at a higher resolution? And is it possible to upsample after extracting the features instead?

@dengchcs commented

> > They upsample the images before feeding them to DINOv2. It's mentioned in their OpenReview revision file.
>
> Can DINOv2 process images at a higher resolution? And is it possible to upsample after extracting the features instead?

Hi, I'm not familiar with DINO. I guess you can just resize the images before using DINO to extract features, as suggested in the code:

REPA/train.py, line 50 (commit 80ee742):

```python
x = torch.nn.functional.interpolate(x, 224, mode='bicubic')
```
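(For the 256x256 setup this resize to 224 makes the grids match directly: 224 / 14 gives 16x16 = 256 DINOv2 tokens, the same count as a 32x32 latent with patch size 2, under the assumptions in the token count above.)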

Also, we may need to interpolate the positional embeddings as suggested here:

REPA/utils.py, line 86 (commit 80ee742):

```python
encoder.pos_embed.data = timm.layers.pos_embed.resample_abs_pos_embed(
```
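That permalink is truncated; a minimal sketch of the full resampling step might look like this (the `dinov2_vitb14` checkpoint, the torch.hub loading, and the 32x32 target grid for 448x448 inputs are my assumptions, not the verbatim REPA code):

```python
import timm
import torch

# Assumed setup: DINOv2 ViT-B/14 via torch.hub (REPA may load it differently).
encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')

# The pretrained grid is 16x16 (224 / 14); resample to 32x32 for 448x448
# inputs. DINOv2 keeps one class token in front of the patch tokens.
encoder.pos_embed.data = timm.layers.pos_embed.resample_abs_pos_embed(
    encoder.pos_embed.data,
    new_size=[32, 32],
    num_prefix_tokens=1,
)

# Sanity check: upsample a 512x512 batch to 448 and extract patch tokens.
x = torch.nn.functional.interpolate(
    torch.randn(1, 3, 512, 512), 448, mode='bicubic'
)
tokens = encoder.forward_features(x)['x_norm_patchtokens']
print(tokens.shape)  # torch.Size([1, 1024, 768]) -> 32*32 tokens, ViT-B width
```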
