
How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer? #510

Closed
NightMachinery opened this issue Apr 26, 2023 · 6 comments


@NightMachinery

How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer?

I want to do some research on ViT, and I need my changes to be visible in both ViT and CLIP's ViT.

Should I load the model weights using HuggingFace, and then manually update the state dict of OpenCLIP's VisionTransformer using these weights?

@rwightman
Collaborator

@NightMachinery so google/vit-large-patch32-224-in21k is an ImageNet-21k pretrained ViT in HF transformers ViT form. Do you want to do LiT-style pretraining and load the vision tower with those weights?

OpenCLIP doesn't have support for HF image towers right now, only text towers. It does support timm image towers, though, and the same weights exist in timm as https://huggingface.co/timm/vit_large_patch32_224.orig_in21k (the model card still needs updating, but they are the same weights).

You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.
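For concreteness, a minimal sketch (untested) of such a config, written as a Python dict dumped to JSON. The vision_cfg keys mirror the TimmModel(...) arguments quoted later in this thread, and the text_cfg matches the values quoted below from vit_medium_patch16_gap_256.json; the embed_dim, the config filename, and the tagged timm model name are illustrative assumptions, not values taken from the repo.

import json

# Sketch of a custom open_clip model config that pulls the image tower from timm.
# embed_dim, the filename, and the tagged timm model name are assumptions.
cfg = {
    "embed_dim": 512,  # assumption: pick to match the text tower / projection width
    "vision_cfg": {
        "timm_model_name": "vit_large_patch32_224.orig_in21k",
        "timm_model_pretrained": True,  # load the ImageNet-21k weights from timm at init
        "timm_pool": "",
        "timm_proj": "linear",
        "image_size": 224
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}

# Drop the file into src/open_clip/model_configs/ so the factory discovers it by name.
with open("src/open_clip/model_configs/vit_large_patch32_224_in21k.json", "w") as f:
    json.dump(cfg, f, indent=4)

Something like open_clip.create_model_and_transforms("vit_large_patch32_224_in21k") should then build the CLIP model with the image tower initialized from the timm weights.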

There has been some work to allow more flexible loading of arbitrary local checkpoints into the towers (#333 and #255), but unfortunately those PRs are out of date, lost in the shuffle of many changes, and someone needs to spend some time redoing that effort cleanly on the existing code...

@rwightman
Collaborator

rwightman commented Apr 26, 2023

BTW, not sure if you specifically want the patch32 large, but the later AugReg models ("How to Train Your ViT") are MUCH better ImageNet-21k pretrained models, although they never did the L/32 config.

The newer B/16 is significantly better than the older L/32 but needs more compute. Even the newer B/32 and S/16 are better, with less compute.

https://huggingface.co/timm/vit_base_patch16_224.augreg_in21k
https://huggingface.co/timm/vit_small_patch16_224.augreg_in21k
https://huggingface.co/timm/vit_base_patch32_224.augreg_in21k

@NightMachinery
Author

NightMachinery commented Apr 26, 2023

> @NightMachinery so google/vit-large-patch32-224-in21k is an ImageNet-21k pretrained ViT in HF transformers ViT form. Do you want to do LiT-style pretraining and load the vision tower with those weights?
>
> OpenCLIP doesn't have support for HF image towers right now, only text towers. It does support timm image towers, though, and the same weights exist in timm as https://huggingface.co/timm/vit_large_patch32_224.orig_in21k (the model card still needs updating, but they are the same weights).
>
> You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.
>
> There has been some work to allow more flexible loading of arbitrary local checkpoints into the towers (#333 and #255), but unfortunately those PRs are out of date, lost in the shuffle of many changes, and someone needs to spend some time redoing that effort cleanly on the existing code...

Thanks for the tip on "How to Train Your ViT." I indeed only want a transformer-only (no conv) ViT; it doesn't matter much which specific setup.

My goal is not to pretrain an image tower. I want to implement a new post-hoc interpretability technique for Vision Transformers, and I want to avoid duplicating the interpretability code for a normal ViT and for a CLIP image tower.

My understanding of the timm wrapper is that it uses the timm model's code. So if I modify timm, I can successfully work with normal (ImageNet classifier) ViTs, but I would not be able to load a CLIP image tower that uses this timm code, correct?

To repeat myself: I want to change the transformer code and see that change take effect with the weights of both timm/vit_base_patch16_224.augreg_in21k and laion/CLIP-ViT-B-32-laion2B-s34B-b79K. Ideally I want to change the code in only one place and avoid duplicating the changes across two codebases. This is mostly because the changes are experimental, and I want to be able to easily experiment and see the results on both ImageNet classifiers and contrastive CLIP image towers.

I do not want to train/finetune any weights.

Thanks again.

@rwightman
Collaborator

rwightman commented Apr 26, 2023

@NightMachinery timm can load the CLIP image tower weights for the main openai and laion2b checkpoints

https://github.com/huggingface/pytorch-image-models/blob/9ee846ff0cbbc05a99b45140aa6d84083bcf6488/timm/models/vision_transformer.py#L1188-L1225

So yes, it is possible to just use the timm backbone code for the image tower. Otherwise, yeah, you'll have to make the same mods to both timm and the builtin OpenCLIP transformer, and they are different styles...
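To make that concrete, a minimal sketch (untested) using model names already mentioned in this thread; both checkpoints instantiate timm's VisionTransformer, so a change made inside timm's vision_transformer.py is visible to both:

import timm

# Plain ImageNet-21k pretrained ViT (AugReg recipe).
vit_in21k = timm.create_model("vit_base_patch16_224.augreg_in21k", pretrained=True)

# laion2b CLIP image tower loaded through the same VisionTransformer class;
# num_classes=0 drops the classifier head so only the backbone/tower remains.
vit_clip_tower = timm.create_model(
    "vit_base_patch16_clip_224.laion2b", pretrained=True, num_classes=0)

print(type(vit_in21k).__name__, type(vit_clip_tower).__name__)  # both VisionTransformer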

@NightMachinery
Author

> You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.

I have taken a look at the code:

visual = TimmModel(
    vision_cfg.timm_model_name,
    pretrained=vision_cfg.timm_model_pretrained,
    pool=vision_cfg.timm_pool,
    proj=vision_cfg.timm_proj,
    proj_bias=vision_cfg.timm_proj_bias,
    drop=vision_cfg.timm_drop,
    drop_path=vision_cfg.timm_drop_path,
    embed_dim=embed_dim,
    image_size=vision_cfg.image_size,
)
act_layer = nn.GELU  # so that text transformer doesn't use QuickGELU w/ timm models

I don't understand why the activation has been set to GELU here. Shouldn't timm's CLIP models like vit_base_patch16_clip_224.laion2b use QuickGELU?


I also don't understand how open_clip knows which text encoder to load. In the config you have linked:

    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }

There is no hub ID. So which pretrained weights are loaded?

@rwightman
Collaborator

@NightMachinery QuickGELU is a legacy activation from the OpenAI models, and one mistake on the initial laion400m B/32 was that it used QuickGELU instead of nn.GELU. QuickGELU is not 'quick' vs the native nn.GELU; it is slower and uses more memory since it's not a fused kernel, and a matching approximation was never added to PyTorch (they only added a tanh approximation, not sigmoid).

If the text config doesn't specify a HF model, it's using the builtin BERT-style text encoder, and its pretrained weights are loaded from a CLIP-trained model, not from an original text model, as there is none (the builtin text encoders are only trained from scratch via image-text training). The HF text towers support loading from original pretrained weights...
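For reference, a small sketch of the two activations being compared. QuickGELU here is the sigmoid-based approximation as it appears in the OpenAI CLIP code; because it is composed of separate elementwise ops, it doesn't get a fused kernel the way nn.GELU does:

import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # Sigmoid-based GELU approximation from the OpenAI CLIP code; built from
    # separate elementwise ops, so slower and more memory-hungry than nn.GELU.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

x = torch.linspace(-3, 3, 7)
print(QuickGELU()(x))
print(nn.GELU()(x))                    # exact GELU (fused)
print(nn.GELU(approximate="tanh")(x))  # the only approximation PyTorch added: tanh, not sigmoid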
