How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer? #510
Comments
@NightMachinery so OpenCLIP doesn't have support for HF image towers right now, only text towers. It does support timm image towers, though: you can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json but with the model you want, and set the timm pretrained flag to True to load the weights from timm on init of the CLIP model (see the sketch below). There has been some work to allow more flexible loading of arbitrary local checkpoints into the towers (#333 and #255)... but unfortunately they are out of date, lost in the shuffle of many changes, and someone needs to spend some time to redo that effort cleanly on the existing code...
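For illustration, a minimal sketch of such a config, assuming the key names used by the existing open_clip model_configs/*.json files (embed_dim, vision_cfg with the timm_* options, text_cfg); the file name, the chosen timm model, and the embed_dim/text_cfg sizes here are placeholders and should be checked against the config schema of your open_clip version:

```python
# Hypothetical sketch: write a new open_clip model config whose image tower is a
# timm ViT, with timm_model_pretrained=True so the timm weights are loaded when
# the CLIP model is created. Key names mirror existing model_configs/*.json files.
import json

config = {
    "embed_dim": 512,                    # placeholder; pick to match your text tower
    "vision_cfg": {
        "timm_model_name": "vit_base_patch16_224.augreg_in21k",  # timm model to use as image tower
        "timm_model_pretrained": True,   # load timm pretrained weights on init
        "timm_pool": "",                 # no timm pooling; rely on the projection below
        "timm_proj": "linear",           # project timm features to embed_dim
        "image_size": 224,
    },
    "text_cfg": {                        # builtin text tower (trained from scratch via image-text training)
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12,
    },
}

# Hypothetical file name; drop it into src/open_clip/model_configs/ so
# open_clip.create_model_and_transforms() can find the model by that name.
with open("vit-base-augreg-in21k-clip.json", "w") as f:
    json.dump(config, f, indent=2)
```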
BTW, not sure if you specifically want the patch32 large, but the later augreg models ("How to Train Your ViT") are MUCH better ImageNet-21k pretrained models, although they never did the L/32 config. The newer B/16 is significantly better than the older L/32 but needs more compute. Even the newer B/32 and S/16 are better and use less compute. https://huggingface.co/timm/vit_base_patch16_224.augreg_in21k
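For reference, a minimal sketch of pulling that recommended checkpoint straight from timm (the model name is the one linked above; num_classes=0 is used here just to strip the classifier head and treat the model as a feature extractor):

```python
# Load the augreg ImageNet-21k B/16 weights via timm and run a dummy image through it.
import timm
import torch

model = timm.create_model("vit_base_patch16_224.augreg_in21k", pretrained=True, num_classes=0)
model.eval()

with torch.no_grad():
    features = model(torch.randn(1, 3, 224, 224))  # pooled features, no classifier head
print(features.shape)
```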
Thanks for the tip on "How to Train Your ViT." I indeed only want some transformer-only (no conv) ViT; it doesn't matter much which specific setup. My goal is not to pretrain an image tower. I want to implement a new post-hoc interpretability technique on Vision Transformers, and I want to avoid duplicating this interpretability code for both a normal ViT and a CLIP image tower. My understanding of the timm wrapper is that it uses the timm model's code. So if I modify timm, I can successfully work with normal (ImageNet classifier) ViTs, but I would not be able to load a CLIP image tower that uses this timm code, correct? To repeat myself: I want to change the transformer code and see this change take effect with the weights of both. I do not want to train/finetune any weights. Thanks again.
@NightMachinery timm can load the CLIP image tower weights for the main openai and laion2b checkpoints, so yes, it is possible to just use the timm backbone code for the image tower. Otherwise, yeah, you'll have to make the same mods to both timm and the builtin OpenCLIP transformers, and they are different styles...
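As a hedged illustration (the exact model tags are an assumption; the `.openai` / `.laion2b` suffixed ViT names exist in recent timm releases, but check timm.list_models for your installed version):

```python
# Sketch: discover which CLIP image-tower weights your timm version exposes, then
# load one as a plain backbone (num_classes=0 drops any head).
import timm

print(timm.list_models("*clip*", pretrained=True)[:10])

# Assumed tag -- verify it appears in the list above before relying on it.
tower = timm.create_model("vit_base_patch16_clip_224.openai", pretrained=True, num_classes=0)
```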
I have taken a look at the code, and I don't understand why the activation has been set to GELU. Shouldn't timm's CLIP models use QuickGELU? I also don't understand how open_clip knows which text encoder to load. In the config you have linked there is no hub ID, so which pretrained weights are loaded?
@NightMachinery QuickGELU is a legacy activation from the OpenAI models, and a mistake on the initial laion400m B/32, which used it instead of nn.GELU. QuickGELU is not 'quick' vs the native nn.GELU; it is slower and uses more memory since it's not a fused kernel, and a matching approximation was never added to PyTorch (they only added the tanh approximation, not the sigmoid one). If the text config doesn't specify a HF model, it's using the builtin BERT-style text encoder, and pretrained weights are loaded from a CLIP-trained model, not from an original text model, as there is none (the builtin text encoders are only trained from scratch via image-text training). The HF text towers support loading from original pretrained weights...
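For reference, QuickGELU is just the sigmoid-based GELU approximation used in the original CLIP code, x * sigmoid(1.702 * x); a minimal sketch comparing it with PyTorch's exact and tanh-approximated nn.GELU:

```python
# QuickGELU as defined in the OpenAI CLIP / open_clip code: an element-wise
# sigmoid approximation of GELU with no fused PyTorch kernel.
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

x = torch.randn(4)
print(QuickGELU()(x))
print(nn.GELU()(x))                    # exact GELU
print(nn.GELU(approximate="tanh")(x))  # the tanh approximation PyTorch did add
```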
How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer?
I want to do some research on ViT, and I need my changes to be visible in both ViT and CLIP's ViT.
Should I load the model weights using HuggingFace, and then manually update the state dict of OpenCLIP's VisionTransformer using these weights?
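A hedged sketch of why this manual route is awkward: the HF ViT and OpenCLIP's VisionTransformer use different parameter names and layouts, so copying weights across would need a hand-written key remapping. The comparison below uses open_clip's ViT-L-14 config (open_clip has no builtin L/32 config) purely to show the naming difference:

```python
# Inspect the two state dicts side by side. A full transfer would require mapping
# HF keys (embeddings.*, encoder.layer.*, ...) onto OpenCLIP keys (conv1, class_embedding,
# transformer.resblocks.*, ...) -- which is why the timm image-tower route above is simpler.
import open_clip
from transformers import ViTModel

hf_vit = ViTModel.from_pretrained("google/vit-large-patch32-224-in21k")
clip_model, _, _ = open_clip.create_model_and_transforms("ViT-L-14")  # untrained, for key comparison only

print(list(hf_vit.state_dict())[:5])
print(list(clip_model.visual.state_dict())[:5])
```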