
How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer? #510

Closed
NightMachinery opened this issue Apr 26, 2023 · 6 comments


@NightMachinery

How do I load the pretrained weights of google/vit-large-patch32-224-in21k into VisionTransformer?

I want to do some research on ViT, and I need my changes to be visible in both ViT and CLIP's ViT.

Should I load the model weights using HuggingFace, and then manually update the state dict of OpenCLIP's VisionTransformer using these weights?

@rwightman
Collaborator

@NightMachinery so google/vit-large-patch32-224-in21k is an ImageNet-21k pretrained ViT in HF transformers ViT form. Do you want to do LiT-style pretraining and load the vision tower with those weights?

OpenCLIP doesn't have support for HF image towers right now, only text towers. It does support timm image towers, though, and the same weights exist in timm as https://huggingface.co/timm/vit_large_patch32_224.orig_in21k (the model card still needs updating, but they are the same weights).

You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.
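For concreteness, a minimal sketch (untested) of such a config, written as a Python dict dumped to JSON. The vision_cfg keys mirror the TimmModel(...) arguments quoted later in this thread, and the text_cfg matches the values quoted below from vit_medium_patch16_gap_256.json; the embed_dim, the config filename, and the tagged timm model name are illustrative assumptions, not values taken from the repo.

import json

# Sketch of a custom open_clip model config that pulls the image tower from timm.
# embed_dim, the filename, and the tagged timm model name are assumptions.
cfg = {
    "embed_dim": 512,  # assumption: pick to match the text tower / projection width
    "vision_cfg": {
        "timm_model_name": "vit_large_patch32_224.orig_in21k",
        "timm_model_pretrained": True,  # load the ImageNet-21k weights from timm at init
        "timm_pool": "",
        "timm_proj": "linear",
        "image_size": 224
    },
    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }
}

# Drop the file into src/open_clip/model_configs/ so the factory discovers it by name.
with open("src/open_clip/model_configs/vit_large_patch32_224_in21k.json", "w") as f:
    json.dump(cfg, f, indent=4)

Something like open_clip.create_model_and_transforms("vit_large_patch32_224_in21k") should then build the CLIP model with the image tower initialized from the timm weights.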

There has been some work to allow more flexible loading of arbitrary local checkpoints into the towers (#333 and #255), but unfortunately those PRs are out of date, lost in the shuffle of many changes, and someone needs to spend some time redoing that effort cleanly on the existing code...

@rwightman
Collaborator

rwightman commented Apr 26, 2023

BTW, not sure if you specifically want the patch32 large, but the later AugReg models ("How to Train Your ViT") are MUCH better ImageNet-21k pretrained models, although they never did the L/32 config.

The newer B/16 is significantly better than the older L/32 but needs more compute. Even the newer B/32 and S/16 are better, with less compute.

https://huggingface.co/timm/vit_base_patch16_224.augreg_in21k
https://huggingface.co/timm/vit_small_patch16_224.augreg_in21k
https://huggingface.co/timm/vit_base_patch32_224.augreg_in21k

@NightMachinery
Author

NightMachinery commented Apr 26, 2023

> @NightMachinery so google/vit-large-patch32-224-in21k is an ImageNet-21k pretrained ViT in HF transformers ViT form. Do you want to do LiT-style pretraining and load the vision tower with those weights?
>
> OpenCLIP doesn't have support for HF image towers right now, only text towers. It does support timm image towers, though, and the same weights exist in timm as https://huggingface.co/timm/vit_large_patch32_224.orig_in21k (the model card still needs updating, but they are the same weights).
>
> You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.
>
> There has been some work to allow more flexible loading of arbitrary local checkpoints into the towers (#333 and #255), but unfortunately those PRs are out of date, lost in the shuffle of many changes, and someone needs to spend some time redoing that effort cleanly on the existing code...

Thanks for the tip on "How to Train Your ViT." I indeed only want a transformer-only (no conv) ViT; it doesn't matter much which specific setup.

My goal is not to pretrain an image tower. I want to implement a new post-hoc interpretability technique for Vision Transformers, and I want to avoid duplicating the interpretability code for a normal ViT and for a CLIP image tower.

My understanding of the timm wrapper is that it uses the timm model's code. So if I modify timm, I can successfully work with normal (ImageNet classifier) ViTs, but I would not be able to load a CLIP image tower that uses this timm code, correct?

To repeat myself: I want to change the transformer code and see that change take effect with the weights of both timm/vit_base_patch16_224.augreg_in21k and laion/CLIP-ViT-B-32-laion2B-s34B-b79K. Ideally I want to change the code in only one place and avoid duplicating the changes across two codebases. This is mostly because the changes are experimental, and I want to be able to easily experiment and see the results on both ImageNet classifiers and contrastive CLIP image towers.

I do not want to train/finetune any weights.

Thanks again.

@rwightman
Collaborator

rwightman commented Apr 26, 2023

@NightMachinery timm can load the CLIP image tower weights for the main openai and laion2b checkpoints

https://github.com/huggingface/pytorch-image-models/blob/9ee846ff0cbbc05a99b45140aa6d84083bcf6488/timm/models/vision_transformer.py#L1188-L1225

So yes, it is possible to just use the timm backbone code for the image tower. Otherwise, yeah, you'll have to make the same mods to both timm and the builtin OpenCLIP transformer, and they are different styles...
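To make that concrete, a minimal sketch (untested) using model names already mentioned in this thread; both checkpoints instantiate timm's VisionTransformer, so a change made inside timm's vision_transformer.py is visible to both:

import timm

# Plain ImageNet-21k pretrained ViT (AugReg recipe).
vit_in21k = timm.create_model("vit_base_patch16_224.augreg_in21k", pretrained=True)

# laion2b CLIP image tower loaded through the same VisionTransformer class;
# num_classes=0 drops the classifier head so only the backbone/tower remains.
vit_clip_tower = timm.create_model(
    "vit_base_patch16_clip_224.laion2b", pretrained=True, num_classes=0)

print(type(vit_in21k).__name__, type(vit_clip_tower).__name__)  # both VisionTransformer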

@NightMachinery
Author

> You can create a model config like https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/model_configs/vit_medium_patch16_gap_256.json, but with the model you want, and set the timm pretrained flag to True so the weights are loaded from timm when the CLIP model is initialized.

I have taken a look at the code:

visual = TimmModel(
    vision_cfg.timm_model_name,
    pretrained=vision_cfg.timm_model_pretrained,
    pool=vision_cfg.timm_pool,
    proj=vision_cfg.timm_proj,
    proj_bias=vision_cfg.timm_proj_bias,
    drop=vision_cfg.timm_drop,
    drop_path=vision_cfg.timm_drop_path,
    embed_dim=embed_dim,
    image_size=vision_cfg.image_size,
)
act_layer = nn.GELU  # so that text transformer doesn't use QuickGELU w/ timm models

I don't understand why the activation has been set to GELU here. Shouldn't timm's CLIP models like vit_base_patch16_clip_224.laion2b use QuickGELU?


I also don't understand how open_clip knows which text encoder to load. In the config you have linked:

    "text_cfg": {
        "context_length": 77,
        "vocab_size": 49408,
        "width": 512,
        "heads": 8,
        "layers": 12
    }

There is no hub ID. So which pretrained weights are loaded?

@rwightman
Collaborator

@NightMachinery QuickGELU is a legacy activation from the OpenAI models, and one mistake on the initial laion400m B/32 was that it used QuickGELU instead of nn.GELU. QuickGELU is not 'quick' vs the native nn.GELU; it is slower and uses more memory since it's not a fused kernel, and a matching approximation was never added to PyTorch (they only added a tanh approximation, not sigmoid).

If the text config doesn't specify a HF model, it's using the builtin BERT-style text encoder, and its pretrained weights are loaded from a CLIP-trained model, not from an original text model, as there is none (the builtin text encoders are only trained from scratch via image-text training). The HF text towers support loading from original pretrained weights...
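For reference, a small sketch of the two activations being compared. QuickGELU here is the sigmoid-based approximation as it appears in the OpenAI CLIP code; because it is composed of separate elementwise ops, it doesn't get a fused kernel the way nn.GELU does:

import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # Sigmoid-based GELU approximation from the OpenAI CLIP code; built from
    # separate elementwise ops, so slower and more memory-hungry than nn.GELU.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

x = torch.linspace(-3, 3, 7)
print(QuickGELU()(x))
print(nn.GELU()(x))                    # exact GELU (fused)
print(nn.GELU(approximate="tanh")(x))  # the only approximation PyTorch added: tanh, not sigmoid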
