Loading only text/image model #721
Replies: 5 comments 1 reply
-
@nicolas-dufour the best way right now is to create the full model and extract the vision or text tower.

vision:

```python
import torch
from urllib.request import urlopen
from PIL import Image
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')
model = model.visual
# model.cuda()  # move to hardware here

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
output = model(preprocess(image).unsqueeze(0))
output.shape
>>> torch.Size([1, 768])
```

text:

```python
import torch
import open_clip

model, preprocess = open_clip.create_model_from_pretrained(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k',
    force_custom_text=True,  # this forces a model that has the '.text' attr
)
model = model.text
tokenizer = open_clip.get_tokenizer('ViT-B-32')  # tokenizer must match the model
text = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(text, context_length=model.context_length)
output = model(text)
output.shape
>>> torch.Size([4, 512])
```
-
Also not particularly documented but useful: the towers can output the token sequences as well (not just the pooled embedding), so the vision and text examples above can be modified to return those.
-
So the HF transformers approach is a little more efficient: it only creates the vision or text tower. However, both approaches have to download and read the full checkpoint file. What transformers skips is instantiating the other tower on CPU and copying its parameters from the checkpoint loaded into memory. In the approach above, both towers get built; then `model = model.visual` (or `model.text`) discards one, which gets cleaned up by the garbage collector, and the remaining tower can then be moved to the GPU, etc.
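For reference, a sketch of the transformers route: `CLIPTextModel` builds only the text tower. To keep this runnable offline, it initializes from a default `CLIPTextConfig` with random weights; the checkpoint name and token ids are illustrative.

```python
import torch
from transformers import CLIPTextModel, CLIPTextConfig

# With a real checkpoint you'd use (downloads/reads the full file, but only
# instantiates and fills the text tower):
# text_model = CLIPTextModel.from_pretrained('openai/clip-vit-base-patch32')

# Offline sketch: same architecture, randomly initialized.
text_model = CLIPTextModel(CLIPTextConfig())

input_ids = torch.tensor([[49406, 320, 1929, 49407]])  # illustrative token ids
out = text_model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 4, 512]) with default config
```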
-
@nicolas-dufour I'm going to move this to a discussion for others to reference. I'll think about adding a mechanism to do this without first loading the unused tower, but on complexity vs simplicity tradeoffs, most of the models here aren't so huge that there's an unacceptable penalty to loading on CPU first, discarding part, and then moving to a GPU/TPU/etc.
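The load-then-discard pattern described above can be sketched with a toy two-tower module (tower shapes are made up for illustration; in open_clip the towers would be `model.visual` and `model.text`):

```python
import gc
import torch
import torch.nn as nn

class ToyTwoTower(nn.Module):
    """Stand-in for a CLIP-style model with two towers."""
    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(16, 8)  # stand-in vision tower
        self.text = nn.Linear(32, 8)    # stand-in text tower

model = ToyTwoTower()    # full model built on CPU
model = model.visual     # keep only the tower you need
gc.collect()             # the discarded tower is reclaimed here
# model = model.cuda()   # then move only the kept tower to the accelerator

out = model(torch.randn(1, 16))
print(out.shape)  # torch.Size([1, 8])
```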
-
Thanks for the pointer! I was under the impression Hugging Face didn't need the full checkpoint; useful to know!
-
Hi,
I cannot find a way to load only the text or image model.
Is there an efficient way to do this in open_clip (similar to `CLIPTextModel` in transformers)?
Thanks!