Loading only text/image model #721
Replies: 5 comments 1 reply
-
@nicolas-dufour the best way right now is to create the full model and extract the vision or text tower.

vision:

```python
import torch
from urllib.request import urlopen
from PIL import Image
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/ViT-B-16-SigLIP-i18n-256')
model = model.visual
# model.cuda()  # move to hardware here

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
output = model(preprocess(image).unsqueeze(0))
output.shape
>>> torch.Size([1, 768])
```

text:

```python
import torch
import open_clip

model, preprocess = open_clip.create_model_from_pretrained(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k',
    force_custom_text=True,  # this forces a model that has the '.text' attr
)
model = model.text
tokenizer = open_clip.get_tokenizer('ViT-B-32')  # tokenizer must match the model
text = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(text, context_length=model.context_length)
output = model(text)
output.shape
>>> torch.Size([4, 512])
```
-
Also not particularly documented but useful: the towers can output the token sequences as well (not just the pooled embedding), so the vision and text examples above can be modified to return those.
-
So the HF transformers approach is a little more efficient: it only creates the vision or text tower. However, both approaches have to download and read the full checkpoint file. What transformers skips is instantiating the other tower on CPU and copying its parameters from the checkpoint loaded into memory. In the approach above, both towers get built; then `model = model.visual` (or `model.text`) discards one, which gets cleaned up by the garbage collector, and the remaining tower can then be moved to the GPU, etc.
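For reference, a sketch of the transformers route: `CLIPTextModel` builds only the text tower. To keep this runnable offline, it initializes from a default `CLIPTextConfig` with random weights; the checkpoint name and token ids are illustrative.

```python
import torch
from transformers import CLIPTextModel, CLIPTextConfig

# With a real checkpoint you'd use (downloads/reads the full file, but only
# instantiates and fills the text tower):
# text_model = CLIPTextModel.from_pretrained('openai/clip-vit-base-patch32')

# Offline sketch: same architecture, randomly initialized.
text_model = CLIPTextModel(CLIPTextConfig())

input_ids = torch.tensor([[49406, 320, 1929, 49407]])  # illustrative token ids
out = text_model(input_ids=input_ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 4, 512]) with default config
```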
-
@nicolas-dufour I'm going to move this to a discussion for others to reference. I'll think about adding a mechanism to do this without first loading the unused tower, but on complexity vs simplicity tradeoffs, most of the models here aren't so huge that there's an unacceptable penalty to loading on CPU first, discarding part, and then moving to a GPU/TPU/etc.
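The load-then-discard pattern described above can be sketched with a toy two-tower module (tower shapes are made up for illustration; in open_clip the towers would be `model.visual` and `model.text`):

```python
import gc
import torch
import torch.nn as nn

class ToyTwoTower(nn.Module):
    """Stand-in for a CLIP-style model with two towers."""
    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(16, 8)  # stand-in vision tower
        self.text = nn.Linear(32, 8)    # stand-in text tower

model = ToyTwoTower()    # full model built on CPU
model = model.visual     # keep only the tower you need
gc.collect()             # the discarded tower is reclaimed here
# model = model.cuda()   # then move only the kept tower to the accelerator

out = model(torch.randn(1, 16))
print(out.shape)  # torch.Size([1, 8])
```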
-
Thanks for the pointer! I was under the impression Hugging Face didn't need the full checkpoint; useful to know!
-
Hi,
I cannot find a way to load only the text or image model.
Is there an efficient way to do this in open_clip (similar to `CLIPTextModel` in transformers)?
Thanks!