Missing documentation for encode for image embedding models #3118

Open

KennethEnevoldsen opened this issue Dec 4, 2024 · 1 comment
KennethEnevoldsen commented Dec 4, 2024

I can't seem to find the documentation for encode when encoding images:

from sentence_transformers import SentenceTransformer
from PIL import Image
model = SentenceTransformer('clip-ViT-B-32')  # Load CLIP model
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))  # no documentation for this step

I am asking because we want to build a compatible interface for image embeddings in mteb.

We are also working on the multimodal interface (e.g. for models like https://huggingface.co/TIGER-Lab/VLM2Vec-Full).

tomaarsen (Collaborator) commented

Hello!

Indeed, this is not documented very nicely because I'm considering deprecating the current CLIPModel module in favor of making the much more common Transformer module multimodal.

I did some experiments with this today, and I think there's potential. We would move towards AutoProcessor instead of AutoTokenizer. We can then feed whatever inputs the user has to the tokenizer/processor/feature extractor, etc., and pass the result directly into the model.
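
For illustration, a minimal sketch of what that flow could look like with a CLIP checkpoint (the checkpoint name and the get_*_features calls are just one concrete example, not the planned design):

from PIL import Image
from transformers import AutoModel, AutoProcessor

# Example checkpoint only; any processor-backed multimodal model would do
processor = AutoProcessor.from_pretrained('openai/clip-vit-base-patch32')
model = AutoModel.from_pretrained('openai/clip-vit-base-patch32')

# The processor routes text to its tokenizer...
text_inputs = processor(text=['two dogs in the snow'], return_tensors='pt', padding=True)
# ...and images to its image processor
image_inputs = processor(images=Image.open('two_dogs_in_snow.jpg'), return_tensors='pt')

text_emb = model.get_text_features(**text_inputs)     # [1, 512]
image_emb = model.get_image_features(**image_inputs)  # [1, 512]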

We do then have to be careful about what the model returns. For text-based models, we always grab the last_hidden_state and then do Pooling in a separate pooler module, but with multimodal systems (CLIP, CLAP) it seems more common to rely on the model's own pooling. This certainly simplifies things, as we otherwise have to feed multiple token/patch embeddings to the pooler, sometimes even with different dimensionalities.
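
To make the contrast concrete, here's a rough sketch of the text-model path, where pooling happens as a separate step over last_hidden_state (bert-base-uncased is just an example checkpoint):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text_model = AutoModel.from_pretrained('bert-base-uncased')

encoded = tokenizer(['two dogs in the snow'], return_tensors='pt', padding=True)
token_embeddings = text_model(**encoded).last_hidden_state  # [1, seq_len, 768]

# Mean pooling over non-padding tokens, done outside the model
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)

# A CLIP-style model skips this step: get_text_features / get_image_features
# (see the sketch above) already return one pooled vector per input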

I have to be quite wary here, as I rely fully on transformers.

Either way, the interface will always remain the same regardless of how it's implemented behind the scenes, and your snippet is correct: you can pass PIL.Image instances to model.encode.
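
For example, extending your snippet to compare the image against text captions (the caption strings are made up; util.cos_sim comes from sentence_transformers):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Images and text are embedded into the same vector space via encode
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table'])

# Cosine similarity between the image and each caption
print(util.cos_sim(img_emb, text_emb))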

  • Tom Aarsen
