Encode using FP16 #822
This is exactly what is mentioned in #79, I guess, but there is no clear answer there as far as I understand. |
I did not observe any speed improvements when converting the model to FP16. What you can do to convert the embeddings is:
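The snippet from this comment was not carried over here; a minimal sketch of the idea (model name and sentences are placeholders) could be:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
sentences = ["An example sentence", "Another one"]  # placeholder data

embeddings = model.encode(sentences)              # float32 NumPy array by default
embeddings_fp16 = embeddings.astype(np.float16)   # halves the memory / disk footprint
np.save("embeddings_fp16.npy", embeddings_fp16)
```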
This will store the embeddings in FP16 and reduce the size you need on disc. |
Thank you very much for the answer. This is exactly what I was trying; I thought I was misunderstanding something, since I get an error when trying to call util.pytorch_cos_sim, e.g.:
I'm getting:
Any hints are appreciated. |
In the just-released version 1.0.0, there is a new parameter for the encode function:
You normalize the embeddings, then convert them to FP16. Then you can use the dot product (dot_score) instead of cosine similarity. |
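A minimal sketch of that workflow, assuming the new parameter referred to is `normalize_embeddings` (model name and sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
sentences = ["An example sentence", "Another one"]  # placeholder data

# Normalize at encode time, then downcast to FP16 for storage.
emb = model.encode(sentences, normalize_embeddings=True, convert_to_tensor=True)
emb_fp16 = emb.half()

# For unit-length vectors, the dot product equals the cosine similarity.
# (On some CPU/PyTorch setups FP16 matmul is unsupported or slow; upcast with .float() if needed.)
scores = util.dot_score(emb_fp16[:1], emb_fp16)
```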
Very nice! I will try that. Cheers! |
That worked perfectly. Sadly, the dot product in FP16 is roughly 10x slower than in FP32, so it's rather unusable. I realize this is a Torch limitation, but any tips or known workarounds are appreciated. |
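One possible workaround, sketched here rather than taken from the thread: keep FP16 only for what sits on disk, and upcast to FP32 right before scoring (`corpus` and the model name are placeholders):

```python
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
corpus = ["Document one", "Document two"]        # placeholder corpus

# Store the corpus embeddings in FP16 to halve the disk footprint.
corpus_fp16 = model.encode(corpus, normalize_embeddings=True).astype(np.float16)
np.save("corpus_fp16.npy", corpus_fp16)

# Upcast to FP32 only for the similarity computation, which stays fast.
corpus_emb = torch.from_numpy(np.load("corpus_fp16.npy")).float()
query_emb = torch.from_numpy(model.encode(["example query"], normalize_embeddings=True))
scores = util.dot_score(query_emb, corpus_emb)
```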
I add an FC layer to make the output embedding much smaller, then fine-tune the whole model (my FC layer and the BERT) on my downstream task. It works very well. |
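For context, a sketch of that pattern using sentence-transformers' own Dense module (base model and output dimension are arbitrary choices here, not the commenter's actual setup):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

# Project the pooled sentence embedding down to a smaller dimension (e.g. 256).
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
# The whole stack (BERT + Dense head) can then be fine-tuned on the downstream task.
```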
I'm using version 2.2.0 and followed the suggestions to reduce precision. However, when I try the |
As for my try, the |
@nreimers Hey, quick question. If we train the bi-encoder with FP16 mixed-precision training (use_amp=True in .fit), then during inference, is it okay to just use model.encode() (which does everything in FP32 as opposed to FP16)? |
Yes |
How do I convert a model to FP16 (thus reducing its size)? clip-ViT-L-14 is 1.7 GB in FP32, and it would be great to reduce the size, as it takes too much memory when running inference in parallel. |
I was not able to find a direct method of loading a model in FP16, but found a hacky workaround using PyTorch's .half():

```python
# Half precision for inference mode - this is a bit of a hack, but it works
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer(model_name)
for module in bi_encoder.modules():  # modules() is a method; note the parentheses
    module.half()
```

I didn't see a performance drop in my evaluation script. |
That's not very hacky at all, in my opinion. The following should also work:

```python
from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer(model_name)
bi_encoder.half()
embeddings = bi_encoder.encode(...)
```

In an upcoming version, after #2578, you'll be able to pass |
Hi, I was wondering what the correct usage for this is, but I got an error: |
Solved it by referring to #2889. Thanks! |
Hi,
Is it possible to encode a text and store it in FP16? I need to store a large number of encoded vectors, taking approx. 2 GB in memory and on disk, so it would be great to be able to reduce that.
Using .half() on my vectors results in this error: `"clamp_min_cpu" not implemented for 'Half'`
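A possible way around that error, sketched under the assumption that the Half op is failing on CPU: keep encoding (and any normalization) in FP32, and cast to FP16 only for the vectors you store (`texts` and the model name are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name
texts = ["Some text to encode"]                  # placeholder data

# Encode and normalize in FP32, avoiding Half-precision ops on CPU...
emb = model.encode(texts, normalize_embeddings=True)  # float32 NumPy array

# ...then downcast only the final vectors before saving them.
np.save("vectors_fp16.npy", emb.astype(np.float16))
```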