
Encode using FP16 #822

Open
fcggamou opened this issue Mar 18, 2021 · 16 comments
@fcggamou

Hi,

Is it possible to encode a text and store it in FP16? I need to store a large number of encoded vectors, which take approximately 2 GB in memory and on disk, so it would be great to be able to reduce that.

Using .half() on my vectors results in this error: "clamp_min_cpu" not implemented for 'Half'

@fcggamou
Author

This is exactly what is mentioned in #79, I guess, but as far as I understand there is no clear answer there.

@nreimers
Member

I did not observe any speed improvements when converting the model to FP16.

What you can do to convert the embeddings is:

emb = model.encode(sentences, convert_to_tensor=True).half()

This will store the embeddings in FP16 and reduce the size you need on disk.
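
For persisting those FP16 embeddings, a minimal sketch (the model name and file path below are illustrative, not from this thread) could look like:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model; picked for illustration
sentences = ["First sentence.", "Second sentence."]

# Encode once, then cast to FP16 to roughly halve the memory and disk footprint
emb = model.encode(sentences, convert_to_tensor=True).half()

torch.save(emb, "corpus_fp16.pt")          # about half the size of an FP32 dump
emb_loaded = torch.load("corpus_fp16.pt")  # loads back as torch.float16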

@fcggamou
Author

fcggamou commented Mar 18, 2021

Thank you very much for the answer. This is exactly what I was trying; I thought I was misunderstanding something, since I get an error when calling util.pytorch_cos_sim

E.g.:

embedding_1 = model.encode(text1, convert_to_tensor=True).half()
embedding_2 = model.encode(text2, convert_to_tensor=True).half()
util.pytorch_cos_sim(embedding_1, embedding_2)

I'm getting:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-66-6cb5318b96e9> in <module>()
      3 embedding_1 = model.encode(text, convert_to_tensor=True).half()
      4 embedding_2 = model.encode(text, convert_to_tensor=True).half()
----> 5 util.pytorch_cos_sim(embedding_1, embedding_2)
      6 

1 frames
/usr/local/lib/python3.7/dist-packages/sentence_transformers/util.py in pytorch_cos_sim(a, b)
     33         b = b.unsqueeze(0)
     34 
---> 35     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     36     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
     37     return torch.mm(a_norm, b_norm.transpose(0, 1))

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in normalize(input, p, dim, eps, out)
   4268         return handle_torch_function(normalize, (input,), input, p=p, dim=dim, eps=eps, out=out)
   4269     if out is None:
-> 4270         denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
   4271         return input / denom
   4272     else:

RuntimeError: "clamp_min_cpu" not implemented for 'Half'

Any hints are appreciated

@nreimers
Member

In the just-released version 1.0.0 there is a new parameter for the encode function:

emb1 = model.encode(text1, convert_to_tensor=True, normalize_embeddings=True).half()
emb2 = model.encode(text2, convert_to_tensor=True, normalize_embeddings=True).half()
scores = util.dot_score(emb1, emb2)

You normalize the embeddings, then convert them to FP16. Then you can use the dot product (dot_score) instead of cosine similarity.
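
To spell out why this works: normalize_embeddings=True performs the L2 normalization in FP32 inside encode(), and for unit-length vectors the dot product equals the cosine similarity, so the FP16 normalize call that fails on CPU is never needed. A minimal sketch (model and texts are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Normalize in FP32 inside encode(), then downcast the already-normalized vectors
emb1 = model.encode("A man is eating food.", convert_to_tensor=True, normalize_embeddings=True).half()
emb2 = model.encode("A man is eating bread.", convert_to_tensor=True, normalize_embeddings=True).half()

# For unit-length vectors, dot product == cosine similarity
scores = util.dot_score(emb1, emb2)
print(scores)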

@fcggamou
Author

Very nice! I will try that.

Cheers!

@fcggamou
Author

That worked perfectly. Sadly, the FP16 dot product is roughly 10x slower than FP32, so it's rather unusable.

I realize this is a Torch limitation, but any tips or known workarounds are appreciated.
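
One workaround not mentioned in the thread is to keep the embeddings in FP16 only for storage and upcast to FP32 right before scoring, where CPU matmul is fast; a minimal sketch (file paths are illustrative, and the upcast costs a temporary FP32 copy):

import torch
from sentence_transformers import util

# Keep the corpus on disk / in memory as FP16 ...
corpus_emb = torch.load("corpus_fp16.pt")   # illustrative path
query_emb = torch.load("queries_fp16.pt")   # illustrative path

# ... but upcast to FP32 just for the scoring step
scores = util.dot_score(query_emb.float(), corpus_emb.float())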

@yechenzhi

I added an FC layer to make the output embedding much smaller, then fine-tuned the whole model (my FC layer plus BERT) on my downstream task. It works very well.
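
A sketch of that kind of dimensionality-reduction head using the library's models.Dense module (the base model and layer sizes are illustrative, and the fine-tuning step from the comment is omitted):

from torch import nn
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

# Extra FC layer that projects the pooled 768-dim output down to 128 dims
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=128,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
# The whole stack (BERT + Dense) would then be fine-tuned on the downstream task.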

@foadyousef

I'm using version 2.2.0 and followed the suggestions to reduce precision. However, when I try util.semantic_search, I still get the same original error:
RuntimeError: "clamp_min_cpu" not implemented for 'Half'

@Wunaiq

Wunaiq commented Aug 18, 2022

In my case, the RuntimeError: "clamp_min_cpu" not implemented for 'Half' occurs because the operation doesn't support FP16 on CPU, but it works well with FP16 on GPU. So try moving the operation onto the GPU.
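
A sketch of that GPU workaround (the model name is illustrative, and a CUDA device is assumed to be available):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # illustrative model

# On GPU the FP16 kernels (including clamp_min) are implemented, so cos_sim works
emb1 = model.encode("text one", convert_to_tensor=True).to("cuda").half()
emb2 = model.encode("text two", convert_to_tensor=True).to("cuda").half()

scores = util.pytorch_cos_sim(emb1, emb2)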

@sidhantls
Contributor

sidhantls commented Sep 1, 2022

@nreimers Hey, quick question. If we train the bi-encoder with FP16 mixed-precision training (use_amp=True in .fit), is it okay during inference to just use model.encode() (which does everything in FP32 rather than FP16)?

@nreimers
Member

nreimers commented Sep 1, 2022

Yes
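
For reference, a minimal sketch of that setup with the pre-3.0 .fit() API (the model, data, and loss here are illustrative): mixed precision is only used during training, and encode() stays in plain FP32 at inference.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

train_examples = [
    InputExample(texts=["Anchor text", "Related text"], label=1.0),
    InputExample(texts=["Anchor text", "Unrelated text"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# AMP (FP16 mixed precision) is applied only while training
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, use_amp=True)

# Inference can simply stay in FP32, as confirmed above
embeddings = model.encode(["Some new sentence"])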

@rossbg

rossbg commented Aug 4, 2023

How can I convert a model to FP16 (thus reducing its size)? clip-ViT-L-14 is 1.7 GB in FP32, and it would be great to reduce that, since it takes too much memory when running inference in parallel.

@sorenmc

sorenmc commented Apr 25, 2024

I was not able to find a direct method of loading a model in FP16, but found a hacky workaround using PyTorch's .half() method:

# Half precision for inference mode - this is a bit of a hack, but it works
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer(model_name)  # model_name: your model name or path
for module in bi_encoder.modules():  # .modules() is a method, so it must be called
    module.half()

I didn't see a performance drop in my evaluation script.

@tomaarsen
Collaborator

That's not very hacky at all, in my opinion. The following should also work:

from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer(model_name)
bi_encoder.half()

embeddings = bi_encoder.encode(...)

In an upcoming version, after #2578, you'll be able to pass torch_dtype to the model and directly load the model in your desired precision.

  • Tom Aarsen

@shizhediao

Hi,

I was wondering what the correct usage for model_kwargs is.
I tried this:
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

but got an error:
RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float

@shizhediao

Solved it by referring to #2889

Thanks!
