
Encode using FP16 #822

Open
fcggamou opened this issue Mar 18, 2021 · 16 comments
@fcggamou

Hi,

Is it possible to encode a text and store it in FP16? I need to store a large number of encoded vectors, which take approximately 2 GB in memory and on disk, so it would be great to be able to reduce that.

Using .half() on my vectors results in this error: "clamp_min_cpu" not implemented for 'Half'

@fcggamou
Author

This is exactly what is mentioned in #79, I guess, but as far as I understand there is no clear answer there.

@nreimers
Member

I did not observe any speed improvements when converting the model to FP16.

What you can do to convert the embeddings is:

emb = model.encode(sentences, convert_to_tensor=True).half()

This will store the embeddings in FP16 and reduce the size you need on disk.
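
For persisting those FP16 embeddings, a minimal sketch (the model name and file path below are illustrative, not from this thread) could look like:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model; picked for illustration
sentences = ["First sentence.", "Second sentence."]

# Encode once, then cast to FP16 to roughly halve the memory and disk footprint
emb = model.encode(sentences, convert_to_tensor=True).half()

torch.save(emb, "corpus_fp16.pt")          # about half the size of an FP32 dump
emb_loaded = torch.load("corpus_fp16.pt")  # loads back as torch.float16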

@fcggamou
Author

fcggamou commented Mar 18, 2021

Thank you very much for the answer. This is exactly what I was trying; I thought I was misunderstanding something, since I get an error when calling util.pytorch_cos_sim

E.g.:

embedding_1 = model.encode(text1, convert_to_tensor=True).half()
embedding_2 = model.encode(text2, convert_to_tensor=True).half()
util.pytorch_cos_sim(embedding_1, embedding_2)

I'm getting:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-66-6cb5318b96e9> in <module>()
      3 embedding_1 = model.encode(text, convert_to_tensor=True).half()
      4 embedding_2 = model.encode(text, convert_to_tensor=True).half()
----> 5 util.pytorch_cos_sim(embedding_1, embedding_2)
      6 

1 frames
/usr/local/lib/python3.7/dist-packages/sentence_transformers/util.py in pytorch_cos_sim(a, b)
     33         b = b.unsqueeze(0)
     34 
---> 35     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     36     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
     37     return torch.mm(a_norm, b_norm.transpose(0, 1))

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in normalize(input, p, dim, eps, out)
   4268         return handle_torch_function(normalize, (input,), input, p=p, dim=dim, eps=eps, out=out)
   4269     if out is None:
-> 4270         denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
   4271         return input / denom
   4272     else:

RuntimeError: "clamp_min_cpu" not implemented for 'Half'

Any hints are appreciated

@nreimers
Member

In the just-released version 1.0.0 there is a new parameter for the encode function:

emb1 = model.encode(text1, convert_to_tensor=True, normalize_embeddings=True).half()
emb2 = model.encode(text2, convert_to_tensor=True, normalize_embeddings=True).half()
scores = util.dot_score(emb1, emb2)

You normalize the embeddings, then convert them to FP16. Then you can use the dot product (dot_score) instead of cosine similarity.
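
To spell out why this works: normalize_embeddings=True performs the L2 normalization in FP32 inside encode(), and for unit-length vectors the dot product equals the cosine similarity, so the FP16 normalize call that fails on CPU is never needed. A minimal sketch (model and texts are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Normalize in FP32 inside encode(), then downcast the already-normalized vectors
emb1 = model.encode("A man is eating food.", convert_to_tensor=True, normalize_embeddings=True).half()
emb2 = model.encode("A man is eating bread.", convert_to_tensor=True, normalize_embeddings=True).half()

# For unit-length vectors, dot product == cosine similarity
scores = util.dot_score(emb1, emb2)
print(scores)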

@fcggamou
Author

Very nice! I will try that.

Cheers!

@fcggamou
Author

That worked perfectly. Sadly, the FP16 dot product is roughly 10x slower than FP32, so it's rather unusable.

I realize this is a Torch limitation, but any tips or known workarounds are appreciated.
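
One workaround not mentioned in the thread is to keep the embeddings in FP16 only for storage and upcast to FP32 right before scoring, where CPU matmul is fast; a minimal sketch (file paths are illustrative, and the upcast costs a temporary FP32 copy):

import torch
from sentence_transformers import util

# Keep the corpus on disk / in memory as FP16 ...
corpus_emb = torch.load("corpus_fp16.pt")   # illustrative path
query_emb = torch.load("queries_fp16.pt")   # illustrative path

# ... but upcast to FP32 just for the scoring step
scores = util.dot_score(query_emb.float(), corpus_emb.float())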

@yechenzhi

I added an FC layer to make the output embedding much smaller, then fine-tuned the whole model (my FC layer plus BERT) on my downstream task. It works very well.
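
A sketch of that kind of dimensionality-reduction head using the library's models.Dense module (the base model and layer sizes are illustrative, and the fine-tuning step from the comment is omitted):

from torch import nn
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

# Extra FC layer that projects the pooled 768-dim output down to 128 dims
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=128,
    activation_function=nn.Tanh(),
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
# The whole stack (BERT + Dense) would then be fine-tuned on the downstream task.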

@foadyousef

I'm using version 2.2.0 and followed the suggestions to reduce precision. However, when I try util.semantic_search, I still get the same original error:
RuntimeError: "clamp_min_cpu" not implemented for 'Half'

@Wunaiq

Wunaiq commented Aug 18, 2022

In my case, the RuntimeError: "clamp_min_cpu" not implemented for 'Half' occurs because the operation doesn't support FP16 on CPU, but it works well with FP16 on GPU. So try moving the operation onto the GPU.
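
A sketch of that GPU workaround (the model name is illustrative, and a CUDA device is assumed to be available):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")  # illustrative model

# On GPU the FP16 kernels (including clamp_min) are implemented, so cos_sim works
emb1 = model.encode("text one", convert_to_tensor=True).to("cuda").half()
emb2 = model.encode("text two", convert_to_tensor=True).to("cuda").half()

scores = util.pytorch_cos_sim(emb1, emb2)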

@sidhantls
Contributor

sidhantls commented Sep 1, 2022

@nreimers Hey, quick question. If we train the bi-encoder with FP16 mixed-precision training (use_amp=True in .fit), is it okay during inference to just use model.encode() (which does everything in FP32 rather than FP16)?

@nreimers
Member

nreimers commented Sep 1, 2022

Yes
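
For reference, a minimal sketch of that setup with the pre-3.0 .fit() API (the model, data, and loss here are illustrative): mixed precision is only used during training, and encode() stays in plain FP32 at inference.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model

train_examples = [
    InputExample(texts=["Anchor text", "Related text"], label=1.0),
    InputExample(texts=["Anchor text", "Unrelated text"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# AMP (FP16 mixed precision) is applied only while training
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, use_amp=True)

# Inference can simply stay in FP32, as confirmed above
embeddings = model.encode(["Some new sentence"])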

@rossbg

rossbg commented Aug 4, 2023

How can I convert a model to FP16 (thus reducing its size)? clip-ViT-L-14 is 1.7 GB in FP32, and it would be great to reduce that, since it takes too much memory when running inference in parallel.

@sorenmc

sorenmc commented Apr 25, 2024

I was not able to find a direct method of loading a model in FP16, but found a hacky workaround using PyTorch's .half() method:

# Half precision for inference mode - this is a bit of a hack, but it works
from sentence_transformers import SentenceTransformer
bi_encoder = SentenceTransformer(model_name)  # model_name: your model name or path
for module in bi_encoder.modules():  # .modules() is a method, so it must be called
    module.half()

I didn't see a performance drop in my evaluation script.

@tomaarsen
Collaborator

That's not very hacky at all, in my opinion. The following should also work:

from sentence_transformers import SentenceTransformer

bi_encoder = SentenceTransformer(model_name)
bi_encoder.half()

embeddings = bi_encoder.encode(...)

In an upcoming version, after #2578, you'll be able to pass torch_dtype to the model and directly load the model in your desired precision.

  • Tom Aarsen

@shizhediao

Hi,

I was wondering what the correct usage for model_kwargs is.
I tried this:
model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

but got an error:
RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float

@shizhediao

Solved it by referring to #2889

Thanks!
