Is it possible to Quantize Sentence Transformer models? #2968
Comments
Hello!
Are you referring to throughput, i.e. inference speed, or to the evaluation performance on a benchmark of yours? Something to note is that while int8 is commonly used for LLMs, it's primarily used to shrink memory usage (at least, to my knowledge). Beyond that, I'm not very familiar with it. Another thing to consider is that a GPU might have solid int8 operations while a CPU might not, i.e. it might be faster on GPU but slower on CPU. I actually think this is the big difference. The upcoming release will introduce some more options for speeding up your models.
In the meantime, you can experiment with those PRs if you're interested (you can install them directly with pip from the corresponding PR branches).
Hello! I've added native ONNX support in Sentence Transformers, so users can now look at the Speeding up Inference documentation. Specifically, the "Quantizing ONNX Models" section should be valuable, as it allows for int8 quantization, which is much faster on CPUs:

from sentence_transformers import SentenceTransformer, export_dynamic_quantized_onnx_model

model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="onnx")
# export_dynamic_quantized_onnx_model expects a quantization config such as
# "arm64", "avx2", "avx512", or "avx512_vnni" (not an optimization level like "O3");
# "arm64" matches an Apple Silicon CPU.
export_dynamic_quantized_onnx_model(model, "arm64", "path/to/my/mpnet-legal-finetuned")
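As a rough follow-up sketch, loading and using the quantized export could look like the following. The file name is an assumption (the export typically writes something like model_qint8_<config>.onnx into an onnx/ subfolder), so check the saved model directory for the actual name:

from sentence_transformers import SentenceTransformer

# Load the dynamically quantized ONNX file explicitly; the exact file name
# below is an assumption, so verify it against the contents of the onnx/ folder.
quantized_model = SentenceTransformer(
    "path/to/my/mpnet-legal-finetuned",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx"},
)

embeddings = quantized_model.encode(["A clause about indemnification."])
print(embeddings.shape)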
|
Awesome! Thanks Tom. To clarify, it was the speed of inference that halved, not the accuracy (I didn't check). That model2vec thing sounds incredible. Do they give a good explanation on their GitHub of why it's so fast?
I think so, yeah: https://github.com/MinishLab/model2vec
As a tl;dr: M2V models don't have any Transformer layers. They are just token/word embeddings that get averaged. The token/word embeddings are "distilled" from a Sentence Transformer model, which is why the approach is better than e.g. GloVe or word2vec.
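As a rough illustration of that distillation step (the model name and pca_dims value are only placeholders; see the model2vec README for the current API), a minimal sketch might be:

from model2vec.distill import distill

# Distill a Sentence Transformer into a static, Transformer-free model:
# the output token embeddings are captured once, optionally reduced with PCA,
# and stored so that encoding becomes a lookup-and-average over tokens.
m2v_model = distill(model_name="sentence-transformers/all-mpnet-base-v2", pca_dims=256)

embeddings = m2v_model.encode(["Static embeddings are fast on CPU."])
m2v_model.save_pretrained("path/to/my/m2v-distilled")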
Machine Specs
Device: MacBook Pro
CPU: Apple M3 Pro
Memory: 18GB
OS: macOS Sonoma
Problem
I'm working with one of your 80MB models; the embeddings are great, but inference could be faster for my use case. I want to quantise the model to 8 bits so it runs faster. I've tried to do that with this code:
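A representative attempt (the model name is a placeholder and this is only a sketch of standard PyTorch dynamic quantization, not necessarily the exact code used) might be:

import torch
from sentence_transformers import SentenceTransformer

# Placeholder model name; the actual ~80MB model isn't named here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Standard PyTorch dynamic quantization: swap nn.Linear layers for int8 versions.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

embeddings = quantized_model.encode(["How fast is the quantised model?"])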
To my surprise, this halved the model's performance!
I've searched your docs, but I can't find anything on the best way to quantise your models. Is there a standard approach I should be following?