Is it possible to Quantize Sentence Transformer models? #2968
Comments
Hello!
Are you referring to throughput, i.e. inference speed, or to the evaluation performance on a benchmark of yours? Something to note is that while int8 is commonly used for LLMs, it's primarily used to shrink memory usage (at least, to my knowledge). Beyond that, I'm not very familiar with it. Another thing to consider is that a GPU might have solid int8 operations while a CPU might not, i.e. it might be faster on GPU but slower on CPU. I actually think this is the big difference. The upcoming release will introduce some more options for speeding up your models.
In the meantime, you can experiment with those PRs if you're interested (you can install them directly with pip from the corresponding PR branches).
Hello! I've added native ONNX support in Sentence Transformers, so users can now look at the Speeding up Inference documentation. Specifically, the "Quantizing ONNX Models" section should be valuable, as it allows for int8 quantization, which is much faster on CPUs:

from sentence_transformers import SentenceTransformer, export_dynamic_quantized_onnx_model

model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="onnx")
# export_dynamic_quantized_onnx_model expects a quantization config such as
# "arm64", "avx2", "avx512", or "avx512_vnni" (not an optimization level like "O3");
# "arm64" matches an Apple Silicon CPU.
export_dynamic_quantized_onnx_model(model, "arm64", "path/to/my/mpnet-legal-finetuned")
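As a rough follow-up sketch, loading and using the quantized export could look like the following. The file name is an assumption (the export typically writes something like model_qint8_<config>.onnx into an onnx/ subfolder), so check the saved model directory for the actual name:

from sentence_transformers import SentenceTransformer

# Load the dynamically quantized ONNX file explicitly; the exact file name
# below is an assumption, so verify it against the contents of the onnx/ folder.
quantized_model = SentenceTransformer(
    "path/to/my/mpnet-legal-finetuned",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_arm64.onnx"},
)

embeddings = quantized_model.encode(["A clause about indemnification."])
print(embeddings.shape)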
|
Awesome! Thanks Tom. To clarify, it was the speed of inference that halved, not the accuracy (I didn't check). That model2vec thing sounds incredible. Do they give a good explanation on their GitHub of why it's so fast?
I think so, yeah: https://github.com/MinishLab/model2vec
As a tl;dr: M2V models don't have any Transformer layers. They are just token/word embeddings that get averaged. The token/word embeddings are "distilled" from a Sentence Transformer model, which is why the approach is better than e.g. GloVe or word2vec.
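As a rough illustration of that distillation step (the model name and pca_dims value are only placeholders; see the model2vec README for the current API), a minimal sketch might be:

from model2vec.distill import distill

# Distill a Sentence Transformer into a static, Transformer-free model:
# the output token embeddings are captured once, optionally reduced with PCA,
# and stored so that encoding becomes a lookup-and-average over tokens.
m2v_model = distill(model_name="sentence-transformers/all-mpnet-base-v2", pca_dims=256)

embeddings = m2v_model.encode(["Static embeddings are fast on CPU."])
m2v_model.save_pretrained("path/to/my/m2v-distilled")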
Machine Specs
Device: MacBook Pro
CPU: Apple M3 Pro
Memory: 18GB
OS: macOS Sonoma
Problem
I'm working with one of your 80MB models; the embeddings are great, but inference could be faster for my use case. I want to quantise the model to 8 bits so it runs faster. I've tried to do that with this code:
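A representative attempt (the model name is a placeholder and this is only a sketch of standard PyTorch dynamic quantization, not necessarily the exact code used) might be:

import torch
from sentence_transformers import SentenceTransformer

# Placeholder model name; the actual ~80MB model isn't named here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Standard PyTorch dynamic quantization: swap nn.Linear layers for int8 versions.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

embeddings = quantized_model.encode(["How fast is the quantised model?"])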
To my surprise, this halved the model's performance!
I've searched your docs, but I can't find anything on the best way to quantise your models. Is there a standard approach I should be following?