
Question about inference time (resource-tables): TPS of quantized LLM < non-quantized LLM #1302

Closed
prise6 opened this issue Apr 16, 2024 · 5 comments


prise6 commented Apr 16, 2024

I'm wondering why inference is slower (lower tokens/sec) for quantized models compared to the non-quantized LLM?

Source: resource-tables

| Size | Model   | Quantization | GPU      | Max GPU RAM | Token/sec |
|------|---------|--------------|----------|-------------|-----------|
| 7 B  | Llama 2 | None         | 1 x A100 | 13.52 GB    | 30.97     |
| 7 B  | Llama 2 | bnb.nf4      | 1 x A100 | 4.57 GB     | 19.98     |
| 7 B  | Llama 2 | bnb.nf4-dq   | 1 x A100 | 4.26 GB     | 17.3      |

Is it because bnb.nf4 is not the correct quantization method for inference?
References: https://huggingface.co/blog/overview-quantization-transformers#Diving-into-speed-benchmarks

Thanks for your replies :)

@Andrei-Aksionov
Contributor

Hello @prise6

For every forward pass, bitsandbytes (BNB) has to dequantize the weights, which naturally reduces the token-generation speed.
BNB is aimed at memory reduction, and it supports training, which is why it is quite popular.
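
Just to illustrate the pattern, here is a toy sketch in plain PyTorch (8-bit with per-row scales, not BNB's actual NF4 kernels) of where the extra work comes from: the weights live in a low-bit form plus scales, and every forward pass pays for a dequantization step before the usual matmul.

```python
import torch

def quantize_per_row(w: torch.Tensor):
    # Symmetric 8-bit quantization with one scale per output row (toy example).
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

class ToyQuantLinear(torch.nn.Module):
    def __init__(self, w: torch.Tensor):
        super().__init__()
        q, scale = quantize_per_row(w)
        self.register_buffer("q", q)          # low-bit weights: the memory win
        self.register_buffer("scale", scale)  # per-row scales

    def forward(self, x):
        w = self.q.float() * self.scale       # dequantize on EVERY forward pass
        return x @ w.t()                      # then the normal matmul

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
print(ToyQuantLinear(w)(x).shape)  # torch.Size([1, 4096])
```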

If you want reduced memory consumption with the same tokens/sec as the non-quantized model, I recommend looking at AutoGPTQ. You can find a table with a speed comparison in #924.

LitGPT doesn't support AutoGPTQ at the moment, since it's more focused on the training side of things.
But you can convert a LitGPT model (after pretraining/finetuning) into the HuggingFace format and then apply AutoGPTQ to it, roughly as sketched below.
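
A rough sketch of that second step (the paths and the calibration text are placeholders, and it assumes the checkpoint was already converted to the HuggingFace format with LitGPT's conversion utility; see the LitGPT docs for the exact command):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "path/to/converted_hf_checkpoint"  # placeholder
quantized_model_dir = "path/to/converted_hf_checkpoint-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A handful of tokenized calibration samples; a real run would use a few
# hundred examples from a representative dataset.
examples = [tokenizer("Quantization trades a bit of accuracy for memory.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # group size for the quantization statistics
    desc_act=False,  # faster inference at a small accuracy cost
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)               # calibration happens here
model.save_quantized(quantized_model_dir)
```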

@prise6
Author

prise6 commented Apr 16, 2024

Thank you @Andrei-Aksionov for your reply, I appreciate it.

Is GPTQ also a quantization method that needs to dequantize weights during inference?

If I want higher TPS, I should try a quantized LLM that doesn't dequantize weights during inference, right? Something like fp8?

@Andrei-Aksionov
Contributor

Quantization leads to a loss in accuracy, especially at 4-bit precision, so there are different approaches to mitigate it.
With GPTQ you need to do post-training quantization with a calibration step (more about it here), while with BNB you don't. BNB uses more sophisticated algorithms to achieve good precision without calibration, but that adds an overhead at inference time. GPTQ avoids that overhead thanks to its calibration process, which makes it a bit faster at inference.

> Is GPTQ also a quantization method that needs to dequantize weights during inference?

Yes, it stores weights in a quantized form and then dequantizes them during the forward pass.
I'm not 100% sure, but I think all 4-bit quantization algorithms do this in order not to lose too much of the original information.

> If I want higher TPS, I should try a quantized LLM that doesn't dequantize weights during inference, right? Something like fp8?

Yes, that should help, though I haven't tried it personally 🙂.


There is a new quantization approach in town: HQQ. Again, I haven't tried it, but the numbers look impressive, plus it looks like it supports torch.compile, which should also speed up token generation quite a bit.
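
For reference, the torch.compile part is the standard PyTorch 2.x mechanism; a minimal sketch of that pattern on a plain model (not HQQ's actual integration, which I haven't tried) looks like this:

```python
import torch

# Compiling the model's forward can fuse operations (e.g. dequantize + matmul)
# and cut per-token Python overhead during generation.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008),
    torch.nn.GELU(),
    torch.nn.Linear(11008, 4096),
).eval()

compiled_model = torch.compile(model)  # first call triggers compilation

x = torch.randn(1, 4096)
with torch.no_grad():
    y = compiled_model(x)  # subsequent calls reuse the compiled graph
print(y.shape)
```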

@rasbt
Contributor

rasbt commented Apr 16, 2024

As you explain, @Andrei-Aksionov, there is usually an extra computational overhead when using quantization. You basically trade off compute for memory.

@prise6
Author

prise6 commented Apr 16, 2024

Thanks for all this info! Very clear. To sum up my understanding:

Saying that quantization speeds up throughput compared to the original LLM (on the same GPU) is not really true. It's even mostly the contrary, according to your benchmark, for example.

Still, it seems to depend on LLM size, batch size, sequence length, backend engine... discussion, optimum bench

Moreover, new quantization approaches should be taken into account.

prise6 closed this as completed Apr 16, 2024