Question about inference time (resource-tables): TPS of quantized LLM < non-quantized LLM #1302

I'm wondering why inference throughput (tokens per second) is lower for quantized models than for the non-quantized LLM.

Source: resource-tables

Is it because bnb.nf4 is not the right quantization method for inference?

References: https://huggingface.co/blog/overview-quantization-transformers#Diving-into-speed-benchmarks

Thanks for your replies :)

Comments
Hello @prise6. For every forward pass, bitsandbytes (BNB) has to dequantize the weights, which naturally reduces token-generation speed. If you want the memory reduction but the same TPS as the non-quantized model, I recommend looking at AutoGPTQ; you can find a speed-comparison table in #924. LitGPT doesn't support AutoGPTQ at the moment, since it's more focused on the training side of things.
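If you want to reproduce this on your own hardware, here is a minimal sketch of the kind of comparison involved, written against the Hugging Face `transformers` API (which goes through the same bitsandbytes NF4 path). The model id, prompt, and token counts are placeholders, and it assumes a CUDA GPU with enough memory:

```python
# Rough tokens-per-second comparison: bf16 weights vs. bitsandbytes NF4.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer("Quantization is", return_tensors="pt").to("cuda")

def tokens_per_second(model, max_new_tokens=128):
    # Warm-up pass, then time a single greedy generation.
    model.generate(**prompt, max_new_tokens=8)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**prompt, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    generated = out.shape[1] - prompt["input_ids"].shape[1]
    return generated / (time.perf_counter() - start)

# Baseline: bf16 weights, no quantization.
fp_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
print("bf16 tok/s:", tokens_per_second(fp_model))
del fp_model
torch.cuda.empty_cache()

# bitsandbytes NF4: weights stored in 4 bit, dequantized on every forward pass.
nf4_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="cuda",
)
print("nf4 tok/s:", tokens_per_second(nf4_model))
```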
Thank you @Andrei-Aksionov for your reply, I appreciate it. Is GPTQ also a quantization method that needs to dequantize weights during inference? If I want higher TPS, I should try a quantized LLM that doesn't dequantize weights at inference time, right? Something like fp8?
Quantization leads to a loss in accuracy, especially at 4-bit precision, so there are different approaches to mitigate it.

Yes, GPTQ also stores weights in a quantized form and then dequantizes them during the forward pass.

Yes, that should help, though I haven't tried it personally 🙂. There is also a new quantization approach in town: HQQ. Again, I haven't tried it, but the numbers look impressive, plus it looks like it supports …
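To make the "dequantize during the forward pass" point concrete, here is a toy, hypothetical weight-only quantized layer in plain PyTorch. Real NF4/GPTQ kernels are 4-bit, block-wise, and fused rather than this simple int8 scheme, but the extra per-forward dequantization step is the same idea:

```python
# Toy weight-only quantization: weights live in int8 with a per-row scale,
# and are dequantized back to the compute dtype inside forward().
import torch
import torch.nn as nn

class QuantizedLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                       # (out, in)
        self.scale = w.abs().amax(dim=1, keepdim=True) / 127.0       # per-row scale
        self.qweight = torch.round(w / self.scale).to(torch.int8)    # 1 byte per weight
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly: this extra work happens on *every* forward pass.
        w = self.qweight.to(x.dtype) * self.scale.to(x.dtype)
        return torch.nn.functional.linear(x, w, self.bias)

layer = nn.Linear(4096, 4096)
qlayer = QuantizedLinear(layer)
x = torch.randn(1, 4096)
print((layer(x) - qlayer(x)).abs().max())  # small quantization error
```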
As you explained, @Andrei-Aksionov, there is usually extra computational overhead when using quantization: you basically trade compute for memory.
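As a back-of-the-envelope example of that trade-off, assuming a hypothetical 7B-parameter model:

```python
# Rough memory math for a hypothetical 7B-parameter model.
params = 7e9

bf16_gb = params * 2 / 1e9   # 2 bytes per weight
nf4_gb = params * 0.5 / 1e9  # 4 bits per weight, ignoring block-wise scales
print(f"bf16 weights: ~{bf16_gb:.0f} GB, nf4 weights: ~{nf4_gb:.1f} GB")
# -> roughly 14 GB vs ~3.5 GB (plus small overhead for quantization constants),
#    which is what the per-forward dequantization cost buys you.
```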
Thanks for all this info, very clear! To sum up my understanding: saying that quantization speeds up throughput compared to the original LLM (on the same GPU) is not really true; it's mostly the opposite, according to your benchmark, for example. Still, it seems to depend on the LLM size, batch size, sequence length, backend engine, and so on (discussion, optimum bench). Moreover, new quantization approaches should also be taken into account.