
Question about inference time (resource-tables): TPS of quantized LLM < non-quantized LLM #1302

Closed
prise6 opened this issue Apr 16, 2024 · 5 comments


prise6 commented Apr 16, 2024

I'm wondering why inference is slower (lower tokens/sec) for quantized models compared to the non-quantized LLM?

Source: resource-tables

| Size | Model   | Quantization | GPU      | Max GPU RAM | Token/sec |
|------|---------|--------------|----------|-------------|-----------|
| 7 B  | Llama 2 | None         | 1 x A100 | 13.52 GB    | 30.97     |
| 7 B  | Llama 2 | bnb.nf4      | 1 x A100 | 4.57 GB     | 19.98     |
| 7 B  | Llama 2 | bnb.nf4-dq   | 1 x A100 | 4.26 GB     | 17.3      |

Is it because bnb.nf4 is not the correct quantization method for inference?
References: https://huggingface.co/blog/overview-quantization-transformers#Diving-into-speed-benchmarks

Thanks for your replies :)

@Andrei-Aksionov
Contributor

Hello @prise6

For every forward pass, bitsandbytes (BNB) has to dequantize the weights, which naturally reduces the token-generation speed.
BNB is aimed at memory reduction, and it supports training, which is why it is quite popular.
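
Just to illustrate the pattern, here is a toy sketch in plain PyTorch (8-bit with per-row scales, not BNB's actual NF4 kernels) of where the extra work comes from: the weights live in a low-bit form plus scales, and every forward pass pays for a dequantization step before the usual matmul.

```python
import torch

def quantize_per_row(w: torch.Tensor):
    # Symmetric 8-bit quantization with one scale per output row (toy example).
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

class ToyQuantLinear(torch.nn.Module):
    def __init__(self, w: torch.Tensor):
        super().__init__()
        q, scale = quantize_per_row(w)
        self.register_buffer("q", q)          # low-bit weights: the memory win
        self.register_buffer("scale", scale)  # per-row scales

    def forward(self, x):
        w = self.q.float() * self.scale       # dequantize on EVERY forward pass
        return x @ w.t()                      # then the normal matmul

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
print(ToyQuantLinear(w)(x).shape)  # torch.Size([1, 4096])
```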

If you want reduced memory consumption with the same tokens/sec as the non-quantized model, I recommend looking at AutoGPTQ. You can find a table with a speed comparison in #924.

LitGPT doesn't support AutoGPTQ at the moment, since it's more focused on the training side of things.
But you can convert a LitGPT model (after pretraining/finetuning) into the HuggingFace format and then apply AutoGPTQ to it, roughly as sketched below.
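
A rough sketch of that second step (the paths and the calibration text are placeholders, and it assumes the checkpoint was already converted to the HuggingFace format with LitGPT's conversion utility; see the LitGPT docs for the exact command):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "path/to/converted_hf_checkpoint"  # placeholder
quantized_model_dir = "path/to/converted_hf_checkpoint-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A handful of tokenized calibration samples; a real run would use a few
# hundred examples from a representative dataset.
examples = [tokenizer("Quantization trades a bit of accuracy for memory.")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # group size for the quantization statistics
    desc_act=False,  # faster inference at a small accuracy cost
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)               # calibration happens here
model.save_quantized(quantized_model_dir)
```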

@prise6
Author

prise6 commented Apr 16, 2024

Thank you @Andrei-Aksionov for your reply, I appreciate it.

Is GPTQ also a quantization method that needs to dequantize weights during inference?

If I want higher TPS, I should try a quantized LLM that doesn't dequantize weights during inference, right? Something like fp8?

@Andrei-Aksionov
Contributor

Quantization leads to a loss in accuracy, especially at 4-bit precision, so there are different approaches to mitigate it.
With GPTQ you need to do post-training quantization with a calibration step (more about it here), while with BNB you don't. BNB uses more sophisticated algorithms to achieve good precision without calibration, but that adds an overhead at inference time. GPTQ avoids that overhead thanks to its calibration process, which makes it a bit faster at inference.

> Is GPTQ also a quantization method that needs to dequantize weights during inference?

Yes, it stores weights in a quantized form and then dequantizes them during the forward pass.
I'm not 100% sure, but I think all 4-bit quantization algorithms do this in order not to lose too much of the original information.

> If I want higher TPS, I should try a quantized LLM that doesn't dequantize weights during inference, right? Something like fp8?

Yes, that should help, though I haven't tried it personally 🙂.


There is a new quantization approach in town: HQQ. Again, I haven't tried it, but the numbers look impressive, plus it looks like it supports torch.compile, which should also speed up token generation quite a bit.
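
For reference, the torch.compile part is the standard PyTorch 2.x mechanism; a minimal sketch of that pattern on a plain model (not HQQ's actual integration, which I haven't tried) looks like this:

```python
import torch

# Compiling the model's forward can fuse operations (e.g. dequantize + matmul)
# and cut per-token Python overhead during generation.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 11008),
    torch.nn.GELU(),
    torch.nn.Linear(11008, 4096),
).eval()

compiled_model = torch.compile(model)  # first call triggers compilation

x = torch.randn(1, 4096)
with torch.no_grad():
    y = compiled_model(x)  # subsequent calls reuse the compiled graph
print(y.shape)
```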

@rasbt
Contributor

rasbt commented Apr 16, 2024

As you explain, @Andrei-Aksionov, there is usually an extra computational overhead when using quantization. You basically trade off compute for memory.

@prise6
Author

prise6 commented Apr 16, 2024

Thanks for all this info! Very clear. To sum up my understanding:

Saying that quantization speeds up throughput compared to the original LLM (on the same GPU) is not really true. It's even mostly the contrary, according to your benchmark, for example.

Still, it seems to depend on LLM size, batch size, sequence length, backend engine... discussion, optimum bench

Moreover, new quantization approaches should be taken into account.

prise6 closed this as completed Apr 16, 2024