FP8 KV cache support #10
Thanks for the request @HaiShaw, this is a next step to tackle! We have an example model checkpoint here that I made with a one-off script: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV. Specifically, you can see that in the checkpoint we store a kv_scale tensor for each attention module.
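For reference, here is a minimal sketch of how such a checkpoint could be inspected for the per-module scales. It assumes a safetensors shard has been downloaded locally; the file name and the exact tensor naming pattern are illustrative, not confirmed from the checkpoint itself.

```python
# Sketch: list the kv_scale tensors stored in an FP8 checkpoint shard.
# The file name below is illustrative; download a shard from the
# nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV repo first.
from safetensors import safe_open

with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
    kv_scales = [name for name in f.keys() if name.endswith("kv_scale")]

# One entry per attention module is expected, e.g. something like
# "model.layers.0.self_attn.kv_scale" (exact names may differ).
print(kv_scales)
```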
@mgoin, great to hear that you are nearly ready for this. Thanks for the details!
@mgoin Thanks! A quick question: is it possible to reproduce neuralmagic/Meta-Llama-3-70B-Instruct-FP8 by applying the offline static quantization method (model.quantize(examples)) to Meta-Llama-3-70B-Instruct? Additionally, can't wait to see more details about KV cache quantization.
@zitgit Yes, you can reproduce that 70B model by following the dataset example and replacing the model with whatever you'd like: https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py. We are working on KV cache quantization in AutoFP8 here: #17
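For anyone following along, a rough sketch of that flow is below, based on the example_dataset.py linked above. The class names, the calibration dataset, and the save method are taken from that example at the referenced commit and should be treated as assumptions if the API has since changed.

```python
# Sketch of static FP8 quantization with a calibration set, following
# AutoFP8's example_dataset.py. Names below are assumed from that example
# and may differ in newer versions of the library.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-70B-Instruct"
quantized_model_dir = "Meta-Llama-3-70B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no pad token by default

# Calibration prompts (dataset name taken from the example script).
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scales require calibration data.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```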
@mgoin I really appreciate your reply! However, I have trouble quantizing Llama3-70B, which requires much more memory to process and save per tensor. Is it possible to quantize part of the model at a time and finally merge the safetensors using the `ignored_layers` parameter? Many thanks!
@zitgit How much memory are you seeing used? As of current main, it should only require peak memory equivalent to loading the model in original precision (~140GB), as we immediately quantize the weights and then begin calibration of the activations.
@mgoin It works on current main! I noticed that `del linear.weight` is necessary in `quantize_weights()`. Thanks a lot!!
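For context on why peak memory stays near the original-precision footprint, here is an illustrative sketch of the quantize-then-free pattern being discussed. This is not AutoFP8's actual `quantize_weights()` implementation, just the idea behind it.

```python
# Illustrative sketch: quantize each Linear weight to FP8 and immediately
# drop the original tensor, so peak memory stays close to the footprint of
# the unquantized model. Not AutoFP8's real code, just the pattern.
import torch

def quantize_weight_per_tensor(weight: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale.float()

def quantize_linear_inplace(linear: torch.nn.Linear):
    qweight, weight_scale = quantize_weight_per_tensor(linear.weight.data)
    del linear.weight                      # free the fp16/bf16 tensor right away
    linear.register_buffer("weight", qweight)
    linear.register_buffer("weight_scale", weight_scale)
```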
@zitgit We are focused on the format and performance in vLLM, since that is the best open-source inference server with full support for FP8. AFAIK the only other option is TRT-LLM, and it has a custom format that isn't really HF Transformers-compatible, whereas compatibility is what we are going for here format-wise. I am going to close this issue since FP8 KV cache is now supported (which was the original request). Please open a new issue if you'd like to continue the conversation, thanks!
@mgoin, it would be nice if you could update the screenshot above - I think
To add quantization support for the KV cache, the scales need to be stored in the state dict.
Static scales (as for activations) are needed for performance.
Dynamic scales can be added for completeness.
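A minimal sketch of the difference between the two schemes is below. The function names and shapes are for exposition only, not the actual vLLM or AutoFP8 interfaces.

```python
# Illustrative sketch of static vs. dynamic KV-cache scaling.
# Names are hypothetical; this only shows where the scale comes from.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def dynamic_kv_scale(kv: torch.Tensor) -> torch.Tensor:
    # Computed on the fly every step; flexible but adds a runtime reduction.
    return kv.abs().max().clamp(min=1e-12) / FP8_MAX

def static_kv_scale(calibration_kvs: list[torch.Tensor]) -> torch.Tensor:
    # Calibrated offline over sample prompts and stored in the checkpoint
    # (e.g. as a per-attention-module kv_scale tensor), so the attention
    # kernel can use a fixed scale with no per-step work.
    observed_max = max(kv.abs().max() for kv in calibration_kvs)
    return (observed_max.clamp(min=1e-12) / FP8_MAX).float()

def quantize_kv(kv: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (kv / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
```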