
FP8 KV cache support #10

Closed
HaiShaw opened this issue Jun 4, 2024 · 9 comments · Fixed by #17
Labels
enhancement New feature or request

Comments

HaiShaw commented Jun 4, 2024

Add quantization support for the KV cache, with the scales stored in the state dict.
Static scaling (as with activations) is needed for performance.
Dynamic scaling can be added for completeness.

mgoin added the enhancement label Jun 4, 2024
mgoin (Member) commented Jun 4, 2024

Thanks for the request @HaiShaw, this is a next step to tackle!

We have an example model checkpoint that I made with a one-off script: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV

Specifically, you can see that the checkpoint stores a kv_scale tensor for each attention module:

[screenshot: checkpoint tensor listing showing a kv_scale entry alongside each self_attn module's weights]
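For anyone wanting to verify this layout themselves, here is a small sketch that lists the kv_scale tensors in that repo. It assumes the checkpoint is sharded and ships a model.safetensors.index.json; for a single-file checkpoint you would open model.safetensors with the safetensors library instead.

```python
import json
from huggingface_hub import hf_hub_download

# Download only the (small) index file, not the weights themselves.
index_path = hf_hub_download(
    "nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV",
    "model.safetensors.index.json",
)
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# One kv_scale tensor per attention module, as shown in the screenshot above.
kv_scales = sorted(name for name in weight_map if name.endswith("kv_scale"))
print(kv_scales[:3])
# expect names like model.layers.0.self_attn.kv_scale, one per layer
```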

HaiShaw (Author) commented Jun 5, 2024

@mgoin, great to see that you are already well along on this. Thanks for the details!

zitgit commented Jun 18, 2024

@mgoin Thanks! A quick question: is it possible to reproduce neuralmagic/Meta-Llama-3-70B-Instruct-FP8 by applying the offline static quantization method (model.quantize(examples)) to Meta-Llama-3-70B-Instruct? Additionally, I can't wait to see more details about KV cache quantization.

mgoin (Member) commented Jun 18, 2024

@zitgit Yes, you can reproduce that 70B model by following the dataset example at https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py and replacing the model with whatever you'd like (see the sketch below).

We are working on KV cache quantization in AutoFP8 in #17
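A minimal sketch of that flow with the 70B model swapped in. The calibration dataset, sample count, and sequence length here are placeholders rather than the exact values in example_dataset.py, so follow the pinned commit for the authoritative setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-70B-Instruct"
quantized_model_dir = "Meta-Llama-3-70B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder calibration data; example_dataset.py uses its own chat dataset.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(
    texts, padding=True, truncation=True, max_length=2048, return_tensors="pt"
).to("cuda")

# Static activation scales are calibrated from the examples above.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```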

zitgit commented Jun 19, 2024

@mgoin I really appreciate your reply! However, I have trouble quantizing Llama3-70B, which requires much more memory to process per-tensor and save. Is it possible to quantize part of the model at a time and finally merge the safetensors, using the ignored_layers parameter? Many thanks!

mgoin (Member) commented Jun 19, 2024

@zitgit How much memory are you seeing used? As of the current main branch, it should only require peak memory equivalent to loading the model in its original precision (~140GB), since we immediately quantize the weights and then begin calibrating the activations.

zitgit commented Jun 20, 2024

@mgoin It works on current main! I noticed that del linear.weight is necessary in quantize_weights(). Thank you a lot!
And I'd like to politely ask another question. I noticed the comment on KV cache quantization says some arguments need to match the representation in vLLM. Do both W8A8 and FP8 KV cache inference strongly rely on vLLM, or can I use other engines?
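For anyone hitting the same memory ceiling, here is a rough illustration of the pattern being discussed: quantize each Linear layer's weight and delete the original parameter before moving on, so peak usage stays near the load-time footprint. This is only a sketch of the idea, not AutoFP8's actual quantize_weights() code, and the FP8 dtype handling is an assumption.

```python
import torch

def quantize_weight_inplace(linear: torch.nn.Linear) -> None:
    # Sketch only: per-tensor FP8 quantization of a Linear layer's weight.
    weight = linear.weight.data
    scale = weight.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    qweight = (weight / scale).to(torch.float8_e4m3fn)

    # Free the original FP16/BF16 parameter *before* attaching the quantized
    # copy; otherwise both tensors stay resident and memory roughly doubles.
    del linear.weight
    linear.register_buffer("weight", qweight)
    linear.register_buffer("weight_scale", scale)
```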

mgoin linked a pull request Jun 20, 2024 that will close this issue
mgoin (Member) commented Jun 20, 2024

@zitgit We are focused on format and performance in vLLM, since that is the best open-source inference server with full support for FP8. AFAIK the only other option is TRT-LLM, and it has a custom format that isn't really HF Transformers-compatible; Transformers compatibility is what we are going for here format-wise.

I am going to close this issue since FP8 KV cache is now supported (which was the original request). Please open a new issue if you'd like to continue the conversation, thanks!
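As a usage note, here is a minimal sketch of loading an FP8 checkpoint with an FP8 KV cache through vLLM's offline API. The model id comes from the checkpoint linked earlier in the thread, and the kv_cache_dtype value reflects vLLM releases around this time, so check the current vLLM docs for the exact flag:

```python
from vllm import LLM, SamplingParams

# Assumes an AutoFP8-produced checkpoint that includes per-module kv_scale tensors.
llm = LLM(
    model="nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV",
    kv_cache_dtype="fp8",  # enable the FP8 KV cache path using the stored scales
)

outputs = llm.generate(
    ["The KV cache stores"],
    SamplingParams(max_tokens=32, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```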

mgoin closed this as completed Jun 20, 2024
HaiShaw (Author) commented Jun 24, 2024

@mgoin, it would be nice if you could update the screenshot above - I think input_scale is used in place of act_scale. If possible, could you also show output_scale in it?
