FP8 KV cache support #10
Thanks for the request @HaiShaw, this is a next step to tackle! We have an example model checkpoint here that I made with a one-off script: https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV. Specifically, you can see that in the checkpoint we store a kv_scale tensor for each attention module.
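For reference, here is a minimal sketch of how such a checkpoint could be inspected for the per-module scales. It assumes a safetensors shard has been downloaded locally; the file name and the exact tensor naming pattern are illustrative, not confirmed from the checkpoint itself.

```python
# Sketch: list the kv_scale tensors stored in an FP8 checkpoint shard.
# The file name below is illustrative; download a shard from the
# nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV repo first.
from safetensors import safe_open

with safe_open("model-00001-of-00002.safetensors", framework="pt") as f:
    kv_scales = [name for name in f.keys() if name.endswith("kv_scale")]

# One entry per attention module is expected, e.g. something like
# "model.layers.0.self_attn.kv_scale" (exact names may differ).
print(kv_scales)
```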
@mgoin, great to hear that you are nearly ready for this. Thanks for the details!
@mgoin Thanks! A quick question: is it possible to reproduce neuralmagic/Meta-Llama-3-70B-Instruct-FP8 by applying the offline static quantization method (model.quantize(examples)) to Meta-Llama-3-70B-Instruct? Additionally, can't wait to see more details about KV cache quantization.
@zitgit Yes, you can reproduce that 70B model by following the dataset example and replacing the model with whatever you'd like: https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py. We are working on KV cache quantization in AutoFP8 here: #17
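For anyone following along, a rough sketch of that flow is below, based on the example_dataset.py linked above. The class names, the calibration dataset, and the save method are taken from that example at the referenced commit and should be treated as assumptions if the API has since changed.

```python
# Sketch of static FP8 quantization with a calibration set, following
# AutoFP8's example_dataset.py. Names below are assumed from that example
# and may differ in newer versions of the library.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-70B-Instruct"
quantized_model_dir = "Meta-Llama-3-70B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no pad token by default

# Calibration prompts (dataset name taken from the example script).
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scales require calibration data.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```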
@mgoin I really appreciate your reply! However, I have trouble quantizing Llama3-70B, which requires much more memory to process and save per tensor. Is it possible to quantize part of the model at a time and finally merge the safetensors using the `ignored_layers` parameter? Many thanks!
@zitgit How much memory are you seeing used? As of current main, it should only require peak memory equivalent to loading the model in original precision (~140GB), as we immediately quantize the weights and then begin calibration of the activations.
@mgoin It works on current main! I noticed that `del linear.weight` is necessary in `quantize_weights()`. Thanks a lot!!
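For context on why peak memory stays near the original-precision footprint, here is an illustrative sketch of the quantize-then-free pattern being discussed. This is not AutoFP8's actual `quantize_weights()` implementation, just the idea behind it.

```python
# Illustrative sketch: quantize each Linear weight to FP8 and immediately
# drop the original tensor, so peak memory stays close to the footprint of
# the unquantized model. Not AutoFP8's real code, just the pattern.
import torch

def quantize_weight_per_tensor(weight: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale.float()

def quantize_linear_inplace(linear: torch.nn.Linear):
    qweight, weight_scale = quantize_weight_per_tensor(linear.weight.data)
    del linear.weight                      # free the fp16/bf16 tensor right away
    linear.register_buffer("weight", qweight)
    linear.register_buffer("weight_scale", weight_scale)
```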
@zitgit We are focused on the format and performance in vLLM, since that is the best open-source inference server with full support for FP8. AFAIK the only other option is TRT-LLM, and it has a custom format that isn't really HF Transformers-compatible, whereas compatibility is what we are going for here format-wise. I am going to close this issue since FP8 KV cache is now supported (which was the original request). Please open a new issue if you'd like to continue the conversation, thanks!
@mgoin, it would be nice if you could update the screenshot above - I think
To add quantization support for the KV cache, the scales need to be stored in the state dict.
Static scales (as for activations) are needed for performance.
Dynamic scales can be added for completeness.
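A minimal sketch of the difference between the two schemes is below. The function names and shapes are for exposition only, not the actual vLLM or AutoFP8 interfaces.

```python
# Illustrative sketch of static vs. dynamic KV-cache scaling.
# Names are hypothetical; this only shows where the scale comes from.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def dynamic_kv_scale(kv: torch.Tensor) -> torch.Tensor:
    # Computed on the fly every step; flexible but adds a runtime reduction.
    return kv.abs().max().clamp(min=1e-12) / FP8_MAX

def static_kv_scale(calibration_kvs: list[torch.Tensor]) -> torch.Tensor:
    # Calibrated offline over sample prompts and stored in the checkpoint
    # (e.g. as a per-attention-module kv_scale tensor), so the attention
    # kernel can use a fixed scale with no per-step work.
    observed_max = max(kv.abs().max() for kv in calibration_kvs)
    return (observed_max.clamp(min=1e-12) / FP8_MAX).float()

def quantize_kv(kv: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (kv / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
```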