How to use low-bit KV Cache in flashinfer? #125
Comments
I haven't exposed low-bit KV-Cache in the PyTorch APIs yet (it is available in the C++ APIs); will do it tomorrow :)
Glad to hear that! Can't wait to try it out. I think quantizing the KV cache from float16/bfloat16 to 4 bits will need calibration. It would be better if the feature were released with a demo and benchmark results (latency, throughput, or accuracy). BTW, someone is already trying to port flashinfer to vLLM (see #2772) to speed up the decode phase. I also ported FlashAttention to vLLM (see #2744) and plan to benchmark FA and flashinfer in the vLLM framework.
Thanks for letting me know, it's interesting to see that FlashAttention is starting to support paged kv-cache.
You can check our manuscript: Atom: Low-bit Quantization for Efficient and Accurate LLM Serving.
PyTorch APIs for fp8 kv-cache are exposed in #156. I'm finalizing the int4/int8 fused-dequant attention kernels with some optimizations such as fast int4/int8-to-float16 conversions. I expect to merge these changes by this Thursday.
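For readers unfamiliar with the term, a fused-dequant kernel reads packed low-bit values and converts them back to floating point inside the attention computation. The sketch below only illustrates the storage format such a kernel has to undo. It is not FlashInfer's implementation or API: the function names `quantize_int4` and `dequantize_int4`, the per-group min/max scaling, and the low-nibble-first packing order are all assumptions made for illustration.

```python
# Illustrative only: plain-Python int4 packing, NOT FlashInfer code.
# Assumed scheme: asymmetric per-group quantization, two codes per byte.

def quantize_int4(values):
    """Quantize one group of floats to 4-bit codes packed two per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 16 levels; fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even count so codes pair up into bytes
    # Low nibble holds the first code, high nibble the second (assumed order).
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_int4(packed, scale, lo, n):
    """Unpack n 4-bit codes and map them back to floats."""
    out = []
    for b in packed:
        out.append((b & 0xF) * scale + lo)
        out.append((b >> 4) * scale + lo)
    return out[:n]  # drop the padding code, if any


group = [-1.0, -0.5, 0.0, 0.25, 1.0]
packed, scale, lo = quantize_int4(group)
restored = dequantize_int4(packed, scale, lo, len(group))
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`). A real kernel would perform the unpacking on the GPU right before the dot products instead of materializing a float copy of the cache.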
Hi @yzh119, regarding what is mentioned in https://flashinfer.ai/2024/02/02/introduce-flashinfer.html:
when is Atom quantization expected to be fully integrated into FlashInfer? Is there a detailed timeline available? Thanks.
Hi, is there any plan to integrate the 4-bit fused dequantize+attention operators proposed in Atom into FlashInfer? Looking forward to this new feature.
From the blog I noticed that FlashInfer implements low-precision attention kernels that achieve a speedup nearly linear in the compression ratio (~4x for 4-bit, ~2x for 8-bit). This feature is great, and I would like to try it, but there is no demo or toy code showing how to use it. Could you please share more details?
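To make the compression ratios concrete, here is a back-of-the-envelope calculation of KV-cache size at different bit widths. The model shape (32 layers, 32 KV heads, head dimension 128, 4096 tokens) is a hypothetical example, not taken from the blog, and the calculation ignores the extra bytes needed for quantization scales and zero points.

```python
# Rough KV-cache size estimate; the model shape below is a made-up example.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits):
    """Bytes needed to store K and V for one sequence at the given bit width."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits // 8


shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
fp16_bytes = kv_cache_bytes(bits=16, **shape)
int8_bytes = kv_cache_bytes(bits=8, **shape)
int4_bytes = kv_cache_bytes(bits=4, **shape)

for name, size in [("fp16", fp16_bytes), ("int8", int8_bytes), ("int4", int4_bytes)]:
    print(f"{name}: {size / 2**30:.2f} GiB, {fp16_bytes / size:.0f}x smaller than fp16")
```

Fewer bytes per token means less data to read per decode step, which is where a speedup close to the compression ratio would come from if the kernel is limited by memory bandwidth.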