fp8 support for old GPUs #489
Hi! I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80), from here. Do you have any plans to support fp8 in BatchDecodeWithPagedKVCacheWrapper, or should the data type just be set to fp16 during the decode stage on the vLLM side, like here?

Comments
This is not true; the code is compatible with later GPUs such as Ada (sm_89) and Hopper (sm_90). What I meant in that PR is that we haven't used the fp8 hardware acceleration units in newer GPUs yet. The fp8 prefill kernels optimized for Hopper will land together with FA3 (#369), please stay tuned.
Sorry for the misleading wording. I know it's compatible with later GPUs; I only have an Ampere GPU for testing. When I set kv_cache_dtype="fp8_e5m2" in the latest vLLM, it raises the following error because of this line.
If I comment that line out, everything works well. I found that self.decode_wrapper is a BatchDecodeWithPagedKVCacheWrapper, so I wonder whether BatchDecodeWithPagedKVCacheWrapper will support fp8 on Ampere GPUs?
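For context, here is a minimal sketch of the configuration being described, assuming a vLLM build that selects the FlashInfer attention backend via the VLLM_ATTENTION_BACKEND environment variable; the model name and prompt are placeholders, not taken from the thread:

```python
import os

# Assumption: at the time, vLLM picked the FlashInfer backend through this
# environment variable; the exact mechanism may differ between vLLM versions.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8_e5m2" stores the paged KV cache in fp8 while the model
# weights and activations stay in their usual dtype (e.g. fp16).
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder model
    kv_cache_dtype="fp8_e5m2",
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```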
@JaheimLee This looks like a vLLM implementation issue. We also integrated the FlashInfer FP8 KV cache in SGLang, and it runs normally on sm80; see sgl-project/sglang#1204. You may try it!
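As an illustration of what the FlashInfer FP8 KV cache looks like at the wrapper level, here is a rough sketch of a decode call with fp16 queries and an fp8_e5m2 paged KV cache. The head counts, page sizes, and shapes are made up, and the plan()/run() methods with their data_type/q_data_type arguments follow one FlashInfer release (older releases exposed begin_forward()/forward() instead), so treat this as a sketch rather than an exact recipe:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_seq = 2, 4
total_pages = batch_size * pages_per_seq

# 128 MB workspace buffer used by the wrapper for scheduling metadata.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Page table: each sequence owns `pages_per_seq` full pages.
kv_indptr = torch.arange(0, batch_size + 1, device="cuda", dtype=torch.int32) * pages_per_seq
kv_indices = torch.arange(0, total_pages, device="cuda", dtype=torch.int32)
kv_last_page_len = torch.full((batch_size,), page_size, device="cuda", dtype=torch.int32)

# Paged KV cache stored in fp8_e5m2 (NHD layout: pages, K/V, page_size, heads, dim);
# queries stay in fp16.
kv_cache = torch.randn(
    total_pages, 2, page_size, num_kv_heads, head_dim,
    device="cuda", dtype=torch.float16,
).to(torch.float8_e5m2)
q = torch.randn(batch_size, num_qo_heads, head_dim, device="cuda", dtype=torch.float16)

# Argument names here (data_type / q_data_type) follow one FlashInfer release;
# they tell the planner that the cache is fp8 while the query is fp16.
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float8_e5m2,
    q_data_type=torch.float16,
)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim], fp16
```

The prefill-side BatchPrefillWithPagedKVCacheWrapper mentioned in the original question is driven in roughly the same way, with an additional qo_indptr describing the query lengths of each request.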
ok, thanks