
fp8 support for old gpus #489

Closed
JaheimLee opened this issue Sep 3, 2024 · 4 comments


JaheimLee commented Sep 3, 2024

Hi! I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80), from here. Do you have any plans to support fp8 in BatchDecodeWithPagedKVCacheWrapper, or will the data type just be set to fp16 during the decode stage on the vLLM side, like here?
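To make that concrete: the fp16 fallback would just mean keeping the paged KV cache in fp8 for memory savings and upcasting it before the fp16 decode kernel consumes it. A rough PyTorch sketch; the shapes and paged layout below are made up for illustration and are not vLLM's actual cache layout:

```python
import torch

# Hypothetical paged KV cache layout: (num_pages, 2, page_size, num_kv_heads, head_dim).
# Stored in fp8_e5m2 to halve KV-cache memory (needs PyTorch >= 2.1 for float8 dtypes).
kv_cache_fp8 = (
    torch.randn(256, 2, 16, 8, 128, device="cuda", dtype=torch.float16)
    .to(torch.float8_e5m2)
)

# Decode-stage fallback: upcast back to fp16 so an ordinary fp16 attention
# kernel can consume the pages; no fp8 compute is involved.
kv_cache_fp16 = kv_cache_fp8.to(torch.float16)
```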

yzh119 (Collaborator) commented Sep 3, 2024

> I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80)

This is not true; the code is compatible with later GPUs such as Ada (sm_89) and Hopper (sm_90). What I meant in that PR is that we haven't yet used the fp8 hardware acceleration units on the newer GPUs.

The fp8 prefill kernels optimized for Hopper will land together with FA3 (#369); please stay tuned.
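To make the distinction concrete, an integration can simply gate on compute capability; this is a purely illustrative sketch, not FlashInfer's actual dispatch logic:

```python
import torch

# Illustrative only: fp8 KV-cache *storage* is usable from sm_80 on, while fp8
# *hardware acceleration* (fp8 tensor cores) only exists on sm_89 (Ada) and sm_90 (Hopper).
major, minor = torch.cuda.get_device_capability()
sm = 10 * major + minor

kv_cache_dtype = torch.float8_e5m2 if sm >= 80 else torch.float16
has_fp8_tensor_cores = sm >= 89  # Ampere can store fp8 but still computes in fp16
print(f"sm_{sm}: kv_cache_dtype={kv_cache_dtype}, fp8 tensor cores={has_fp8_tensor_cores}")
```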

JaheimLee (Author) commented

> > I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80)
>
> This is not true; the code is compatible with later GPUs such as Ada (sm_89) and Hopper (sm_90). What I meant in that PR is that we haven't yet used the fp8 hardware acceleration units on the newer GPUs.
>
> The fp8 prefill kernels optimized for Hopper will land together with FA3 (#369); please stay tuned.

Sorry for the misleading phrasing. I know it's compatible with later GPUs; I only have an Ampere GPU for testing. When I set kv_cache_dtype="fp8_e5m2" in the latest vLLM, it raises the following error because of this line:

File "/data/lijinghui/vllm/vllm/worker/model_runner.py", line 1420, in execute_model
    self.attn_state.begin_forward(model_input)
  File "/data/lijinghui/vllm/vllm/attention/backends/flashinfer.py", line 252, in begin_forward
    model_input.attn_metadata.begin_forward()
  File "/data/lijinghui/vllm/vllm/attention/backends/flashinfer.py", line 347, in begin_forward
    self.decode_wrapper.begin_forward(
  File "/data/miniconda3/envs/ljh_py312/lib/python3.12/site-packages/flashinfer/decode.py", line 530, in plan
    self._wrapper.plan(
RuntimeError: BatchPrefillWithPagedKVCachePyTorchWrapper::Plan(at::Tensor, at::Tensor, at::Tensor, at::Tensor, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, at::Tensor)::<lambda()> failed to dispatch data type Float8_e5m2

If I comment that line out, everything works well. I found that the type of self.decode_wrapper is BatchDecodeWithPagedKVCacheWrapper, so I'm wondering whether BatchDecodeWithPagedKVCacheWrapper will support fp8 on Ampere GPUs?
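For completeness, this is roughly how the error is triggered; the model name below is just a placeholder, and the FlashInfer backend is selected via the VLLM_ATTENTION_BACKEND environment variable:

```python
import os

# Select vLLM's FlashInfer attention backend before importing vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Placeholder model; any model served through the FlashInfer backend with an
# fp8 KV cache hits the same begin_forward/plan path during decode.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", kv_cache_dtype="fp8_e5m2")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
```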

zhyncs (Member) commented Sep 4, 2024

@JaheimLee This is likely an implementation issue on the vLLM side. We have also integrated the FlashInfer FP8 KV cache in SGLang, and it runs normally on sm80; see sgl-project/sglang#1204. You may want to try it!

JaheimLee (Author) commented

> @JaheimLee This is likely an implementation issue on the vLLM side. We have also integrated the FlashInfer FP8 KV cache in SGLang, and it runs normally on sm80; see sgl-project/sglang#1204. You may want to try it!

ok, thanks
