
fp8 support for old gpus #489

Closed
JaheimLee opened this issue Sep 3, 2024 · 4 comments


JaheimLee commented Sep 3, 2024

Hi! I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80), from here. Do you have any plans to support fp8 in BatchDecodeWithPagedKVCacheWrapper, or will the data type just be set to fp16 during the decode stage on the vLLM side, like here?
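To make that concrete: the fp16 fallback would just mean keeping the paged KV cache in fp8 for memory savings and upcasting it before the fp16 decode kernel consumes it. A rough PyTorch sketch; the shapes and paged layout below are made up for illustration and are not vLLM's actual cache layout:

```python
import torch

# Hypothetical paged KV cache layout: (num_pages, 2, page_size, num_kv_heads, head_dim).
# Stored in fp8_e5m2 to halve KV-cache memory (needs PyTorch >= 2.1 for float8 dtypes).
kv_cache_fp8 = (
    torch.randn(256, 2, 16, 8, 128, device="cuda", dtype=torch.float16)
    .to(torch.float8_e5m2)
)

# Decode-stage fallback: upcast back to fp16 so an ordinary fp16 attention
# kernel can consume the pages; no fp8 compute is involved.
kv_cache_fp16 = kv_cache_fp8.to(torch.float16)
```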

yzh119 (Collaborator) commented Sep 3, 2024

> I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80)

This is not true; the code is compatible with later GPUs such as Ada (sm_89) and Hopper (sm_90). What I meant in that PR is that we haven't yet used the fp8 hardware acceleration units on the newer GPUs.

The fp8 prefill kernels optimized for Hopper will land together with FA3 (#369); please stay tuned.
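To make the distinction concrete, an integration can simply gate on compute capability; this is a purely illustrative sketch, not FlashInfer's actual dispatch logic:

```python
import torch

# Illustrative only: fp8 KV-cache *storage* is usable from sm_80 on, while fp8
# *hardware acceleration* (fp8 tensor cores) only exists on sm_89 (Ada) and sm_90 (Hopper).
major, minor = torch.cuda.get_device_capability()
sm = 10 * major + minor

kv_cache_dtype = torch.float8_e5m2 if sm >= 80 else torch.float16
has_fp8_tensor_cores = sm >= 89  # Ampere can store fp8 but still computes in fp16
print(f"sm_{sm}: kv_cache_dtype={kv_cache_dtype}, fp8 tensor cores={has_fp8_tensor_cores}")
```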

JaheimLee (Author) commented

> > I noticed you only support fp8 in BatchPrefillWithPagedKVCacheWrapper for old GPUs (sm_80)
>
> This is not true; the code is compatible with later GPUs such as Ada (sm_89) and Hopper (sm_90). What I meant in that PR is that we haven't yet used the fp8 hardware acceleration units on the newer GPUs.
>
> The fp8 prefill kernels optimized for Hopper will land together with FA3 (#369); please stay tuned.

Sorry for the misleading phrasing. I know it's compatible with later GPUs; I only have an Ampere GPU for testing. When I set kv_cache_dtype="fp8_e5m2" in the latest vLLM, it raises the following error because of this line:

File "/data/lijinghui/vllm/vllm/worker/model_runner.py", line 1420, in execute_model
    self.attn_state.begin_forward(model_input)
  File "/data/lijinghui/vllm/vllm/attention/backends/flashinfer.py", line 252, in begin_forward
    model_input.attn_metadata.begin_forward()
  File "/data/lijinghui/vllm/vllm/attention/backends/flashinfer.py", line 347, in begin_forward
    self.decode_wrapper.begin_forward(
  File "/data/miniconda3/envs/ljh_py312/lib/python3.12/site-packages/flashinfer/decode.py", line 530, in plan
    self._wrapper.plan(
RuntimeError: BatchPrefillWithPagedKVCachePyTorchWrapper::Plan(at::Tensor, at::Tensor, at::Tensor, at::Tensor, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, at::Tensor)::<lambda()> failed to dispatch data type Float8_e5m2

If I comment that line out, everything works well. I found that the type of self.decode_wrapper is BatchDecodeWithPagedKVCacheWrapper, so I'm wondering whether BatchDecodeWithPagedKVCacheWrapper will support fp8 on Ampere GPUs?
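For completeness, this is roughly how the error is triggered; the model name below is just a placeholder, and the FlashInfer backend is selected via the VLLM_ATTENTION_BACKEND environment variable:

```python
import os

# Select vLLM's FlashInfer attention backend before importing vLLM.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Placeholder model; any model served through the FlashInfer backend with an
# fp8 KV cache hits the same begin_forward/plan path during decode.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", kv_cache_dtype="fp8_e5m2")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
```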

zhyncs (Member) commented Sep 4, 2024

@JaheimLee This is likely an implementation issue on the vLLM side. We have also integrated the FlashInfer FP8 KV cache in SGLang, and it runs normally on sm80; see sgl-project/sglang#1204. You may want to try it!

JaheimLee (Author) commented

> @JaheimLee This is likely an implementation issue on the vLLM side. We have also integrated the FlashInfer FP8 KV cache in SGLang, and it runs normally on sm80; see sgl-project/sglang#1204. You may want to try it!

ok, thanks
