Add CUDA kernel for per_token_group_quant_fp8 #14175
Conversation
where does the 512.f come from?
I copied this from `dynamic_per_token_scaled_fp8_quant_kernel` (vllm/csrc/quantization/fp8/common.cu, line 33 at f89978a):

    float const min_scaling_factor = 1.0f / (FP8_E4M3_MAX * 512.f);
Signed-off-by: mgoin <mgoin64@gmail.com>
Force-pushed 7905345 to 314d1a8
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Seeing a minor boost over #14476 when combined with that.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
Currently failing for some shapes.