
Conversation

@mgoin (Member) commented Mar 4, 2025

Currently failing for some shapes:

E       AssertionError: assert False
E        +  where False = <built-in method allclose of type object at 0x76f6c14c5280>(tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,...      [112., 384., 256.,  ..., 384.,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0'), tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,...      [112., 384., 256.,  ..., 384.,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0'), rtol=0.15)
E        +    where <built-in method allclose of type object at 0x76f6c14c5280> = torch.allclose
E        +    and   tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,...      [112., 384., 256.,  ..., 384.,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0') = <built-in method to of Tensor object at 0x76f52b91a7b0>(torch.float32)
E        +      where <built-in method to of Tensor object at 0x76f52b91a7b0> = tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,....,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0',\n       dtype=torch.float8_e4m3fn).to
E        +      and   torch.float32 = torch.float32
E        +    and   tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,...      [112., 384., 256.,  ..., 384.,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0') = <built-in method to of Tensor object at 0x76f52ba25040>(torch.float32)
E        +      where <built-in method to of Tensor object at 0x76f52ba25040> = tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],\n        [448., 288., 224.,  ..., 256.,  52., 320.],\n        [ 96.,....,  40., 288.],\n        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0',\n       dtype=torch.float8_e4m3fn).to
E        +      and   torch.float32 = torch.float32

tests/kernels/test_block_fp8.py:198: AssertionError
--------------------------------------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------------------------------------
num_tokens=83, d=13824, dtype=torch.bfloat16, group_size=64, seed=0
ref_out tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],
        [448., 288., 224.,  ..., 256.,  52., 320.],
        [ 96., 144., 256.,  ..., 288.,  80., 320.],
        ...,
        [128., 416.,  72.,  ..., 384., 352., 128.],
        [112., 384., 256.,  ..., 384.,  40., 288.],
        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0')
out     tensor([[176., 240.,  11.,  ..., 256.,  11., 384.],
        [448., 288., 224.,  ..., 256.,  52., 320.],
        [ 96., 144., 256.,  ..., 288.,  80., 320.],
        ...,
        [128., 416.,  72.,  ..., 384., 352., 128.],
        [112., 384., 256.,  ..., 384.,  40., 288.],
        [416., 144., 416.,  ...,  88., 288., 256.]], device='cuda:0')
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')
tensor(32., device='cuda:0')
======================================================================================================= short test summary info ========================================================================================================
FAILED tests/kernels/test_block_fp8.py::test_cuda_per_token_group_quant_fp8[83-13824-dtype28-64-0] - AssertionError: assert False
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================================================= 1 failed, 28 passed, 208 deselected in 7.64s =============================================================================================
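For context, here is a minimal torch sketch of what per-token-group FP8 quantization computes. This is my paraphrase of the reference path the test compares the CUDA kernel against, not the actual code in tests/kernels/test_block_fp8.py; the helper name and the eps floor on the scale are assumptions.

import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

def ref_per_token_group_quant_fp8(x: torch.Tensor, group_size: int):
    """Hypothetical reference: quantize each contiguous group of
    `group_size` elements along the last dim with its own scale."""
    num_tokens, d = x.shape
    assert d % group_size == 0
    x_grouped = x.reshape(num_tokens, d // group_size, group_size).float()
    # One scale per (token, group): group amax divided by the fp8 max.
    amax = x_grouped.abs().amax(dim=-1, keepdim=True)
    scale = (amax / FP8_MAX).clamp(min=1e-10)  # eps floor is an assumption
    q = (x_grouped / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return q.reshape(num_tokens, d), scale.squeeze(-1)

In the failing case above (num_tokens=83, d=13824, group_size=64) that is 83 x 216 independent scales, and the test compares the kernel's quantized output against the reference output, both cast back to float32, with torch.allclose(..., rtol=0.15).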

github-actions bot commented Mar 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 4, 2025
A reviewer (Member) asked:
where does the 512.f come from?

@mgoin (Member Author) replied:

I copied this from dynamic_per_token_scaled_fp8_quant_kernel

float const min_scaling_factor = 1.0f / (FP8_E4M3_MAX * 512.f);
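For reference, a hedged torch sketch (not the actual CUDA code) of what that clamp does to the scale computation; with FP8_E4M3_MAX = 448 for e4m3fn, the floor works out to 1 / (448 * 512) ≈ 4.36e-6.

import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
MIN_SCALING_FACTOR = 1.0 / (FP8_E4M3_MAX * 512.0)    # ~4.36e-6

def compute_scale(amax: torch.Tensor) -> torch.Tensor:
    # Scale is amax / fp8_max, clamped from below so an all-zero (or
    # near-zero) token/group never produces scale == 0 and an inf/nan
    # when the input is later divided by the scale.
    return torch.clamp(amax / FP8_E4M3_MAX, min=MIN_SCALING_FACTOR)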

mgoin and others added 2 commits March 8, 2025 01:29
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the per_token_group_quant_fp8-cuda-kernel branch from 7905345 to 314d1a8 on March 8, 2025 02:23
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson (Collaborator) commented:

Seeing a minor boost over #14476 when combined with it:

  backend  input_tokens  output_tokens  output_toks/s     req/s  median_itl_ms  median_ttft_ms
3    vllm          1000           1000     982.249267  0.982249      42.002411     2403.797040
2    vllm          5000           1000     521.687045  0.521687      38.893414     6001.047062
4    vllm         10000           1000     331.643844  0.331644      36.228126    54635.778229
1    vllm         32000           1000     113.265287  0.113265      36.442538   183086.203407

mergify bot commented Apr 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 23, 2025
mgoin added 2 commits May 29, 2025 03:27
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@mergify mergify bot removed the needs-rebase label May 29, 2025
@mergify mergify bot added the performance Performance-related issues label Jun 23, 2025
@mgoin mgoin closed this Jul 31, 2025