[Feature] Non-contiguous Support for FP8 Quantization #21961

yewentao256 · 2025-07-30T18:04:10Z

Purpose

Fix #19630

- Use the common vectorization utils
- Support non-contiguous input (last stride == 1)
- Allocate with empty instead of zeros
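To make the "last stride == 1" condition concrete, here is a minimal pure-Python sketch of a max-reduction over a strided 2D view. It is not the kernel from this PR: the names `strided_absmax` and `dynamic_scale` are hypothetical, and the scale formula (absolute max divided by 448, the largest finite float8_e4m3fn value) is an assumption about how a dynamic per-tensor scale is commonly derived.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def strided_absmax(buf, num_rows, hidden, row_stride):
    """Return max(|x|) over a 2D view stored in a flat buffer.

    Row r starts at r * row_stride; elements within a row are adjacent
    (last stride == 1). Padding between rows is never read.
    """
    if row_stride < hidden:
        raise ValueError("row_stride must be >= hidden")
    amax = 0.0
    for r in range(num_rows):
        base = r * row_stride  # the only place the row stride matters
        for c in range(hidden):
            amax = max(amax, abs(buf[base + c]))
    return amax


def dynamic_scale(amax):
    """Assumed per-tensor scale: maps amax onto the FP8 E4M3 maximum."""
    return amax / FP8_E4M3_MAX


# 3 rows x 4 valid columns inside a buffer with 6 elements per row.
# The trailing 99.0 entries are padding that is not part of the view.
buf = [
    1.0, -2.0, 3.0, -4.0, 99.0, 99.0,
    0.5, 7.0, -8.0, 2.5, 99.0, 99.0,
    -1.5, 6.0, 0.0, 4.0, 99.0, 99.0,
]
amax = strided_absmax(buf, num_rows=3, hidden=4, row_stride=6)
scale = dynamic_scale(amax)
print(amax, scale)
```

Because each row is still unit-stride, the inner loop reads adjacent elements, which is what keeps vectorized loads possible. Only the row base offset depends on the stride.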

Test

Acc

lm_eval --model vllm --model_args "pretrained=nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8,max_model_len=32768,enable_expert_parallel=True,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

# Now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7574|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7354|±  |0.0122|

# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7574|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.7354|±  |0.0122|

Perf

vllm bench throughput --model nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8 --input-len 1000 --output-len 100 --trust_remote_code --enforce_eager

# Now
Throughput: 49.45 requests/s, 54306.45 total tokens/s, 4944.88 output tokens/s
# main
Throughput: 48.92 requests/s, 53730.95 total tokens/s, 4892.48 output tokens/s

Signed-off-by: yewentao256 <zhyanwentao@126.com>

github-actions · 2025-07-30T18:04:17Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request introduces support for non-contiguous tensors in FP8 quantization kernels. However, the implementation of segmented_max_reduction_strided has critical correctness and performance issues that need to be addressed before merging.

csrc/quantization/fp8/common.cu

mgoin

Nice work! Can you add at least one case to the unit test that expresses this type of non-contiguous input?

vllm/_custom_ops.py

csrc/quantization/fused_kernels/layernorm_utils.cuh

csrc/quantization/fp8/common.cu

mgoin · 2025-08-01T16:54:10Z

@yewentao256 can you add a simple unit test for this before merge?

yewentao256 · 2025-08-01T21:10:45Z

> @yewentao256 can you add a simple unit test for this before merge?

Thanks for the comment! Done

(.wentao_env) wentao@gpu66:~/vllm-source/tests/quantization$ pytest test_fp8.py::test_scaled_fp8_quant
================================ test session starts ================================
platform linux -- Python 3.12.3, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/wentao/vllm-source
configfile: pyproject.toml
plugins: asyncio-1.0.0, anyio-4.9.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                   

test_fp8.py ..                                                                [100%]

================================= 2 passed in 2.12s =================================
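The unit test added here is not reproduced in this thread. As a hedged illustration of what such a case usually checks, the sketch below runs the same logical data through a padded (non-contiguous) layout and a packed (contiguous) layout and expects identical output. `quantize_strided` is a hypothetical stand-in, not the vLLM op, and it only scales and clamps, without the rounding that real FP8 conversion performs. In PyTorch, an input of this shape typically comes from slicing the leading columns of a wider tensor.

```python
def quantize_strided(buf, num_rows, hidden, row_stride, scale, qmax=448.0):
    """Scale and clamp each valid element; output is written contiguously.

    Simplification: real FP8 conversion also rounds to the nearest
    representable value, which this sketch leaves out.
    """
    out = []
    for r in range(num_rows):
        base = r * row_stride
        for c in range(hidden):
            q = buf[base + c] / scale
            out.append(max(-qmax, min(qmax, q)))
    return out


def test_non_contiguous_matches_contiguous():
    num_rows, hidden, padded = 4, 8, 13
    # Wide storage; only the first `hidden` columns of each row are in the view.
    storage = [float((i * 7) % 11 - 5) for i in range(num_rows * padded)]
    # Contiguous copy of the same logical values.
    packed = [
        storage[r * padded + c]
        for r in range(num_rows)
        for c in range(hidden)
    ]

    scale = 0.01
    got = quantize_strided(storage, num_rows, hidden, padded, scale)
    want = quantize_strided(packed, num_rows, hidden, hidden, scale)
    assert got == want
    assert len(got) == num_rows * hidden


test_non_contiguous_matches_contiguous()
```

Comparing against a contiguous copy keeps the check independent of the exact quantization arithmetic: whatever the kernel computes, the layout must not change the result.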


yewentao256 added 6 commits July 29, 2025 14:33

- 0f6ae43 init version
- 728d993 Merge branch 'main' into wye-apply-vectorization-util-to-fp8-quant
- 91711e9 Merge branch 'main' into wye-apply-vectorization-util-to-fp8-quant
- 2e3b778 non-contiguous support
- 9f5d4ab use strided only
- 51e9688 empty instead of zeros

gemini-code-assist bot reviewed Jul 30, 2025

csrc/quantization/fp8/common.cu (comment resolved)

yewentao256 added 2 commits July 30, 2025 14:19

- 040706f fix segmented_max_reduction_strided issue
- 219b9bb fix int64 overflow

mgoin self-assigned this Jul 30, 2025

mgoin requested review from LucasWilkinson and mgoin July 30, 2025 19:43

mgoin reviewed Jul 30, 2025

yewentao256 added 4 commits July 30, 2025 16:40

- 698e634 update through comments
- 53e0f2c use int64_t
- 13be9a2 update 0
- c56ac2b add back comments

yewentao256 added 2 commits August 1, 2025 17:02

- 0924f14 Merge branch 'main' into wye-apply-vectorization-util-to-fp8-quant
- 7dd1586 add unit test

yewentao256 requested a review from robertgshaw2-redhat as a code owner August 1, 2025 21:10

mgoin approved these changes Aug 1, 2025

mgoin added the performance (Performance-related issues) and deepseek (Related to DeepSeek models) labels Aug 1, 2025

mgoin enabled auto-merge (squash) August 1, 2025 21:19

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 1, 2025

mgoin and others added 2 commits August 2, 2025 16:51

- dce498b Merge branch 'main' into wye-apply-vectorization-util-to-fp8-quant
- 0679880 Merge branch 'vllm-project:main' into wye-apply-vectorization-util-to-fp8-quant

vllm-bot merged commit 4771df7 into vllm-project:main Aug 5, 2025
72 of 74 checks passed

gshtras mentioned this pull request Aug 5, 2025

[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm #22264

Merged

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

61d0a0a [Feature] Non-contiguous Support for FP8 Quantization (vllm-project#21961)
Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mgoin <mgoin64@gmail.com>

myselvess pushed a commit to myselvess/vllm that referenced this pull request Aug 7, 2025

cadd196 [Feature] Non-contiguous Support for FP8 Quantization (vllm-project#21961)
Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mgoin <mgoin64@gmail.com>

gshtras mentioned this pull request Aug 13, 2025

Disable soft fail on AMD tests vllm-project/ci-infra#139

Merged

yewentao256 mentioned this pull request Aug 19, 2025

Make sure that vectorize_with_alignment produced vectorized global loads #23182

Merged

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

ca4359c [Feature] Non-contiguous Support for FP8 Quantization (vllm-project#21961)
Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mgoin <mgoin64@gmail.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

6d2bbab [Feature] Non-contiguous Support for FP8 Quantization (vllm-project#21961)
Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mgoin <mgoin64@gmail.com>
