
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing #12501

Merged: 9 commits merged into vllm-project:main on Feb 7, 2025

Conversation

@tjtanaa (Contributor) commented Jan 28, 2025

Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing

Co-authored-by: @kliuae

Note: This feature requires ROCm 6.3 or later and an MI300-series or newer GPU.

Description

This PR involves the following enhancements:

  1. Support for Per-Token-Activation Per-Channel-Weight (PTPC-FP8) FP8 quantization inferencing.
     The model is quantized on the fly from BFloat16 to FP8; model weights stored in Float16 are first cast to BFloat16. (A sketch of the scale computation follows this list.)

  2. It uses PyTorch's rowwise scaled GEMM support in torch._scaled_mm, introduced in [ROCm] hipblaslt rowwise f8 gemm pytorch/pytorch#144432, which speeds up the previous naive implementation by at least 2x. For more details, see the Performance section.

  • To support this feature, the PyTorch repo commit in Dockerfile.rocm_base has been updated to 3a585126.
  • Dockerfile.rocm is left untouched, as its base image references the AMD Docker Hub registry; at this point in time that base image already has PyTorch repo commit 3a585126 installed.

  3. Small enhancement: the documentation has been updated to ROCm 6.3, and the commits in the installation steps have been updated to match those in Dockerfile.rocm_base.
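
For readers unfamiliar with the scheme, here is a minimal sketch (illustrative only, not the code added by this PR) of how per-token activation scales and per-channel weight scales can be derived for FP8; the dtype choice and helper names are assumptions.

```python
# Minimal PTPC-FP8 scale-computation sketch; illustrative, not the vLLM code.
# Assumes a PyTorch build with FP8 dtypes available.
import torch

FP8_DTYPE = torch.float8_e4m3fnuz          # FP8 flavor used on MI300 (ROCm)
FP8_MAX = torch.finfo(FP8_DTYPE).max

def quantize_activation_per_token(x: torch.Tensor):
    """x: [num_tokens, hidden] (bf16) -> (fp8 tensor, per-token scales [num_tokens, 1])."""
    scale = x.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX
    scale = scale.clamp(min=1e-12)
    x_fp8 = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return x_fp8, scale

def quantize_weight_per_channel(w: torch.Tensor):
    """w: [out_features, in_features] (bf16) -> (fp8 tensor, per-channel scales [out_features, 1])."""
    scale = w.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX
    scale = scale.clamp(min=1e-12)
    w_fp8 = (w.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return w_fp8, scale
```

At inference time the FP8 GEMM output is rescaled by these per-token and per-channel scales, which is exactly what the naive path shown under Performance does after calling torch._scaled_mm.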

Performance

Perplexity Test

Model: Llama-3.1-8B-Instruct
Dataset: Wikitexts
GPU: MI300X

| Model | Quantization | KVCacheDtype | Tasks | Metric | Score |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | auto (bf16) | auto (bf16) | wikitext | word_perplexity | 9.4281 |
| Llama-3.1-8B-Instruct | fp8 | fp8_e4m3 | wikitext | word_perplexity | 9.5124 |
| Llama-3.1-8B-Instruct | ptpc_fp8 | fp8_e4m3 | wikitext | word_perplexity | 9.5093 |
| Llama-3.1-8B-Instruct | ptpc_fp8 (naive) | fp8_e4m3 | wikitext | word_perplexity | 9.5095 |
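
For context, a Wikitext perplexity number like those above could be reproduced with lm-eval-harness's vLLM backend roughly as follows; the model path, result keys, and exact argument spelling are assumptions rather than commands taken from this PR.

```python
# Hypothetical reproduction sketch using lm-eval-harness's vLLM backend;
# arguments and result keys are assumptions, not taken from the PR.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=("pretrained=meta-llama/Llama-3.1-8B-Instruct,"
                "quantization=ptpc_fp8,kv_cache_dtype=fp8_e4m3"),
    tasks=["wikitext"],
)
print(results["results"]["wikitext"])  # includes word_perplexity among other metrics
```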

Speed Test (old naive implementation vs. torch._scaled_mm rowwise scaled GEMM)

Model: Llama-3.1-70B-Instruct
Dataset: ShareGPT
GPU: 1x MI300X

| Quantization | KVCacheDtype | Req/s | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| ptpc_fp8 (naive) | fp8_e4m3 | 2.43 | 1003.46 | 481.28 |
| ptpc_fp8 (torch._scaled_mm rowwise scaled GEMM) | fp8_e4m3 | 6.36 | 2631.04 | 1261.91 |
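
To make the comparison concrete, enabling the PTPC-FP8 path benchmarked above looks roughly like this from the Python API. The quantization and KV-cache dtype strings are taken from the tables; the model path and sampling settings are illustrative.

```python
# Usage sketch: on-the-fly PTPC-FP8 on ROCm (MI300 or newer, ROCm 6.3+).
# Model path and sampling parameters are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          quantization="ptpc_fp8",       # per-token activation, per-channel weight FP8
          kv_cache_dtype="fp8_e4m3")     # FP8 KV cache, as in the benchmarks above
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```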

PTPC_FP8 (naive)


  # Making sure the dummy tensor is on the same device as the weight
  global TORCH_DEVICE_IDENTITY
  if TORCH_DEVICE_IDENTITY.device != weight.device:
      TORCH_DEVICE_IDENTITY = TORCH_DEVICE_IDENTITY.to(weight.device)

  # GEMM
  # This computes C = (X * W).
  # Output in fp32 to allow subsequent ops to happen in-place
  output = torch._scaled_mm(qinput,
                            weight,
                            scale_a=TORCH_DEVICE_IDENTITY,
                            scale_b=TORCH_DEVICE_IDENTITY,
                            out_dtype=torch.float32)
  # A fix for discrepancy in scaled_mm which returns tuple
  # for torch < 2.5 and a single value in torch >= 2.5
  if type(output) is tuple and len(output) == 2:
      output = output[0]
  # Unpad (undo num_token_padding)
  output = torch.narrow(output, 0, 0, input_2d.shape[0])
  x_scale = torch.narrow(x_scale, 0, 0, input_2d.shape[0])

  # DQ
  # C = sw * sx * (X * W) + bias
  output = output * x_scale * weight_scale.t()
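
For comparison, here is a sketch of the fast path enabled by the rowwise scaled GEMM support (my reading of the PR description, not a verbatim excerpt): the per-token and per-channel scales are handed directly to torch._scaled_mm instead of being applied in a separate dequantization step.

```python
# Sketch of the rowwise scaled GEMM fast path (assumed shapes, not a verbatim
# excerpt from this PR):
#   qinput       : [M, K] fp8, per-token quantized activations
#   weight       : [K, N] fp8, per-channel quantized weights
#   x_scale      : [M, 1] fp32, per-token (row-wise) activation scales
#   weight_scale : [N, 1] fp32, per-output-channel weight scales
output = torch._scaled_mm(qinput,
                          weight,
                          scale_a=x_scale,           # row-wise scales for A
                          scale_b=weight_scale.t(),  # [1, N] column-wise scales for B
                          out_dtype=input_2d.dtype)  # e.g. torch.bfloat16
# No separate "output * x_scale * weight_scale.t()" step is needed; hipBLASLt
# applies the scales inside the GEMM, which is where the ~2.6x speedup comes from.
```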

Additional side fixes made while working on this PR

Fix FP8 unit tests for ROCm

@pytest.mark.skipif(not is_quant_method_supported("fp8"),
                    reason="FP8 is not supported on this GPU type.")
@pytest.mark.parametrize("model_id", KV_CACHE_MODELS)
def test_kv_cache_model_load_and_run(vllm_runner, model_id: str):
    with vllm_runner(model_id, kv_cache_dtype="fp8") as llm:

        def check_model(model):
            attn = model.model.layers[0].self_attn.attn

            assert isinstance(attn.quant_method, Fp8KVCacheMethod)

            if not current_platform.is_rocm():
                # NOTE: This code path requires validation on Non-CUDA platform
                # NOTE: it is valid for scales to be 1.0 (default value), but
                # we know these checkpoints have scales < 1.0
                assert 0.0 < attn._k_scale < 1.0
                assert 0.0 < attn._v_scale < 1.0
            else:
                # NOTE: This code path is for the ROCm platform
                # NOTE: it is valid for scales to be 1.0 (default value), but
                # we know these checkpoints have scales < 1.0
                # However, on the ROCm platform the _k_scale and _v_scale are
                # scaled by a factor of 2, as described in
                # vllm/model_executor/layers/quantization/kv_cache.py
                assert 0.0 < attn._k_scale < (1.0 * 2.0)
                assert 0.0 < attn._v_scale < (1.0 * 2.0)

        llm.apply_model(check_model)

        # note: this does not test accuracy, just that we can run through
        # see lm-eval tests for accuracy
        outputs = llm.generate_greedy(prompts=["Hello my name is"],
                                      max_tokens=10)
        print(outputs[0][1])
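
As background for the factor of 2 in the ROCm branch (my understanding of the rationale, not text from the PR): MI300 uses the fnuz FP8 variant, whose representable range is roughly half that of the OCP e4m3fn format, so checkpoint KV-cache scales are doubled when loaded on ROCm, as noted in vllm/model_executor/layers/quantization/kv_cache.py.

```python
# Illustrative sketch of why the ROCm assertions use an upper bound of 1.0 * 2.0.
# (Rationale is my understanding, not quoted from the PR.)
import torch

print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0  (OCP e4m3, NVIDIA)
print(torch.finfo(torch.float8_e4m3fnuz).max)  # 240.0  (fnuz e4m3, MI300)

# A checkpoint scale calibrated against e4m3fn ...
checkpoint_scale = 0.75
# ... is doubled when the KV cache is stored as e4m3fnuz on ROCm,
rocm_scale = checkpoint_scale * 2.0
# which is why the test allows values up to (1.0 * 2.0) on that platform.
assert 0.0 < rocm_scale < 1.0 * 2.0
```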

kliuae and others added 3 commits January 28, 2025 06:32

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin (Member) left a comment


Thanks for your contribution. Do you think it would be possible to implement this inside of fp8.py? It seems we could just change the default --quantization fp8 on an unquantized model to use per-token and per-channel. Given the cutlass and pytorch support we have now, I don't think there is a great reason to rely on per-tensor by default anymore

@tjtanaa (Contributor, Author) commented Jan 29, 2025

> Thanks for your contribution. Do you think it would be possible to implement this inside of fp8.py? It seems we could just change the default --quantization fp8 on an unquantized model to use per-token and per-channel. Given the cutlass and pytorch support we have now, I don't think there is a great reason to rely on per-tensor by default anymore

@mgoin We think it should be possible to implement PTPC-FP8 inside fp8.py and force --quantization fp8 on an unquantized model to use per-token and per-channel quantization on ROCm. Changing the default behavior of --quantization fp8 on NVIDIA GPUs would require an additional PR.

@hongxiayang @mgoin Maybe we could get input from AMD on whether there is any preference or demand for keeping per-tensor quantization for backward compatibility.

Since we are on this topic: I recall that making vLLM production-ready is a goal, so I wonder whether we need to maintain some level of backward compatibility so that feature behavior changes as little as possible.
Moreover, when we added this new quantization feature and wanted to document its behavior, usage, and expectations, we couldn't find a page for it. Is there an RFC for documentation of quantization approaches?

@mgoin (Member) left a comment


After offline discussion, it would be best to keep this separate from the fp8 quantization and treat the NVIDIA case separately. I think this is good to land once these last requests are addressed.

@tjtanaa tjtanaa changed the title [ROCm] [Feature] [Doc] [Dockerfile] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing Feb 1, 2025
@tjtanaa (Contributor, Author) commented Feb 5, 2025

@mgoin I have addressed the comments. Is it ready for merging?

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 7, 2025
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 7, 2025 14:44
@simon-mo simon-mo merged commit eaa92d4 into vllm-project:main Feb 7, 2025
70 of 72 checks passed
AoyuQC pushed a commit to AoyuQC/vllm that referenced this pull request Feb 8, 2025
ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025
SzymonOzog pushed a commit to SzymonOzog/vllm that referenced this pull request Feb 12, 2025
panf2333 pushed a commit to yottalabsai/vllm that referenced this pull request Feb 18, 2025
kerthcet pushed a commit to kerthcet/vllm that referenced this pull request Feb 21, 2025
Labels: ci/build, documentation, ready, rocm
5 participants