
Conversation

@gbyu-amd

@gbyu-amd gbyu-amd commented Oct 14, 2025

Switch to TP8, since it currently gives better performance than TP8 + EP8.
Change 3500 to 3.5 * 1024 (= 3584).

TP8 lm_eval result:
(screenshot of lm_eval results)

@wuhuikx

wuhuikx commented Oct 14, 2025

Please follow the linting instructions at https://docs.vllm.ai/en/latest/contributing/index.html#linting and add the link to the README:

python3 -m pip install pre-commit
pre-commit install

pre-commit run --hook-stage manual markdownlint
pre-commit run --hook-stage manual mypy-3.12
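
If only the README is being changed, the markdownlint hook can also be run against that file alone (standard pre-commit usage; README.md at the repo root is an assumption here):

pre-commit run --hook-stage manual markdownlint --files README.md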

@tjtanaavllm

tjtanaavllm commented Oct 14, 2025

@gbyu-amd
We have just merged [Feat][aiter][ROCm] Add aiter rmsnorm and ptpc fp8 quant fusion #735

This will improve the perf further

--compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'

Could we also enable this in this PR?
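
For context, a minimal sketch of how this flag could be added to the serve command (the model path and TP size below are placeholders taken from elsewhere in this thread, not the exact command):

vllm serve $model_path \
  --tensor-parallel-size 8 \
  --compilation-config '{"pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'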

There are issues on MI300X when using newer AITER; KF could only test this Qwen3-Coder-PTPC-FP8 model and saw improvements.

local-completions (model=EmbeddedLLM/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8886|±  |0.0087|
|     |       |strict-match    |     5|exact_match|↑  |0.8650|±  |0.0094|
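
For reproducibility, the header line above corresponds roughly to this lm_eval invocation (reconstructed from that line, so treat it as a sketch rather than the exact command that was run):

lm_eval \
  --model local-completions \
  --tasks gsm8k \
  --model_args model=EmbeddedLLM/Qwen3-Coder-480B-A35B-Instruct-FP8-Dynamic,base_url=http://127.0.0.1:6789/v1/completions \
  --batch_size 100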

| Metric | Without Fused RMS Norm | With Fused RMS Norm | Difference | % Change |
|--------|-----------------------:|--------------------:|-----------:|---------:|
| **Overall Performance** | | | | |
| Successful requests | 640 | 640 | 0 | 0% |
| Benchmark duration (s) | 470.49 | 462.86 | -7.63 | -1.62% |
| Request throughput (req/s) | 1.36 | 1.38 | +0.02 | +1.47% |
| **Token Throughput** | | | | |
| Output token throughput (tok/s) | 1392.93 | 1415.90 | +22.97 | +1.65% |
| Peak output token throughput (tok/s) | 2304.00 | 2368.00 | +64.00 | +2.78% |
| Total token throughput (tok/s) | 6268.20 | 6371.54 | +103.34 | +1.65% |
| **Concurrency** | | | | |
| Peak concurrent requests | 71.00 | 75.00 | +4.00 | +5.63% |
| **Time to First Token (TTFT)** | | | | |
| Mean TTFT (ms) | 2281.44 | 2544.96 | +263.52 | +11.55% |
| Median TTFT (ms) | 2014.27 | 2116.78 | +102.51 | +5.09% |
| P99 TTFT (ms) | 11940.64 | 11891.41 | -49.23 | -0.41% |
| **Time Per Output Token (TPOT)** | | | | |
| Mean TPOT (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median TPOT (ms) | 44.62 | 43.40 | -1.22 | -2.73% |
| P99 TPOT (ms) | 46.01 | 45.53 | -0.48 | -1.04% |
| **Inter-token Latency (ITL)** | | | | |
| Mean ITL (ms) | 43.73 | 42.73 | -1.00 | -2.29% |
| Median ITL (ms) | 28.65 | 28.33 | -0.32 | -1.12% |
| P99 ITL (ms) | 685.79 | 672.58 | -13.21 | -1.93% |

guanbao added 2 commits October 14, 2025 13:49
Signed-off-by: guanbao <gyu@amd.com>
@tjtanaavllm

tjtanaavllm commented Oct 14, 2025

LGTM. We will fix the other pre-commit issues in another PR.
I will wait for your updated command.

Signed-off-by: guanbao <gyu@amd.com>
@gbyu-amd
Author

[Feat][aiter][ROCm] Add aiter rmsnorm and ptpc fp8 quant fusion #735

After adding the compilation config to enable the RMSNorm + quant fusion, there seems to be an accuracy issue with DeepSeek PTPC:
(screenshot of lm_eval results showing the accuracy drop)

Full server cmd:

export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1

# original weight  https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic
model_path="/mnt/raid0/guanbao/EmbeddedLLM/deepseek-r1-FP8-Dynamic"

vllm serve $model_path \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}' \
  --gpu-memory-utilization 0.9 \
  --block-size 1

lm_eval cmd:

#!/bin/bash
model="/mnt/raid0/guanbao/EmbeddedLLM/deepseek-r1-FP8-Dynamic"
lm_eval \
--model local-completions \
--tasks gsm8k \
--model_args model=${model},base_url=http://127.0.0.1:8000/v1/completions \
--batch_size 100
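
One sanity check (a sketch, not something run here) would be to relaunch the same serve command with only enable_fusion flipped to false in the compilation config, which should fall back to the unfused RMSNorm + quant path and confirm whether the fusion pass is the cause:

  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "pass_config": {"enable_fusion": false, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}'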

@tjtanaavllm tjtanaavllm self-requested a review October 14, 2025 12:37
@tjtanaavllm

LGTM. The bug will be addressed in another PR.

@tjtanaavllm tjtanaavllm merged commit 6e1c93e into dev/perf Oct 14, 2025
2 of 3 checks passed