[script] run with pure TP8 #736
Conversation
Please follow these instructions https://docs.vllm.ai/en/latest/contributing/index.html#linting and add the link in the README:
python3 -m pip install pre-commit
pre-commit run --hook-stage manual markdownlint
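For context, markdownlint appears to run only at pre-commit's manual stage (hence the --hook-stage manual flag above), and pre-commit's --files option can restrict the run to the files you touched; a minimal sketch, with README.md as the example target:
python3 -m pip install pre-commit
pre-commit run --hook-stage manual markdownlint --files README.md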
@gbyu-amd This will improve the performance further.
Could we also enable this in this PR? There are issues on MI300X when using a newer AITER; KF could only test the Qwen3-Coder-PTPC-FP8 model and saw improvements.
Signed-off-by: guanbao <gyu@amd.com>
LGTM. We will fix the other pre-commit issues in another PR.
Signed-off-by: guanbao <gyu@amd.com>
Force-pushed from 403923a to bc177bd.
After adding the compilation config to enable rmsnorm+quant fusion, there seems to be an accuracy issue with DeepSeek PTPC.

Full server cmd:

export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1
# original weight https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic
model_path="/mnt/raid0/guanbao/EmbeddedLLM/deepseek-r1-FP8-Dynamic"
vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "pass_config": {"enable_fusion": true, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}' \
--gpu_memory_utilization 0.9 \
--block-size 1

lm_eval cmd:

#!/bin/bash
model="/mnt/raid0/guanbao/EmbeddedLLM/deepseek-r1-FP8-Dynamic"
lm_eval \
--model local-completions \
--tasks gsm8k \
--model_args model=${model},base_url=http://127.0.0.1:8000/v1/completions \
--batch_size 100
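For comparison, one way to confirm the fusion pass is the culprit would be to relaunch with enable_fusion set to false and rerun the same lm_eval command; this is a sketch of a baseline launch mirroring the flags above, not a verified repro:

# Same launch as above, but with the rmsnorm+quant fusion pass disabled,
# to check whether the gsm8k accuracy recovers.
vllm serve $model_path \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--trust-remote-code \
--no-enable-prefix-caching \
--disable-log-requests \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "pass_config": {"enable_fusion": false, "enable_noop": true, "enable_attn_fusion": false}, "custom_ops": ["+rms_norm", "+quant_fp8"]}' \
--gpu_memory_utilization 0.9 \
--block-size 1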
LGTM. The bug will be addressed in another PR.

Change to TP8 since it currently gives better performance than TP8 + EP8 (see the sketch below).
Change 3500 to 3.5 * 1024 (= 3584).
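For reference, the two setups being compared differ only in whether expert parallelism is layered on top of tensor parallelism; a minimal sketch using vLLM's --enable-expert-parallel flag (model path reused from the commands above):

# Pure TP8: every layer, including the MoE experts, is sharded across all 8 GPUs.
vllm serve $model_path --tensor-parallel-size 8

# TP8 + EP8: MoE expert layers are distributed with expert parallelism instead.
vllm serve $model_path --tensor-parallel-size 8 --enable-expert-parallel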
TP8 lm_eval result:
