ganyi1996ppo commented Oct 16, 2025

Purpose

  • Support the MLA persistent kernel.
  • Support FP8 MLA.
    aiter branch: mla_splitkv_enhance_split_alg_inte

Test Plan

Serving script:

export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1

# for profiling
# export VLLM_TORCH_PROFILER_DIR="deepseek_in3k_out1k"
# export VLLM_TORCH_PROFILER_WITH_STACK=1
# export VLLM_TORCH_PROFILER_RECORD_SHAPES=1

model_path="/mnt/raid0/zhangguopeng/deepseek-r1-FP8-Dynamic"
# --kv-cache-dtype fp8 enables the fp8 KV cache; remove that flag to use bf16 for MLA.
vllm serve "$model_path" \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --gpu-memory-utilization 0.9 \
  --block-size 1 \
  --kv-cache-dtype fp8

Test Result

Accuracy test result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9507|±  |0.0060|

Accuracy result for FP8 MLA:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.953|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.953|±  |0.0058|
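
As a rough sanity check on the two gsm8k tables above, the gap between the flexible-extract scores can be compared against their combined standard error. The sketch below is not part of the PR; it assumes the two runs can be treated as independent (they share the same test set, so this is only an approximation), and the variable names are illustrative.

```shell
# Values copied from the flexible-extract rows of the two tables above.
first_acc=0.9522; first_se=0.0059   # first table
fp8_acc=0.953;    fp8_se=0.0058     # fp8 MLA table

# z = |a - b| / sqrt(se_a^2 + se_b^2)
# Assumption: the two stderr values are independent, which is only approximate here.
z=$(awk -v a="$first_acc" -v sa="$first_se" -v b="$fp8_acc" -v sb="$fp8_se" \
  'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.3f", d / sqrt(sa*sa + sb*sb) }')

echo "z = $z"
```

A value of z well below 1 means the difference is smaller than one combined standard error, so under these assumptions the two configurations are not distinguishable on this benchmark.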

Signed-off-by: ganyi <ygan@amd.com>
@sunway513

Was this PR also submitted to upstream vLLM?

ganyi1996ppo commented Oct 23, 2025

> Was this PR also submitted to upstream vLLM?

The upstream PR is vllm-project#27380.
FYI, the upstream PR will remain a draft until the persistent kernel is merged into aiter's main branch.
