Enable block-size > 1 by enable block table mapping #745

ganyi1996ppo · 2025-10-20T08:20:29Z

Purpose

Enable block-size > 1 for MLA by remapping the block table.
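
As a rough illustration of what remapping a block table could look like, the sketch below expands a table with `block_size` tokens per block into per-token slot indices, which is the layout a kernel written for block-size 1 would consume. This is an assumption about the general idea, not the code in this PR: the function name `remap_block_table`, its parameters, and the use of plain Python lists instead of tensors are all hypothetical.

```python
def remap_block_table(block_table, block_size, seq_len):
    """Expand a block table into per-token slot indices (hypothetical helper).

    block_table: physical block ids for one sequence, in logical order.
    block_size:  number of tokens stored in each block.
    seq_len:     number of tokens currently in the sequence.
    """
    slots = []
    for token_idx in range(seq_len):
        # Which logical block holds this token, and where inside that block.
        block_id = block_table[token_idx // block_size]
        offset = token_idx % block_size
        # Slot index as seen by a kernel that treats every token as its own block.
        slots.append(block_id * block_size + offset)
    return slots


# Two blocks of 4 tokens each, 6 tokens in the sequence.
print(remap_block_table([7, 2], block_size=4, seq_len=6))
# → [28, 29, 30, 31, 8, 9]
```

With `block_size=1` the function returns the block table unchanged, which matches the case the existing kernel already handles.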

Test Plan

Test Result

# launch script

export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1

# for profiling
# export VLLM_TORCH_PROFILER_DIR="deepseek_in3k_out1k"
# export VLLM_TORCH_PROFILER_WITH_STACK=1
# export VLLM_TORCH_PROFILER_RECORD_SHAPES=1

model_path="/mnt/raid0/zhangguopeng/deepseek-r1-FP8-Dynamic"
vllm serve $model_path \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --gpu_memory_utilization 0.9 \
  --block-size 128 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'

# gsm8k test
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9507|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: ganyi <ygan@amd.com>

tjtanaavllm · 2025-10-21T07:37:49Z

My understanding is that AITER MLA decode is optimized for block-size 1.

Based on this understanding, I have three questions regarding this PR:

Is this for DeepSeek V3.2?
Will there be performance gain when block-size > 1?
Are there new MLA decode kernels that support block-size > 1?

tjtanaavllm · 2025-10-22T05:17:39Z

For the best overall throughput, block-size=1 is still the setting to use.

Here's a comparison table of the benchmark results on DeepSeek-R1 PTPC FP8:

General Metrics

|Metric|Before PR (Block-size 1)|After PR (Block-size 1)|After PR (Block-size 16)|Best Performance|
|------|-----:|-----:|-----:|------|
|Successful requests|320|320|320|All equal|
|Benchmark duration (s)|359.51|354.92|360.31|After PR (Block-size 1)|
|Total generated tokens|298,762|294,597|300,636|After PR (Block-size 16)|
|Request throughput (req/s)|0.89|0.90|0.89|After PR (Block-size 1)|
|Output token throughput (tok/s)|831.03|830.05|834.37|After PR (Block-size 16)|
|Peak output token throughput (tok/s)|1,056.00|1,088.00|1,088.00|After PR (Block-size 1 & 16)|
|Total token throughput (tok/s)|4,020.30|4,060.55|4,016.48|After PR (Block-size 1)|

Latency Metrics

|Latency Metric|Before PR (Block-size 1)|After PR (Block-size 1)|After PR (Block-size 16)|Best Performance|
|------|-----:|-----:|-----:|------|
|Mean TTFT (ms)|1,923.96|1,522.36|1,742.48|After PR (Block-size 1)|
|Median TTFT (ms)|1,686.80|1,411.06|1,655.13|After PR (Block-size 1)|
|P99 TTFT (ms)|5,553.48|5,530.39|5,531.85|After PR (Block-size 1)|
|Mean TPOT (ms)|57.56|53.10|58.82|After PR (Block-size 1)|
|Median TPOT (ms)|35.44|36.28|35.15|After PR (Block-size 16)|
|P99 TPOT (ms)|721.81|805.45|647.23|After PR (Block-size 16)|
|Mean ITL (ms)|35.03|35.79|34.86|After PR (Block-size 16)|
|Median ITL (ms)|31.29|31.43|31.02|After PR (Block-size 16)|
|P99 ITL (ms)|209.77|211.47|210.17|Before PR|
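
To put the latency differences in proportion, the small sketch below computes the percentage change of each "After PR" configuration relative to "Before PR" for the three mean latency metrics. The numbers are copied from the table above; the helper name `pct_change` and the choice of metrics are only for illustration. Negative values mean lower (better) latency.

```python
# Latency figures in ms, copied from the table above:
# (Before PR, After PR block-size 1, After PR block-size 16)
latency_ms = {
    "Mean TTFT": (1923.96, 1522.36, 1742.48),
    "Mean TPOT": (57.56, 53.10, 58.82),
    "Mean ITL": (35.03, 35.79, 34.86),
}


def pct_change(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


changes = {}
for metric, (before, bs1, bs16) in latency_ms.items():
    changes[metric] = (pct_change(before, bs1), pct_change(before, bs16))
    print(
        f"{metric}: block-size 1 {changes[metric][0]:+.1f}%, "
        f"block-size 16 {changes[metric][1]:+.1f}%"
    )
```

Most of the deltas other than TTFT are within a few percent, so a single benchmark run may not be enough to separate them from run-to-run noise.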

Workload

#!/bin/bash
PORT=8000
SEED=0
CONCURRENCY=32
NREQUESTS=$(($CONCURRENCY * 10))
ISL=3584
OSL=1024
vllm bench serve --backend vllm \
--model EmbeddedLLM/deepseek-r1-FP8-Dynamic \
--dataset-name random \
--num-prompts ${NREQUESTS} \
--random-input ${ISL} \
--random-output ${OSL} \
--seed ${SEED} \
--max-concurrency ${CONCURRENCY} --port ${PORT} \
| tee afterblocksizelgt1_blocksize16_EmbeddedLLM_deepseek-r1-FP8-Dynamic_v1_random_isl${ISL}_osl${OSL}_con${CONCURRENCY}.log
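
As a sanity check on how the workload relates to the reported metrics, the sketch below recomputes the request count and input token volume from the script parameters and compares them with the "Before PR" column of the General Metrics table. It assumes that every random prompt has exactly `ISL` tokens and that total token throughput counts input plus generated tokens over the benchmark duration; both are assumptions, not statements about the benchmark tool.

```python
concurrency = 32
num_requests = concurrency * 10   # NREQUESTS in the script
isl, osl = 3584, 1024             # ISL and OSL in the script

# Assumption: each random prompt is exactly ISL tokens long.
total_input_tokens = num_requests * isl

# "Before PR" column of the General Metrics table.
duration_s = 359.51
generated_tokens = 298_762
output_tok_per_s = 831.03
total_tok_per_s = 4020.30

# Assumption: total throughput = (input + generated tokens) / duration.
implied_input_tokens = total_tok_per_s * duration_s - generated_tokens
relative_gap = abs(implied_input_tokens - total_input_tokens) / total_input_tokens

print(f"requests: {num_requests}")
print(f"expected input tokens: {total_input_tokens}")
print(f"implied input tokens:  {implied_input_tokens:.0f}")
print(f"relative gap: {relative_gap:.4%}")
print(f"output tok/s recomputed: {generated_tokens / duration_s:.2f}")
```

Under these assumptions the recomputed figures line up with the table: 320 requests, and an implied input volume within a small fraction of a percent of `NREQUESTS * ISL`.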

tjtanaavllm

LGTM. Amazing feature.

support block-size > 1 for mla by remapping block table

fbf26ca

Signed-off-by: ganyi <ygan@amd.com>

ganyi1996ppo requested review from kliuae-amd, tjtanaavllm, wuhuikx and zejunchen-zejun as code owners October 20, 2025 08:20

tjtanaavllm approved these changes Oct 22, 2025


tjtanaavllm merged commit 57faa0c into dev/perf Oct 22, 2025
4 of 5 checks passed

tjtanaavllm mentioned this pull request Oct 22, 2025

Dev/perf sync #754

Closed

5 tasks
