[Bugfix] fix tmp_out and exp_sums dimensions #17438
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
SageMoore left a comment:
Hi @hliuca. Do you have a test and/or example that triggers a failure without this PR?
csrc/rocm/attention.cu (Outdated)
Nit: Can you update the shape comments for the other kernels in this file?
Yes, here are the commands to reproduce the issue. query.shape[0] is much larger than block_tables.shape[0], which is the correct leading dimension according to the CUDA/HIP source code. Thank you.
Server-side commands:
docker.io/rocm/vllm-dev:nightly_main_20250420
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_MIN_NCHANNELS=112
export VLLM_FP8_PADDING=1
export VLLM_FP8_ACT_PADDING=1
export VLLM_FP8_WEIGHT_PADDING=1
export VLLM_FP8_REDUCE_CONV=1
export HIP_FORCE_DEV_KERNARG=1
export VLLM_USE_V1=1
vllm serve /data/huggingface/hub/amd/Llama-3.3-70B-Instruct-FP8-KV --dtype float16 --tensor-parallel-size 1 --kv-cache-dtype auto --quantization None --swap-space 16 --distributed-executor-backend mp --max-num-seqs 64 --max-model-len 16384 --max-seq-len-to-capture 16384 --max-num-batched-tokens 131072 --no-enable-prefix-caching --enable-chunked-prefill=False --disable-log-requests --uvicorn-log-level warning --port 8000
Client-side command:
python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --backend openai --port 8000 --model /data/huggingface/hub/amd/Llama-3.3-70B-Instruct-FP8-KV --dataset-name random --num-prompts 24 --random-input-len 8500 --random-output-len 150 --max-concurrency 8 --percentile-metrics ttft,tpot,itl,e2el
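For context, a rough back-of-the-envelope sketch of why this workload exposes the mismatch; the token counts below are estimates derived from the flags above, not measured values:

```python
# Rough, assumed numbers based on the serve/benchmark flags above -- not measured.
max_num_seqs = 64          # --max-num-seqs on the server
max_concurrency = 8        # --max-concurrency on the client
random_input_len = 8500    # --random-input-len on the client

# With chunked prefill disabled, a single prefill step can schedule whole
# prompts, so query.shape[0] can approach max_concurrency * random_input_len
# tokens, while block_tables.shape[0] stays at the number of running sequences.
approx_num_tokens = max_concurrency * random_input_len      # ~68,000
approx_num_seqs = min(max_concurrency, max_num_seqs)         # 8
print(f"tokens vs. sequences: ~{approx_num_tokens // approx_num_seqs}x larger")
```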
const int q_stride,
const int kv_block_stride,
const int kv_head_stride,
float* __restrict__ exp_sums,  // [num_seqs, num_heads, max_num_partitions]
Ok looking at this again, I think the problem is that this file is conflating num_tokens with num_seqs. num_tokens is the outermost dimension in the query and output tensors and num_seqs is the outermost dimension in the block_table. So what I'm saying is that we should update the query/output shape comments to be [num_tokens, ...] and leave the other shape arguments alone.
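To illustrate the distinction, a minimal PyTorch sketch; all shapes and sizes below are made-up examples, not vLLM's actual values:

```python
import torch

# Illustrative shapes only -- the sizes below are assumptions, not vLLM defaults.
num_seqs = 4                        # one row in block_tables per sequence
tokens_per_seq = [100, 1, 1, 1]     # one prefill plus three decodes in a step
num_tokens = sum(tokens_per_seq)    # one row in query/output per scheduled token

num_heads, head_size, max_blocks_per_seq = 8, 128, 32

query = torch.randn(num_tokens, num_heads, head_size)        # [num_tokens, ...]
output = torch.empty_like(query)                              # [num_tokens, ...]
block_tables = torch.zeros(num_seqs, max_blocks_per_seq,
                           dtype=torch.int32)                 # [num_seqs, ...]

# num_tokens can greatly exceed num_seqs, so per-sequence intermediates such as
# exp_sums/max_logits/tmp_out must be sized from block_tables, not from query.
assert query.shape[0] >= block_tables.shape[0]
```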
@gshtras and I will work together to address the comments in the source code. Thank you.
The conclusion is that for V0 and V1 the kernel should be called with different values for num_seqs. But in the kernel itself the value does represent the number of sequences, so we'll revert the comment change, leaving just the V1 callsite change.
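A minimal sketch of what sizing the partition buffers from block_tables (rather than query) looks like at a callsite; the function name, argument names, and partition size are illustrative assumptions, not the exact identifiers or constants used in vLLM:

```python
import torch

def alloc_partition_buffers(query, block_tables, num_heads, head_size,
                            max_seq_len, partition_size=256):
    """Size the partition-reduction buffers from block_tables, not from query.

    Names and the partition size are assumptions for illustration only.
    """
    num_seqs = block_tables.shape[0]          # correct leading dimension
    # num_tokens = query.shape[0]             # the wrong one: can be far larger
    max_num_partitions = (max_seq_len + partition_size - 1) // partition_size

    exp_sums = torch.empty(num_seqs, num_heads, max_num_partitions,
                           dtype=torch.float32, device=query.device)
    max_logits = torch.empty_like(exp_sums)
    tmp_out = torch.empty(num_seqs, num_heads, max_num_partitions, head_size,
                          dtype=query.dtype, device=query.device)
    return exp_sums, max_logits, tmp_out
```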
float* __restrict__ exp_sums,    // [block_tables.size(0), num_heads, max_num_partitions]
float* __restrict__ max_logits,  // [block_tables.size(0), num_heads, max_num_partitions]
scalar_t* __restrict__ out,      // [block_tables.size(0), num_heads, max_num_partitions, head_size]
OUTT* __restrict__ final_out,    // [num_seqs, num_heads, head_size]
It looks like final_out is a dead argument? Meaning, I don't see it used anywhere in this file. Are we sure these kernels are actually used?
Looks like this one is not related to the PR. We can run a round of cleanups in a follow-up.
@SageMoore I hope the above answers your questions.
The first dimension of tmp_out and exp_sums should be inferred from block_tables.size(0), but it was being taken from query.shape[0]. The latter can be much larger than block_tables.size(0), which may cause OOM.
This PR fixes total_num_seq and updates the shape comments.
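For a sense of scale, a hedged estimate of the over-allocation; the head count, head size, partition count, and element width below are assumptions for illustration rather than values taken from the model or kernel:

```python
# All sizes below are assumed for illustration; only the leading dimensions
# reflect the discussion above (prefill-heavy batch vs. --max-num-seqs).
def tmp_out_bytes(leading_dim, num_heads=64, head_size=128,
                  max_num_partitions=64, elem_bytes=4):
    # tmp_out is [leading_dim, num_heads, max_num_partitions, head_size]
    return leading_dim * num_heads * max_num_partitions * head_size * elem_bytes

wrong = tmp_out_bytes(leading_dim=68_000)   # sized from query.shape[0]
right = tmp_out_bytes(leading_dim=64)       # sized from block_tables.size(0)
print(f"{wrong / 2**30:.1f} GiB vs {right / 2**30:.3f} GiB")
```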