[Bugfix] fix tmp_out and exp_sums dimensions #17438
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
SageMoore left a comment:
Hi @hliuca. Do you have a test and/or example that triggers a failure without this PR?
csrc/rocm/attention.cu (Outdated)
Nit: Can you update the shape comments for the other kernels in this file?
Yes, here are the commands to reproduce the issue. query.shape[0] is much larger than block_tables.shape[0], which is the correct leading dimension according to the CUDA/HIP source code. Thank you.
Server-side commands:
docker.io/rocm/vllm-dev:nightly_main_20250420
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_MIN_NCHANNELS=112
export VLLM_FP8_PADDING=1
export VLLM_FP8_ACT_PADDING=1
export VLLM_FP8_WEIGHT_PADDING=1
export VLLM_FP8_REDUCE_CONV=1
export HIP_FORCE_DEV_KERNARG=1
export VLLM_USE_V1=1
vllm serve /data/huggingface/hub/amd/Llama-3.3-70B-Instruct-FP8-KV --dtype float16 --tensor-parallel-size 1 --kv-cache-dtype auto --quantization None --swap-space 16 --distributed-executor-backend mp --max-num-seqs 64 --max-model-len 16384 --max-seq-len-to-capture 16384 --max-num-batched-tokens 131072 --no-enable-prefix-caching --enable-chunked-prefill=False --disable-log-requests --uvicorn-log-level warning --port 8000
Client-side command:
python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --backend openai --port 8000 --model /data/huggingface/hub/amd/Llama-3.3-70B-Instruct-FP8-KV --dataset-name random --num-prompts 24 --random-input-len 8500 --random-output-len 150 --max-concurrency 8 --percentile-metrics ttft,tpot,itl,e2el
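For context, a rough back-of-the-envelope sketch of why this workload exposes the mismatch; the token counts below are estimates derived from the flags above, not measured values:

```python
# Rough, assumed numbers based on the serve/benchmark flags above -- not measured.
max_num_seqs = 64          # --max-num-seqs on the server
max_concurrency = 8        # --max-concurrency on the client
random_input_len = 8500    # --random-input-len on the client

# With chunked prefill disabled, a single prefill step can schedule whole
# prompts, so query.shape[0] can approach max_concurrency * random_input_len
# tokens, while block_tables.shape[0] stays at the number of running sequences.
approx_num_tokens = max_concurrency * random_input_len      # ~68,000
approx_num_seqs = min(max_concurrency, max_num_seqs)         # 8
print(f"tokens vs. sequences: ~{approx_num_tokens // approx_num_seqs}x larger")
```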
const int q_stride,
const int kv_block_stride,
const int kv_head_stride,
float* __restrict__ exp_sums,  // [num_seqs, num_heads, max_num_partitions]
Ok looking at this again, I think the problem is that this file is conflating num_tokens with num_seqs. num_tokens is the outermost dimension in the query and output tensors and num_seqs is the outermost dimension in the block_table. So what I'm saying is that we should update the query/output shape comments to be [num_tokens, ...] and leave the other shape arguments alone.
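To illustrate the distinction, a minimal PyTorch sketch; all shapes and sizes below are made-up examples, not vLLM's actual values:

```python
import torch

# Illustrative shapes only -- the sizes below are assumptions, not vLLM defaults.
num_seqs = 4                        # one row in block_tables per sequence
tokens_per_seq = [100, 1, 1, 1]     # one prefill plus three decodes in a step
num_tokens = sum(tokens_per_seq)    # one row in query/output per scheduled token

num_heads, head_size, max_blocks_per_seq = 8, 128, 32

query = torch.randn(num_tokens, num_heads, head_size)        # [num_tokens, ...]
output = torch.empty_like(query)                              # [num_tokens, ...]
block_tables = torch.zeros(num_seqs, max_blocks_per_seq,
                           dtype=torch.int32)                 # [num_seqs, ...]

# num_tokens can greatly exceed num_seqs, so per-sequence intermediates such as
# exp_sums/max_logits/tmp_out must be sized from block_tables, not from query.
assert query.shape[0] >= block_tables.shape[0]
```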
@gshtras and I will work together to address the comments in the source code. Thank you.
The conclusion is that for V0 and V1 the kernel should be called with different values for num_seqs. But in the kernel itself the value does represent the number of sequences, so we'll revert the comment change, leaving just the V1 callsite change.
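A minimal sketch of what sizing the partition buffers from block_tables (rather than query) looks like at a callsite; the function name, argument names, and partition size are illustrative assumptions, not the exact identifiers or constants used in vLLM:

```python
import torch

def alloc_partition_buffers(query, block_tables, num_heads, head_size,
                            max_seq_len, partition_size=256):
    """Size the partition-reduction buffers from block_tables, not from query.

    Names and the partition size are assumptions for illustration only.
    """
    num_seqs = block_tables.shape[0]          # correct leading dimension
    # num_tokens = query.shape[0]             # the wrong one: can be far larger
    max_num_partitions = (max_seq_len + partition_size - 1) // partition_size

    exp_sums = torch.empty(num_seqs, num_heads, max_num_partitions,
                           dtype=torch.float32, device=query.device)
    max_logits = torch.empty_like(exp_sums)
    tmp_out = torch.empty(num_seqs, num_heads, max_num_partitions, head_size,
                          dtype=query.dtype, device=query.device)
    return exp_sums, max_logits, tmp_out
```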
float* __restrict__ exp_sums,    // [block_tables.size(0), num_heads, max_num_partitions]
float* __restrict__ max_logits,  // [block_tables.size(0), num_heads, max_num_partitions]
scalar_t* __restrict__ out,      // [block_tables.size(0), num_heads, max_num_partitions, head_size]
OUTT* __restrict__ final_out,    // [num_seqs, num_heads, head_size]
It looks like final_out is a dead argument? Meaning, I don't see it used anywhere in this file. Are we sure these kernels are actually used?
Looks like this one is not related to the PR. We can run a round of cleanups in a follow-up.
@SageMoore I hope the above answers your questions.
The first dimension of tmp_out and exp_sums should be inferred from block_tables.size(0), but it was being taken from query.shape[0]. The latter can be much larger than block_tables.size(0), which may cause OOM.
This PR fixes total_num_seq and updates the shape comments.
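For a sense of scale, a hedged estimate of the over-allocation; the head count, head size, partition count, and element width below are assumptions for illustration rather than values taken from the model or kernel:

```python
# All sizes below are assumed for illustration; only the leading dimensions
# reflect the discussion above (prefill-heavy batch vs. --max-num-seqs).
def tmp_out_bytes(leading_dim, num_heads=64, head_size=128,
                  max_num_partitions=64, elem_bytes=4):
    # tmp_out is [leading_dim, num_heads, max_num_partitions, head_size]
    return leading_dim * num_heads * max_num_partitions * head_size * elem_bytes

wrong = tmp_out_bytes(leading_dim=68_000)   # sized from query.shape[0]
right = tmp_out_bytes(leading_dim=64)       # sized from block_tables.size(0)
print(f"{wrong / 2**30:.1f} GiB vs {right / 2**30:.3f} GiB")
```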