[BugFix] Pad input buffers in _dummy_run #26209
Conversation
Code Review
This pull request re-introduces padding for dummy runs, which is a necessary bug fix. The changes correctly replace num_tokens with num_tokens_after_padding in various places. However, I've identified a potential critical issue where num_tokens_after_padding can exceed the allocated buffer size, leading to an out-of-bounds memory access.
vllm/v1/worker/gpu_model_runner.py
```diff
     if (self.supports_mm_inputs
             and not self.model_config.is_encoder_decoder):
         input_ids = None
-        inputs_embeds = self.inputs_embeds.gpu[:num_tokens]
+        inputs_embeds = self.inputs_embeds.gpu[:
+                                               num_tokens_after_padding]
         model_kwargs = {
             **model_kwargs,
             **self._dummy_mm_kwargs(num_reqs),
         }
     elif self.enable_prompt_embeds:
         input_ids = None
-        inputs_embeds = self.inputs_embeds.gpu[:num_tokens]
-        model_kwargs = self._init_model_kwargs(num_tokens)
+        inputs_embeds = self.inputs_embeds.gpu[:
+                                               num_tokens_after_padding]
+        model_kwargs = self._init_model_kwargs(
+            num_tokens_after_padding)
     else:
-        input_ids = self.input_ids.gpu[:num_tokens]
+        input_ids = self.input_ids.gpu[:num_tokens_after_padding]
         inputs_embeds = None

     if self.uses_mrope:
-        positions = self.mrope_positions.gpu[:, :num_tokens]
+        positions = self.mrope_positions.gpu[:, :
+                                             num_tokens_after_padding]
     else:
-        positions = self.positions.gpu[:num_tokens]
+        positions = self.positions.gpu[:num_tokens_after_padding]
```
There is a potential out-of-bounds memory access issue here and in the following lines that use num_tokens_after_padding for slicing. The value of num_tokens_after_padding can exceed self.max_num_tokens, which is the size of buffers like self.input_ids, self.positions, and self.inputs_embeds.
Here's how it can happen in _dummy_run when DBO is enabled:
1. `_dummy_run` is called with `num_tokens` equal to `self.max_num_tokens`.
2. `ubatch_split` is called, which in turn calls `get_dp_padding_ubatch`.
3. `get_dp_padding_ubatch` calculates `num_tokens_padded = round_up(num_tokens, 2)`. If `self.max_num_tokens` is odd, this results in `self.max_num_tokens + 1`.
4. This padded value is used to calculate `num_tokens_per_ubatch`, which is then communicated across DP ranks; the maximum is taken.
5. Back in `_dummy_run`, `num_tokens_after_padding` is calculated from the result of `ubatch_split` and can therefore become `self.max_num_tokens + 1`.
Slicing tensors like self.input_ids.gpu[:num_tokens_after_padding] will then result in an out-of-bounds access, which can lead to memory corruption or a crash. This is a critical issue that needs to be addressed.
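To make the arithmetic concrete, here is a minimal, self-contained sketch of the overflow scenario; the `round_up` helper and the numbers below are illustrative stand-ins, not the actual vLLM code:

```python
def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple, as the DBO padding path does.
    return ((x + multiple - 1) // multiple) * multiple

# Illustrative values: an odd max_num_tokens exposes the problem.
max_num_tokens = 8193          # size of buffers like input_ids / positions
num_tokens = max_num_tokens    # _dummy_run invoked at full capacity

# DBO pads the token count up to an even number so it can be split
# into two micro-batches.
num_tokens_after_padding = round_up(num_tokens, 2)

print(num_tokens_after_padding)                    # 8194
print(num_tokens_after_padding > max_num_tokens)   # True: exceeds the buffer
```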
REVIEWERS, PTAL
I have added an `assert num_tokens_after_padding < self.max_num_tokens` to address this.
A better fix is to round up `self.max_num_tokens` here:

`vllm/v1/worker/gpu_model_runner.py`, line 222 in a42d2df:
`self.max_num_tokens = scheduler_config.max_num_batched_tokens`
But we tend to check max_num_tokens against scheduler_config.max_num_batched_tokens in code and it is a reasonable check. This probably needs to be handled more carefully.
However, we generally don't expect max_num_tokens to be odd, so it might never be an issue. But when it does happen, this assert should catch it.
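As a hedged sketch of that alternative, the buffers could be sized to an even (rounded-up) capacity at allocation time so the DBO padding can never exceed them; the class and buffers below are illustrative stand-ins, not the actual model-runner code:

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

class DummyRunnerBuffers:
    """Toy stand-in for the buffer-owning part of the model runner."""

    def __init__(self, max_num_batched_tokens: int) -> None:
        # Rounding the capacity up to a multiple of 2 means the DBO padding
        # (round_up(num_tokens, 2)) can never exceed the allocated buffers.
        self.max_num_tokens = round_up(max_num_batched_tokens, 2)
        self.input_ids = [0] * self.max_num_tokens  # stand-in for a GPU tensor

    def dummy_run(self, num_tokens: int) -> None:
        num_tokens_after_padding = round_up(num_tokens, 2)
        # With the rounded-up capacity this holds even when the configured
        # limit is odd; with the unrounded capacity it would not.
        assert num_tokens_after_padding <= self.max_num_tokens
        _ = self.input_ids[:num_tokens_after_padding]
```

The trade-off, as noted above, is that any code comparing `max_num_tokens` against `scheduler_config.max_num_batched_tokens` would then see a value that can be one larger than the configured limit.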
Marking this as draft as this might not be the full fix and could introduce bugs in the DBO case. cc @LucasWilkinson @SageMoore @ProExpertProg @ilmarkov @tlrmchlsmth
The biggest potential gotcha here is that num_tokens_after_padding will be divided in half for the DBO case. I suspect you will want to pad up the inputs to the "full" padded amount. The UbatchWrapper will take care of slicing them down to the ubatch sizes.
This is definitely annoying but once #25768 merges we won't have this difference.
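For illustration, a toy sketch of that division of labor: the caller pads the flat inputs to the full padded token count and a wrapper slices them into the two micro-batches. `split_into_ubatches` below is a made-up stand-in, not vLLM's `UbatchWrapper` API:

```python
import torch

def split_into_ubatches(inputs: torch.Tensor, num_tokens_after_padding: int):
    # The caller slices the flat inputs to the *full* padded amount ...
    padded = inputs[:num_tokens_after_padding]
    # ... and the wrapper is responsible for cutting that down into the two
    # micro-batch-sized views used by DBO.
    half = num_tokens_after_padding // 2
    return padded[:half], padded[half:]

input_ids = torch.zeros(16, dtype=torch.long)  # stand-in for input_ids.gpu
ubatch_a, ubatch_b = split_into_ubatches(input_ids, num_tokens_after_padding=8)
print(ubatch_a.shape, ubatch_b.shape)  # torch.Size([4]) torch.Size([4])
```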
Marking this PR as draft (so we don't land it by mistake) as it deadlocks during DBO sanity checks (benchmarking).
Thanks @SageMoore. I noticed the update to …
Resolved IRL
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Force-pushed from 5fe0c52 to f01a7e1.
This looks reasonable, @varun-sundar-rabindranath. Thanks for the fix!
Seems reasonable to me, enabling CI. It would be nice to have a unit test to catalog this behavior as we refactor the model runner.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
Fixes #26137
PR #24845 removed the padding. This PR re-introduces it.
Test Plan
server:
ALL2ALL Backend Naive:
DBO:
DBO + small cudagraph size:
client:
Test Result
`ALL2ALL Backend Naive`, `DBO`, and `DBO + small cudagraph size` don't deadlock.