[BugFix] Pad input buffers in _dummy_run #26209
Conversation
Code Review
This pull request re-introduces padding for dummy runs, which is a necessary bug fix. The changes correctly replace num_tokens with num_tokens_after_padding in various places. However, I've identified a potential critical issue where num_tokens_after_padding can exceed the allocated buffer size, leading to an out-of-bounds memory access.
vllm/v1/worker/gpu_model_runner.py
```diff
     if (self.supports_mm_inputs
             and not self.model_config.is_encoder_decoder):
         input_ids = None
-        inputs_embeds = self.inputs_embeds.gpu[:num_tokens]
+        inputs_embeds = self.inputs_embeds.gpu[:
+                                               num_tokens_after_padding]
         model_kwargs = {
             **model_kwargs,
             **self._dummy_mm_kwargs(num_reqs),
         }
     elif self.enable_prompt_embeds:
         input_ids = None
-        inputs_embeds = self.inputs_embeds.gpu[:num_tokens]
-        model_kwargs = self._init_model_kwargs(num_tokens)
+        inputs_embeds = self.inputs_embeds.gpu[:
+                                               num_tokens_after_padding]
+        model_kwargs = self._init_model_kwargs(
+            num_tokens_after_padding)
     else:
-        input_ids = self.input_ids.gpu[:num_tokens]
+        input_ids = self.input_ids.gpu[:num_tokens_after_padding]
         inputs_embeds = None

     if self.uses_mrope:
-        positions = self.mrope_positions.gpu[:, :num_tokens]
+        positions = self.mrope_positions.gpu[:, :
+                                             num_tokens_after_padding]
     else:
-        positions = self.positions.gpu[:num_tokens]
+        positions = self.positions.gpu[:num_tokens_after_padding]
```
There is a potential out-of-bounds memory access issue here and in the following lines that use num_tokens_after_padding for slicing. The value of num_tokens_after_padding can exceed self.max_num_tokens, which is the size of buffers like self.input_ids, self.positions, and self.inputs_embeds.
Here's how it can happen in _dummy_run when DBO is enabled:
1. `_dummy_run` is called with `num_tokens` equal to `self.max_num_tokens`.
2. `ubatch_split` is called, which in turn calls `get_dp_padding_ubatch`.
3. `get_dp_padding_ubatch` calculates `num_tokens_padded = round_up(num_tokens, 2)`. If `self.max_num_tokens` is odd, this results in `self.max_num_tokens + 1`.
4. This padded value is used to calculate `num_tokens_per_ubatch`, which is then communicated across DP ranks; the maximum is taken.
5. Back in `_dummy_run`, `num_tokens_after_padding` is calculated from the result of `ubatch_split` and can therefore become `self.max_num_tokens + 1`.
Slicing tensors like self.input_ids.gpu[:num_tokens_after_padding] will then result in an out-of-bounds access, which can lead to memory corruption or a crash. This is a critical issue that needs to be addressed.
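To make the arithmetic concrete, here is a minimal, self-contained sketch of the overflow scenario; the `round_up` helper and the numbers below are illustrative stand-ins, not the actual vLLM code:

```python
def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple, as the DBO padding path does.
    return ((x + multiple - 1) // multiple) * multiple

# Illustrative values: an odd max_num_tokens exposes the problem.
max_num_tokens = 8193          # size of buffers like input_ids / positions
num_tokens = max_num_tokens    # _dummy_run invoked at full capacity

# DBO pads the token count up to an even number so it can be split
# into two micro-batches.
num_tokens_after_padding = round_up(num_tokens, 2)

print(num_tokens_after_padding)                    # 8194
print(num_tokens_after_padding > max_num_tokens)   # True: exceeds the buffer
```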
REVIEWERS, PTAL
I have added an `assert num_tokens_after_padding < self.max_num_tokens` to address this.
A better fix is to round up `self.max_num_tokens` here:

`vllm/v1/worker/gpu_model_runner.py`, line 222 in a42d2df:
`self.max_num_tokens = scheduler_config.max_num_batched_tokens`
But we tend to check max_num_tokens against scheduler_config.max_num_batched_tokens in code and it is a reasonable check. This probably needs to be handled more carefully.
However, we generally don't expect max_num_tokens to be odd, so it might never be an issue. But when it does happen, this assert should catch it.
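As a hedged sketch of that alternative, the buffers could be sized to an even (rounded-up) capacity at allocation time so the DBO padding can never exceed them; the class and buffers below are illustrative stand-ins, not the actual model-runner code:

```python
def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

class DummyRunnerBuffers:
    """Toy stand-in for the buffer-owning part of the model runner."""

    def __init__(self, max_num_batched_tokens: int) -> None:
        # Rounding the capacity up to a multiple of 2 means the DBO padding
        # (round_up(num_tokens, 2)) can never exceed the allocated buffers.
        self.max_num_tokens = round_up(max_num_batched_tokens, 2)
        self.input_ids = [0] * self.max_num_tokens  # stand-in for a GPU tensor

    def dummy_run(self, num_tokens: int) -> None:
        num_tokens_after_padding = round_up(num_tokens, 2)
        # With the rounded-up capacity this holds even when the configured
        # limit is odd; with the unrounded capacity it would not.
        assert num_tokens_after_padding <= self.max_num_tokens
        _ = self.input_ids[:num_tokens_after_padding]
```

The trade-off, as noted above, is that any code comparing `max_num_tokens` against `scheduler_config.max_num_batched_tokens` would then see a value that can be one larger than the configured limit.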
Marking this as draft as this might not be the full fix and could introduce bugs in the DBO case. cc @LucasWilkinson @SageMoore @ProExpertProg @ilmarkov @tlrmchlsmth
The biggest potential gotcha here is that num_tokens_after_padding will be divided in half for the DBO case. I suspect you will want to pad up the inputs to the "full" padded amount. The UbatchWrapper will take care of slicing them down to the ubatch sizes.
This is definitely annoying but once #25768 merges we won't have this difference.
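For illustration, a toy sketch of that division of labor: the caller pads the flat inputs to the full padded token count and a wrapper slices them into the two micro-batches. `split_into_ubatches` below is a made-up stand-in, not vLLM's `UbatchWrapper` API:

```python
import torch

def split_into_ubatches(inputs: torch.Tensor, num_tokens_after_padding: int):
    # The caller slices the flat inputs to the *full* padded amount ...
    padded = inputs[:num_tokens_after_padding]
    # ... and the wrapper is responsible for cutting that down into the two
    # micro-batch-sized views used by DBO.
    half = num_tokens_after_padding // 2
    return padded[:half], padded[half:]

input_ids = torch.zeros(16, dtype=torch.long)  # stand-in for input_ids.gpu
ubatch_a, ubatch_b = split_into_ubatches(input_ids, num_tokens_after_padding=8)
print(ubatch_a.shape, ubatch_b.shape)  # torch.Size([4]) torch.Size([4])
```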
Marking this PR as draft (so we don't land it by mistake) as it deadlocks during DBO sanity checks (benchmarking).
Thanks @SageMoore. I noticed the update to …
Resolved IRL
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Force-pushed from 5fe0c52 to f01a7e1.
This looks reasonable, @varun-sundar-rabindranath. Thanks for the fix!
Seems reasonable to me, enabling CI. It would be nice to have a unit test to catalog this behavior as we refactor the model runner.
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
Fixes #26137
PR #24845 removed the padding. This PR re-introduces it.
Test Plan
server:
ALL2ALL Backend Naive:
DBO:
DBO + small cudagraph size:
client:
Test Result
`ALL2ALL Backend Naive`, `DBO`, and `DBO + small cudagraph size` don't deadlock.