[BugFix][ModelRunner] properly handle unused buffer #25380
base: main
Conversation
Signed-off-by: Yan Ma <yan.ma@intel.com>
@youkaichao @sighingnow can you help take a look at this? Would following the Gemini comments be enough, or is there a better fix? Thanks.
Code Review
This pull request aims to fix an out-of-memory error on XPU platforms by releasing unused GPU buffers when speculative decoding is not active for a batch. While this frees memory, it introduces a critical bug: if a non-speculative batch is followed by a speculative one, the application will crash when trying to access the deallocated buffers. My review includes a comment explaining the issue and providing a solution to prevent this crash by re-initializing the buffers when needed.
self.num_draft_tokens.gpu = None
self.num_accepted_tokens.gpu = None
Setting .gpu to None here can lead to a crash in subsequent steps if speculative decoding is enabled but a batch without speculative tokens is followed by one with them. When use_spec_decode becomes True again, calls to copy_to_gpu() will fail with an AttributeError because they expect .gpu to be a tensor, not None.
To fix this, you should ensure the GPU tensors are re-initialized if they are None before they are used. This should be done in the else block where use_spec_decode is true, before copy_to_gpu() is called.
For example, in the else block around line 1104:
if self.num_draft_tokens.gpu is None:
    self.num_draft_tokens.gpu = self.num_draft_tokens.cpu.to(self.device)
self.num_draft_tokens.copy_to_gpu()

And similarly for num_accepted_tokens around line 1125:

if self.num_accepted_tokens.gpu is None:
    self.num_accepted_tokens.gpu = self.num_accepted_tokens.cpu.to(self.device)
self.num_accepted_tokens.copy_to_gpu()

Without this change, this PR introduces a latent bug that can cause crashes in mixed speculative/non-speculative workloads.
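To make the failure mode concrete, here is a minimal, self-contained sketch of the crash and the suggested guard. PairedBuffer is an illustrative stand-in for vLLM's CpuGpuBuffer, not the real class, and "cpu" stands in for the actual device:

import torch

class PairedBuffer:
    def __init__(self, size: int, device: str) -> None:
        self.device = device
        self.cpu = torch.zeros(size, dtype=torch.int32)
        self.gpu = torch.zeros(size, dtype=torch.int32, device=device)

    def copy_to_gpu(self) -> torch.Tensor:
        # Fails with AttributeError on NoneType if .gpu has been set to None.
        return self.gpu.copy_(self.cpu, non_blocking=True)

buf = PairedBuffer(8, device="cpu")
buf.gpu = None          # what the non-spec-decode path in this PR does
try:
    buf.copy_to_gpu()   # AttributeError: 'NoneType' object has no attribute 'copy_'
except AttributeError:
    # The suggested guard: re-create the device tensor before copying.
    buf.gpu = buf.cpu.to(buf.device)
    buf.copy_to_gpu()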
Can you provide an example that was breaking and is now fixed, as well as a test case to prevent further regression?
@robertgshaw2-redhat, Yan mentioned in the description that this is for Qwen3-Next running on XPU. Since these two buffers are not used in the non-spec-decode case, it does make sense to set them to null, right? @yma11, could you help provide a solid example?
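One possible shape for the regression test requested above, sketched with plain tensors rather than vLLM's runner or test harness (the function name and the alternating-batch pattern are illustrative assumptions):

import torch

def test_mixed_spec_and_non_spec_batches():
    cpu = torch.zeros(8, dtype=torch.int32)    # stand-in for the CPU side
    gpu = torch.zeros(8, dtype=torch.int32)    # "GPU" side, kept on CPU here
    for use_spec_decode in (False, True, False, True):
        if not use_spec_decode:
            gpu = None                         # non-spec batch releases the buffer
        else:
            if gpu is None:                    # spec batch must re-initialize it
                gpu = torch.empty_like(cpu)
            gpu.copy_(cpu)                     # would raise AttributeError if None
    assert gpu is not None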
… nixl (#289): upstream PR introduced the crash: vllm-project/vllm#25380. Root cause: after this PR, nixl < 0.5.1 will not work because it lacks num_threads argument support. However, pip install nixl >= 0.5.1 will also not work because the wheel-provided UCX is compiled with a hard dependency on libcuda.so. Solution is to install nixl by building from source. Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Thanks @yma11. I think in the non-spec-decode case we can avoid creating these buffers. Also, your branch is quite out of date; it looks like …
logits_indices = query_start_loc[1:] - 1
num_draft_tokens = None
spec_decode_metadata = None
self.num_draft_tokens.gpu = None
I would prefer to make the change in the init function for num_decode_draft_tokens and num_accepted_tokens:
self.num_decode_draft_tokens = self._make_buffer(
self.max_num_reqs, dtype=torch.int32
) if self.speculative_config is not None else None
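A minimal sketch of the conditional-allocation pattern this suggests; the RunnerSketch class, the simplified _make_buffer, and the prepare_inputs method are illustrative assumptions, not vLLM's actual GPUModelRunner:

import torch

class RunnerSketch:
    def __init__(self, max_num_reqs: int, speculative_config=None) -> None:
        self.max_num_reqs = max_num_reqs
        self.speculative_config = speculative_config
        # Only allocate the buffers when speculative decoding is configured.
        self.num_decode_draft_tokens = self._make_buffer(
            self.max_num_reqs, dtype=torch.int32
        ) if self.speculative_config is not None else None
        self.num_accepted_tokens = self._make_buffer(
            self.max_num_reqs, dtype=torch.int32
        ) if self.speculative_config is not None else None

    def _make_buffer(self, size: int, dtype: torch.dtype) -> torch.Tensor:
        # Simplified: the real helper returns a paired CPU/GPU buffer.
        return torch.zeros(size, dtype=dtype)

    def prepare_inputs(self, use_spec_decode: bool) -> None:
        # Downstream sites must tolerate the buffers being absent.
        if use_spec_decode and self.num_decode_draft_tokens is not None:
            pass  # fill the buffer and copy it to the device here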
yep that's what I was suggesting
Thanks for the suggestion. I need to confirm this with my workload. Will update soon.
Purpose
#24526 introduced two more CpuGpuBuffer instances, self.num_draft_tokens and self.num_accepted_tokens, for Qwen3-Next support. In the non-spec-decode case they are not actually used, and we found they trigger an out-of-memory (OOM) error on the XPU platform. This PR fixes that and has been verified.
Test Plan
Test Result