
Conversation

@yma11 (Contributor) commented Sep 22, 2025

Purpose

#24526 introduced two more CpuGpuBuffers, self.num_draft_tokens and self.num_accepted_tokens, for Qwen3-Next support. In the non-spec-decode case they are not actually used, but we found that they trigger an out-of-memory (OOM) error on the XPU platform. This PR fixes that and has been verified.
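
For reference, a minimal sketch of the change (the buffer names and released tensors are taken from the diff; the surrounding control flow is simplified and assumed):

# In the model runner's input preparation, non-spec-decode branch (sketch):
spec_decode_metadata = None
# These buffers are only read on the speculative-decoding path, so release
# their device-side tensors instead of keeping unused GPU/XPU memory alive.
self.num_draft_tokens.gpu = None
self.num_accepted_tokens.gpu = None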

Test Plan

Test Result

Signed-off-by: Yan Ma <yan.ma@intel.com>
@yma11 (Contributor, Author) commented Sep 22, 2025

@youkaichao @sighingnow, could you take a look at this? Is following the Gemini comments enough, or is there a better fix? Thanks.

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request aims to fix an out-of-memory error on XPU platforms by releasing unused GPU buffers when speculative decoding is not active for a batch. While this frees memory, it introduces a critical bug: if a non-speculative batch is followed by a speculative one, the application will crash when trying to access the deallocated buffers. My review includes a comment explaining the issue and providing a solution to prevent this crash by re-initializing the buffers when needed.

Comment on lines +1087 to +1088
self.num_draft_tokens.gpu = None
self.num_accepted_tokens.gpu = None

@gemini-code-assist (Contributor) bot:

critical

Setting .gpu to None here can lead to a crash in subsequent steps if speculative decoding is enabled but a batch without speculative tokens is followed by one with them. When use_spec_decode becomes True again, calls to copy_to_gpu() will fail with an AttributeError because they expect .gpu to be a tensor, not None.

To fix this, you should ensure the GPU tensors are re-initialized if they are None before they are used. This should be done in the else block where use_spec_decode is true, before copy_to_gpu() is called.

For example, in the else block around line 1104:

if self.num_draft_tokens.gpu is None:
    self.num_draft_tokens.gpu = self.num_draft_tokens.cpu.to(self.device)
self.num_draft_tokens.copy_to_gpu()

And similarly for num_accepted_tokens around line 1125:

if self.num_accepted_tokens.gpu is None:
    self.num_accepted_tokens.gpu = self.num_accepted_tokens.cpu.to(self.device)
self.num_accepted_tokens.copy_to_gpu()

Without this change, this PR introduces a latent bug that can cause crashes in mixed speculative/non-speculative workloads.

@robertgshaw2-redhat changed the title from "properly handle unused buffer" to "[BugFix][ModelRunner] properly handle unused buffer" on Sep 22, 2025
@robertgshaw2-redhat (Collaborator):

Can you provide an example that was breaking and is now fixed, as well as a test case to prevent further regressions?

@xuechendi (Contributor) commented Sep 29, 2025

Can you provide an example that was breaking and is now fixed, as well as a test case to prevent further regressions?

@robertgshaw2-redhat, Yan mentioned in the description that this is for Qwen3-Next running on XPU. Since these two buffers are not used in the non-spec-decode case, it does make sense to set them to None, right?

@yma11, could you help provide a solid example?

xuechendi added a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 29, 2025
… nixl (#289)

Upstream PR introduced the crash: vllm-project/vllm#25380

Root cause:

After this PR, nixl < 0.5.1 will not work because it lacks num_threads
argument support. However, pip install nixl >= 0.5.1 will also not work
because the wheel-provided UCX is compiled with a hard dependency on
libcuda.so.

The solution is to install nixl by building from source.

Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
iboiko-habana pushed a commit to iboiko-habana/vllm-gaudi that referenced this pull request Oct 2, 2025
(same commit message as above, additionally signed off by Iryna Boiko <iboiko@habana.ai>)

@njhill (Member) commented Oct 27, 2025

Thanks @yma11. I think in the non-spec-decode case we can avoid creating these CpuGpuBuffers altogether?

Also, your branch is quite out of date; it looks like self.num_draft_tokens has now been renamed to self.num_decode_draft_tokens...

logits_indices = query_start_loc[1:] - 1
num_draft_tokens = None
spec_decode_metadata = None
self.num_draft_tokens.gpu = None

Collaborator:

I would prefer to make the change in the init function, for num_decode_draft_tokens and num_accepted_tokens:

        self.num_decode_draft_tokens = (
            self._make_buffer(self.max_num_reqs, dtype=torch.int32)
            if self.speculative_config is not None
            else None
        )
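
If the buffers become optional like this, the call sites would also need a guard before touching them; a rough sketch of what that could look like (the exact call sites and surrounding code are assumptions, not the actual vLLM source):

        # Skip when speculative decoding is not configured, since the
        # buffer is None and was never allocated in that case.
        if self.num_decode_draft_tokens is not None:
            self.num_decode_draft_tokens.copy_to_gpu()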

@njhill (Member):

yep that's what I was suggesting

@yma11 (Contributor, Author):

Thanks for the suggestion. I need to confirm this with my workload and will update soon.
