
Conversation

@yma11 (Contributor) commented Sep 22, 2025

Purpose

#24526 introduced two more CpuGpuBuffers, self.num_draft_tokens and self.num_accepted_tokens, for Qwen3-Next support. In the non-spec-decode case they are not actually used, but we found that they trigger an out-of-memory (OOM) error on the XPU platform. This PR fixes that and has been verified.
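
For reference, a minimal sketch of the change (the buffer names and released tensors are taken from the diff; the surrounding control flow is simplified and assumed):

# In the model runner's input preparation, non-spec-decode branch (sketch):
spec_decode_metadata = None
# These buffers are only read on the speculative-decoding path, so release
# their device-side tensors instead of keeping unused GPU/XPU memory alive.
self.num_draft_tokens.gpu = None
self.num_accepted_tokens.gpu = None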

Test Plan

Test Result

Signed-off-by: Yan Ma <yan.ma@intel.com>
@yma11 (Contributor, Author) commented Sep 22, 2025

@youkaichao @sighingnow, could you take a look at this? Is following the Gemini comments enough, or is there a better fix? Thanks.

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request aims to fix an out-of-memory error on XPU platforms by releasing unused GPU buffers when speculative decoding is not active for a batch. While this frees memory, it introduces a critical bug: if a non-speculative batch is followed by a speculative one, the application will crash when trying to access the deallocated buffers. My review includes a comment explaining the issue and providing a solution to prevent this crash by re-initializing the buffers when needed.

Comment on lines +1087 to +1088
self.num_draft_tokens.gpu = None
self.num_accepted_tokens.gpu = None

@gemini-code-assist (Contributor) bot:

critical

Setting .gpu to None here can lead to a crash in subsequent steps if speculative decoding is enabled but a batch without speculative tokens is followed by one with them. When use_spec_decode becomes True again, calls to copy_to_gpu() will fail with an AttributeError because they expect .gpu to be a tensor, not None.

To fix this, you should ensure the GPU tensors are re-initialized if they are None before they are used. This should be done in the else block where use_spec_decode is true, before copy_to_gpu() is called.

For example, in the else block around line 1104:

if self.num_draft_tokens.gpu is None:
    self.num_draft_tokens.gpu = self.num_draft_tokens.cpu.to(self.device)
self.num_draft_tokens.copy_to_gpu()

And similarly for num_accepted_tokens around line 1125:

if self.num_accepted_tokens.gpu is None:
    self.num_accepted_tokens.gpu = self.num_accepted_tokens.cpu.to(self.device)
self.num_accepted_tokens.copy_to_gpu()

Without this change, this PR introduces a latent bug that can cause crashes in mixed speculative/non-speculative workloads.

@robertgshaw2-redhat changed the title from "properly handle unused buffer" to "[BugFix][ModelRunner] properly handle unused buffer" on Sep 22, 2025
@robertgshaw2-redhat (Collaborator):

Can you provide an example that was breaking and is now fixed, as well as a test case to prevent further regressions?

@xuechendi (Contributor) commented Sep 29, 2025

Can you provide an example that was breaking and is now fixed, as well as a test case to prevent further regressions?

@robertgshaw2-redhat, Yan mentioned in the description that this is for Qwen3-Next running on XPU. Since these two buffers are not used in the non-spec-decode case, it does make sense to set them to None, right?

@yma11, could you help provide a solid example?

xuechendi added a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 29, 2025
… nixl (#289)

Upstream PR introduced the crash: vllm-project/vllm#25380

Root cause:

After this PR, nixl < 0.5.1 will not work because it lacks num_threads
argument support. However, pip install nixl >= 0.5.1 will also not work
because the wheel-provided UCX is compiled with a hard dependency on
libcuda.so.

The solution is to install nixl by building from source.

Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
iboiko-habana pushed a commit to iboiko-habana/vllm-gaudi that referenced this pull request Oct 2, 2025
(same commit message as above, additionally signed off by Iryna Boiko <iboiko@habana.ai>)

@njhill (Member) commented Oct 27, 2025

Thanks @yma11. I think in the non-spec-decode case we can avoid creating these CpuGpuBuffers altogether?

Also, your branch is quite out of date; it looks like self.num_draft_tokens has now been renamed to self.num_decode_draft_tokens...

logits_indices = query_start_loc[1:] - 1
num_draft_tokens = None
spec_decode_metadata = None
self.num_draft_tokens.gpu = None

Collaborator:

I would prefer to make the change in the init function, for num_decode_draft_tokens and num_accepted_tokens:

        self.num_decode_draft_tokens = (
            self._make_buffer(self.max_num_reqs, dtype=torch.int32)
            if self.speculative_config is not None
            else None
        )
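
If the buffers become optional like this, the call sites would also need a guard before touching them; a rough sketch of what that could look like (the exact call sites and surrounding code are assumptions, not the actual vLLM source):

        # Skip when speculative decoding is not configured, since the
        # buffer is None and was never allocated in that case.
        if self.num_decode_draft_tokens is not None:
            self.num_decode_draft_tokens.copy_to_gpu()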

@njhill (Member):

yep that's what I was suggesting

@yma11 (Contributor, Author):

Thanks for the suggestion. I need to confirm this with my workload and will update soon.
