vllm/v1/worker/gpu_model_runner.py (2 additions, 0 deletions)
@@ -1084,6 +1084,8 @@ def _prepare_inputs(
logits_indices = query_start_loc[1:] - 1
num_draft_tokens = None
spec_decode_metadata = None
+ self.num_draft_tokens.gpu = None
Collaborator

I'd prefer to make this change in the init function for `num_decode_draft_tokens` and `num_accepted_tokens`:

```python
self.num_decode_draft_tokens = (
    self._make_buffer(self.max_num_reqs, dtype=torch.int32)
    if self.speculative_config is not None
    else None
)
```
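The same guard would presumably apply to `num_accepted_tokens`. A minimal sketch of the combined init-time change, assuming `_make_buffer` keeps its current signature (illustrative, not a final patch):

```python
# Sketch for GPUModelRunner.__init__: allocate both spec-decode
# buffers once, only when speculative decoding is configured, so the
# per-step code never has to reset their .gpu attributes to None.
if self.speculative_config is not None:
    self.num_decode_draft_tokens = self._make_buffer(
        self.max_num_reqs, dtype=torch.int32)
    self.num_accepted_tokens = self._make_buffer(
        self.max_num_reqs, dtype=torch.int32)
else:
    self.num_decode_draft_tokens = None
    self.num_accepted_tokens = None
```

With the buffers gated at construction, the per-step path can simply skip them when `self.speculative_config is None` instead of mutating their `.gpu` attribute.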

Member

Yep, that's what I was suggesting.

Contributor Author

Thanks for the suggestion. I need to confirm this with my workload. Will update soon.

+ self.num_accepted_tokens.gpu = None
Comment on lines +1087 to +1088
Contributor

**critical**

Setting `.gpu` to `None` here can lead to a crash in subsequent steps if speculative decoding is enabled but a batch without speculative tokens is followed by one with them. When `use_spec_decode` becomes `True` again, calls to `copy_to_gpu()` will fail with an `AttributeError` because they expect `.gpu` to be a tensor, not `None`.

To fix this, you should ensure the GPU tensors are re-initialized if they are `None` before they are used. This should be done in the `else` block where `use_spec_decode` is true, before `copy_to_gpu()` is called.

For example, in the `else` block around line 1104:

```python
if self.num_draft_tokens.gpu is None:
    self.num_draft_tokens.gpu = self.num_draft_tokens.cpu.to(self.device)
self.num_draft_tokens.copy_to_gpu()
```

And similarly for `num_accepted_tokens` around line 1125:

```python
if self.num_accepted_tokens.gpu is None:
    self.num_accepted_tokens.gpu = self.num_accepted_tokens.cpu.to(self.device)
self.num_accepted_tokens.copy_to_gpu()
```

Without this change, this PR introduces a latent bug that can cause crashes in mixed speculative/non-speculative workloads.
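
To make the failure concrete, here is a self-contained repro sketch; the `CpuGpuBuffer` below is a simplified stand-in for the real paired-buffer class behind `_make_buffer` (assumed shape, and the exact exception depends on the real `copy_to_gpu` implementation):

```python
import torch

class CpuGpuBuffer:
    # Simplified stand-in: the real vLLM class keeps a (pinned) CPU
    # tensor and a device tensor and copies between them.
    def __init__(self, n: int, dtype: torch.dtype, device: str = "cpu"):
        self.cpu = torch.zeros(n, dtype=dtype)
        self.gpu = torch.zeros(n, dtype=dtype, device=device)

    def copy_to_gpu(self) -> torch.Tensor:
        # Assumes self.gpu is a tensor; fails if an earlier step
        # replaced it with None.
        self.gpu.copy_(self.cpu, non_blocking=True)
        return self.gpu

buf = CpuGpuBuffer(8, torch.int32)
buf.gpu = None     # what this diff does on a step without spec tokens
buf.copy_to_gpu()  # AttributeError: 'NoneType' object has no attribute 'copy_'
```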

else:
# Get the number of draft tokens for each request.
# Iterate over the dictionary rather than all requests since not all