Commit 1116b82

lhtin authored and pdasigi committed
[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)
Signed-off-by: Lehua Ding <lehuading@tencent.com>
1 parent 226073e commit 1116b82

1 file changed: +1 -1 lines changed

vllm/v1/worker/gpu_model_runner.py

Lines changed: 1 addition & 1 deletion
@@ -2478,7 +2478,7 @@ def propose_draft_token_ids(sampled_token_ids):
                 effective_drafter_max_model_len = (
                     self.speculative_config.draft_model_config.max_model_len)
             input_fits_in_drafter = spec_decode_common_attn_metadata and (
-                spec_decode_common_attn_metadata.seq_lens.max() +
+                spec_decode_common_attn_metadata.max_seq_len +
                 self.speculative_config.num_speculative_tokens
                 <= effective_drafter_max_model_len)
             if use_padded_batch_for_eagle and input_fits_in_drafter:
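
The one-line change above can be illustrated with a small, torch-free sketch. It assumes, going by the commit title, that `seq_lens` lives on the GPU (so using `.max()` in a Python comparison forces a device-to-host sync) while `max_seq_len` is already available on the host. `FakeDeviceTensor`, `FakeAttnMetadata`, `build_metadata`, `fits_before`, and `fits_after` are hypothetical stand-ins written for this note, not vLLM APIs.

```python
from dataclasses import dataclass
from typing import List


class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor that counts host read-backs."""

    def __init__(self, values: List[int]) -> None:
        self._values = list(values)
        self.host_syncs = 0

    def max(self) -> int:
        # With a real GPU tensor, using this result in a Python comparison
        # would make the host wait for the device. Here we only count it.
        self.host_syncs += 1
        return max(self._values)


@dataclass
class FakeAttnMetadata:
    """Hypothetical metadata holding both forms of the same information."""
    seq_lens: FakeDeviceTensor  # per-request lengths, assumed to be on the device
    max_seq_len: int            # assumed to be computed on the host up front


def build_metadata(host_seq_lens: List[int]) -> FakeAttnMetadata:
    # The maximum is taken once on the host while the batch is assembled.
    return FakeAttnMetadata(
        seq_lens=FakeDeviceTensor(host_seq_lens),
        max_seq_len=max(host_seq_lens),
    )


def fits_before(meta: FakeAttnMetadata, num_spec_tokens: int,
                drafter_max_len: int) -> bool:
    # Shape of the old check: reduce on the device, then read the result back.
    return meta.seq_lens.max() + num_spec_tokens <= drafter_max_len


def fits_after(meta: FakeAttnMetadata, num_spec_tokens: int,
               drafter_max_len: int) -> bool:
    # Shape of the new check: reuse the host-side value, no read-back.
    return meta.max_seq_len + num_spec_tokens <= drafter_max_len


meta = build_metadata([5, 12, 9])
old_result = fits_before(meta, num_spec_tokens=3, drafter_max_len=16)
syncs_after_old = meta.seq_lens.host_syncs
new_result = fits_after(meta, num_spec_tokens=3, drafter_max_len=16)
syncs_after_new = meta.seq_lens.host_syncs
```

Both checks give the same answer in this sketch. Only the old form touches the simulated device tensor, which is the sync the commit aims to avoid.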
