[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler #13594

Merged: 11 commits, Feb 22, 2025
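For context on the title: the fix makes profile_run() exercise the V0-style sampler on dummy logits so that peak-memory profiling also accounts for the sampler's allocations. Below is a standalone, hypothetical sketch of that measurement idea; it is not vLLM code, and values such as num_reqs and vocab_size are assumptions chosen only for illustration.

    # Hypothetical, standalone illustration of what the profiling run tries to
    # capture: the extra GPU memory a worst-case sampling pass allocates on top
    # of the logits themselves. Not vLLM code; num_reqs/vocab_size are made up.
    import torch


    def peak_sampling_memory(num_reqs: int = 256,
                             vocab_size: int = 32000,
                             device: str = "cuda") -> int:
        logits = torch.randn(num_reqs, vocab_size, device=device)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        # Mirror the dummy metadata's worst-case-ish settings: temperature 0.5
        # and top-k over (almost) the whole vocabulary.
        values, indices = logits.topk(vocab_size - 1, dim=-1)
        probs = torch.softmax(values / 0.5, dim=-1)
        _sampled = indices.gather(-1, torch.multinomial(probs, num_samples=1))
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated()


    if __name__ == "__main__":
        peak = peak_sampling_memory()
        print(f"peak GPU memory during dummy sampling: {peak / (1 << 30):.2f} GiB")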
32 changes: 30 additions & 2 deletions vllm/v1/worker/gpu_model_runner.py
@@ -31,6 +31,7 @@
 from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig,
                                         KVCacheSpec)
 from vllm.v1.outputs import LogprobsTensors, ModelRunnerOutput
+from vllm.v1.sample.metadata import SamplingMetadata
 from vllm.v1.sample.rejection_sampler import INVALID_TOKEN_ID
 from vllm.v1.spec_decode.ngram_proposer import NgramProposer
 from vllm.v1.utils import bind_kv_cache
@@ -1303,11 +1304,38 @@ def profile_run(self) -> None:
         if get_pp_group().is_last_rank:
             hidden_states = hidden_states[logit_indices]
             logits = self.model.compute_logits(hidden_states, None)
-            # TODO(woosuk): Consider the memory usage of the sampler.
+            penalties = torch.full((num_reqs, ), 0.0, device=self.device)
+            dummy_metadata = SamplingMetadata(
+                temperature=torch.full((num_reqs, ),
+                                       0.5,
+                                       device=self.device),
+                all_greedy=False,
+                all_random=False,
+                spec_token_ids=None,
+                top_p=torch.full((num_reqs, ), 0.99, device=self.device),
+                top_k=torch.full((num_reqs, ),
+                                 logits.size(1) - 1,
+                                 device=self.device),
+                min_p=None,
+                generators={},
+                max_num_logprobs=None,
+                no_penalties=True,
+                prompt_token_ids=None,
+                frequency_penalties=penalties,
+                presence_penalties=penalties,
+                repetition_penalties=penalties,
@ywang96 (Member) commented on Feb 20, 2025:

Let's update these since we shouldn't be using the same tensor object for all three of them.

(A sketch of that change follows the diff.)
+                output_token_ids=[[] for _ in range(num_reqs)],
+                min_tokens={},
+                logit_bias=[None for _ in range(num_reqs)])
+            sampler_output = self.model.sample(
+                logits=logits, sampling_metadata=dummy_metadata)
         else:
             logits = None
+            sampler_output = None
+            penalties = None
+            dummy_metadata = None
         torch.cuda.synchronize()
-        del hidden_states, logits
+        del hidden_states, logits, sampler_output, penalties, dummy_metadata
         self.encoder_cache.clear()
         gc.collect()
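Following up on @ywang96's note above, here is a minimal sketch, with assumed shapes and not the merged code, of how the three penalty fields could be given distinct tensors so that an in-place update to one can never leak into the others:

    import torch

    num_reqs = 8                                   # assumed request count
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Additive frequency/presence penalties are neutral at 0.0; the
    # multiplicative repetition penalty is neutral at 1.0. Each field gets its
    # own tensor object instead of sharing a single `penalties` tensor.
    frequency_penalties = torch.zeros(num_reqs, device=device)
    presence_penalties = torch.zeros(num_reqs, device=device)
    repetition_penalties = torch.ones(num_reqs, device=device)

Allocating three tensors costs a few extra kilobytes at most, and it keeps the dummy metadata shaped like real request metadata, where these fields are independent.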
