[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler #13594
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks for the contribution! It seems that this PR adds the sampling logic we have in V0's profile_run into V1, but the rejection sampler is not considered here (which is fine for this PR imo).
Which script did you run to get the # blocks difference in the PR description?
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Force-pushed from fa6cbff to 185d41f
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
cc @njhill In case you want to review this PR too
vllm/v1/worker/gpu_model_runner.py
Outdated
frequency_penalties=penalties,
presence_penalties=penalties,
repetition_penalties=penalties,
Let's update these since we shouldn't be using the same tensor object for all three of them.
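For context, a minimal sketch of what the suggested fix could look like, allocating an independent dummy tensor per penalty type (num_reqs and device are placeholder names for illustration, not the PR's actual code):

import torch

# hypothetical profiling sizes, for illustration only
num_reqs = 256
device = "cuda"

# one independent tensor per penalty type, instead of reusing a single object;
# 0.0 is the no-op value for frequency/presence penalties, 1.0 for repetition
frequency_penalties = torch.zeros(num_reqs, dtype=torch.float32, device=device)
presence_penalties = torch.zeros(num_reqs, dtype=torch.float32, device=device)
repetition_penalties = torch.ones(num_reqs, dtype=torch.float32, device=device)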
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Thanks @JenZhao this looks good. My only hesitation is that the penalties sampler implementation still needs to be optimized further and including them here may mean more memory ends up being reserved than necessary for a typical case, harming throughput. For a max batch size of 1k, this could be an overhead of more than 2GB just for penalties sampling. I'm not sure there's a good answer for this though since in theory you could hit OOM in some cases otherwise. I guess it may be better to merge this and then try to address the inefficiency afterwards asap. Also cc @WoosukKwon for thoughts.
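As a rough back-of-the-envelope check of that figure (assuming a ~128k-token vocabulary and int64 count tensors; these numbers are illustrative, not measured from this PR):

# hypothetical sizes: Llama-3.1-style vocabulary, max batch size of 1k
batch_size = 1024
vocab_size = 128_256
bytes_per_elem = 8  # int64

# penalty application typically needs per-request token counts over the whole
# vocab, e.g. one [batch, vocab] tensor for prompt tokens and one for outputs
one_tensor_gib = batch_size * vocab_size * bytes_per_elem / 1024**3
print(f"{one_tensor_gib:.2f} GiB per count tensor, ~{2 * one_tensor_gib:.2f} GiB for two")
# -> roughly 0.98 GiB each, about 2 GiB total, in line with the concern above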
oh no, I did an incorrect rebase. Fixing now.
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Force-pushed from 88173f5 to 6dff92e
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
@JenZhao Thanks again for your contribution. I'm giving this PR a greenlight but please take a look at my comment!
vllm/v1/worker/gpu_model_runner.py
Outdated
min_p=None,
generators={},
max_num_logprobs=None,
no_penalties=False,
I did some peak memory analysis with / without this PR when running
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json
on 1x H100:
1. Main (no sampler): 75998MiB / 81559MiB
2. This PR (no penalties, prompt_token_ids=None): 73536MiB / 81559MiB
3. This PR (no penalties, prompt_token_ids=torch.ones_like(logits, dtype=torch.long)): 72560MiB / 81559MiB
4. This PR (with penalties): 70046MiB / 81559MiB
Given that we're setting the default GMU (gpu_memory_utilization) to 0.9, which already gives some room for inference with penalties, I think it's ok if we go with the third option (profiling with no_penalties=True).
It is indeed weird, though, that we're seeing this memory bump in the actual inference workload when no penalties were applied (scenario 2 vs scenario 3), so perhaps the prompt_token_ids (or a tensor of the same size) are used somewhere but not captured during profile_run.
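For context, a simplified sketch of how the profiled peak typically feeds into KV cache sizing; the formula is a simplification and the peak value below is an assumed placeholder, not a number from this thread:

# illustrative numbers: 81559 MiB total (the H100 above), vLLM's default 0.9 utilization
total_mib = 81_559
gpu_memory_utilization = 0.9
profiled_peak_mib = 20_000  # assumed non-KV-cache peak measured during profile_run

# anything the sampler allocates at runtime but profile_run does not capture
# inflates this budget, which is how a real workload can later hit OOM
kv_cache_budget_mib = total_mib * gpu_memory_utilization - profiled_peak_mib
print(f"~{kv_cache_budget_mib:.0f} MiB left for KV cache blocks")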
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
@JenZhao Congrats on your first PR!
@JenZhao @ywang96 I've observed that the peak memory goes well beyond 90%, even after this PR.
I encountered this error while running
BTW, weirdly, the memory taken by the vLLM instance got reduced when the server finished handling all the requests and became idle.
As we do not call
This PR integrates the sampling logic from the V0 version of profile_run into the V1 memory profiling tool. Note that the rejection sampler is not included.
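For illustration, a minimal sketch of the idea, with field names taken from the review snippets above; the helper, values, and shapes are placeholders rather than the PR's actual code:

import torch

def dummy_sampler_inputs(num_reqs: int, vocab_size: int, device: str = "cuda"):
    """Illustrative only: dummy per-request sampling parameters for profiling."""
    def filled(value: float) -> torch.Tensor:
        return torch.full((num_reqs,), value, dtype=torch.float32, device=device)

    return dict(
        # separate tensors per penalty, as requested in review
        frequency_penalties=filled(0.1),
        presence_penalties=filled(0.1),
        repetition_penalties=filled(1.1),
        min_p=None,
        generators={},
        max_num_logprobs=None,
        no_penalties=False,
        # vocab-sized dummy prompt token ids so penalty buffers get materialized
        prompt_token_ids=torch.ones(
            (num_reqs, vocab_size), dtype=torch.long, device=device
        ),
    )

Running the sampler on the dummy logits with inputs like these during profile_run is what makes the reserved KV cache budget account for sampler overhead.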
Testing on an H100 GPU with CUDA 12.4 with
VLLM_USE_V1=1 python3 examples/offline_inference/basic/basic.py
- before the change
- after the change, no_penalties=True, all other parameters set
- after the change, no_penalties=False, all other parameters set
Testing on an H100 GPU with CUDA 12.4 with
VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model NousResearch/Hermes-3-Llama-3.1-8B --dataset /home/jovyan/vllm/ShareGPT_V3_unfiltered_cleaned_split.json
- No sampler available, original main code
- Sampler available, no penalties, prompt token IDs are None
- Sampler available, no penalties, prompt token IDs provided
- Sampler available, penalties applied, prompt token IDs provided
FIX #13507