-
-
Notifications
You must be signed in to change notification settings - Fork 11.3k
Correctly kill vLLM processes after finishing serving benchmarks #21641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly kill vLLM processes after finishing serving benchmarks #21641
Conversation
Signed-off-by: Huy Do <huydhn@gmail.com>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
The pull request correctly identifies the need to terminate vLLM processes by their new name to fix the CI pipeline. However, the implementation can be made more robust to avoid a potential race condition where the pkill command might terminate itself. I've suggested a small change to improve the reliability of the script.
| pkill -f text-generation | ||
| pkill -f lmdeploy | ||
| # vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445 | ||
| pkill -f VLLM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using pkill -f VLLM can be unreliable as it might match and kill its own process before it has a chance to terminate the target vLLM processes. This is because the command pkill -f VLLM itself contains the string "VLLM" and will be found by pkill -f.
A more robust and common practice is to use a pattern like [V]LLM. This prevents pkill from matching its own command in the process list, ensuring it only targets the intended vLLM processes and avoiding a potential race condition.
| pkill -f VLLM | |
| pkill -f '[V]LLM' |
Signed-off-by: Huy Do <huydhn@gmail.com>
| pkill -f '[t]ext-generation' | ||
| pkill -f '[l]mdeploy' | ||
| # vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445 | ||
| pkill -f '[V]LLM' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The square bracket is a good sugesstion from Gemini
DarkLight1337
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing!
|
Thanks~ |
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: x22x22 <wadeking@qq.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
…m-project#21641) Signed-off-by: Huy Do <huydhn@gmail.com>
Essential Elements of an Effective PR Description Checklist
Purpose
After #21445, all vLLM processes are now named with VLLM prefix. This breaks the benchmark script where it fails to kill these processes to stop the benchmark. So, CI benchmark jobs wait forever till they are timing out, i.e. https://github.com/pytorch/pytorch-integration-testing/actions/runs/16530638203/job/46754923133
Test Plan
Manually run a benchmark with the code from this PR https://github.com/pytorch/pytorch-integration-testing/actions/runs/16534596580