-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
[Bugfix]check health for engine core process exiting unexpectedly #21728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces two important bug fixes for graceful handling of unexpected process exits in a multiprocessing environment.
- A monitoring thread is added to
MPClientto watch overEngineCoreprocesses. If an engine core process dies, the client is shut down, preventing hangs and ensuring the system state is consistent. - A "death pipe" mechanism is implemented between
EngineCoreand its worker processes. This allows workers to detect if their parentEngineCoreprocess has died, so they can terminate themselves and release resources, preventing orphan processes.
The implementation for both fixes appears robust and well-designed. I've identified one high-severity issue related to the resilience of the new engine core monitor thread, which could fail silently. Otherwise, the changes are clean and address the described issues effectively.
vllm/v1/engine/core_client.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The monitor_engine_cores function lacks exception handling around the multiprocessing.connection.wait(sentinels) call. If this call is interrupted (e.g., by a signal) or raises an OSError, the monitor thread will terminate silently. This would leave the engine core processes unmonitored, defeating the purpose of this bugfix.
To make the monitoring more robust, this call should be wrapped in a try...except block. In case of an exception, it would be safest to log the error and proceed with shutting down the client.
def monitor_engine_cores():
sentinels = [proc.sentinel for proc in engine_processes]
try:
died = multiprocessing.connection.wait(sentinels)
except BaseException as e:
_self = self_ref()
if not _self or _self.resources.engine_dead:
return
logger.error("Error in engine core monitor: %s. Shutting down.", e)
_self.resources.engine_dead = True
_self.shutdown()
return
_self = self_ref()
if not _self or _self.resources.engine_dead:
return
_self.resources.engine_dead = True
proc_name = next(proc.name for proc in engine_processes
if proc.sentinel == died[0])
logger.error(
"Engine core proc %s died unexpectedly, "
"shutting down client.", proc_name)
_self.shutdown()
# Note: For MPClient, we don't have a failure callback mechanism
# like MultiprocExecutor, but we set engine_dead flag which will
# cause subsequent operations to raise EngineDeadErrorSigned-off-by: wuhang <wuhang6@huawei.com>
e8917ac to
695a982
Compare
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry I merged this by mistake, feel free to revert if it causes any problems
Please @ me if any bugs are found related to this PR. |
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: x22x22 <wadeking@qq.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
…lm-project#21728) Signed-off-by: wuhang <wuhang6@huawei.com>
|
Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit. |
In Test Plan,there exists an error kill comand "Kill CoreEngine process by kill -9 {engine_core_pid}." , Becaues "kill -9 {engine_core_pid}" cannot be caught by try...except. It shoud be kill {engine_core_pid}. @wuhang2014 |
I think without this PR, core engine is acting properly when killed with other signals that can be caught. |
I think I can. |
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Purpose
Bug Description
In V1 architecture, EngineCore is a standalone process when specify
--distributed-executor-backend mp. When EngineCore process is killed bykill -9 {engine_core_pid}or some other causes that cannot trigger a rpcENGINE_CORE_DEADmessage from EngineCore process to CoreClient process, the /health API still returns 200 and new requests to api server hang.Solution
Like mechanism in
EngineCoreprecess to monitor worker processes, a monitor of engine core processes thread is created to monitor all engine core processes that core client creates. When one engine is exited unexpectedly, monitor can find this change and finalize resources.Another bug is also fixed: when core engine is dead unexpectedly, worker process will become an orphan. So a pipe between worker process and engine core process is created, and worker process will use the pipe to monitor if parant process is dead, when parent dead, worker process will kill itself to release hardware resources.
Test Plan
--distributed-executor-backend mp./v1/completionsfunctions normally, and/healthreturn code is 200.kill -9 {engine_core_pid}.Test Result
(Optional) Documentation Update