Conversation

@wuhang2014 wuhang2014 commented Jul 28, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Bug Description

In the V1 architecture, EngineCore runs as a standalone process when --distributed-executor-backend mp is specified. When the EngineCore process is killed with kill -9 {engine_core_pid}, or dies from any other cause that does not trigger an ENGINE_CORE_DEAD RPC message from the EngineCore process to the CoreClient process, the /health API still returns 200 and new requests to the API server hang.

Solution

Similar to the mechanism in the EngineCore process that monitors its worker processes, a monitor thread is created in the core client to watch all of the engine core processes it creates. When an engine core exits unexpectedly, the monitor detects this and finalizes the client's resources.
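
For illustration, a minimal self-contained sketch of this monitoring idea follows (not the actual vLLM code; the helper name start_engine_monitor and the on_engine_death callback are invented here, while the real logic lives in the core client and is shown in the review comment further below). It waits on the child processes' sentinels and fires a callback as soon as any of them exits.

    # Sketch only: watch child processes via their sentinels and report the
    # first one that dies. A daemon thread is used so it never blocks shutdown.
    import multiprocessing
    import multiprocessing.connection
    import threading
    import time


    def start_engine_monitor(procs, on_engine_death):
        """Call on_engine_death(proc) as soon as any process in procs exits."""

        def monitor():
            sentinels = [p.sentinel for p in procs]
            # Blocks until at least one sentinel is ready, i.e. a child exited.
            died = multiprocessing.connection.wait(sentinels)
            dead_proc = next(p for p in procs if p.sentinel == died[0])
            on_engine_death(dead_proc)

        t = threading.Thread(target=monitor, daemon=True, name="engine-monitor")
        t.start()
        return t


    if __name__ == "__main__":
        proc = multiprocessing.Process(target=time.sleep, args=(1,), name="EngineCore_0")
        proc.start()
        start_engine_monitor(
            [proc], lambda p: print(f"{p.name} exited, shutting down client"))
        proc.join()
        time.sleep(0.5)  # give the monitor thread time to report before exiting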

Another bug is also fixed: when the core engine dies unexpectedly, its worker processes become orphans. A pipe is therefore created between each worker process and the engine core process, and the worker uses it to detect whether the parent process has died; when the parent is dead, the worker kills itself to release hardware resources.
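
A similarly minimal sketch of the death-pipe idea is shown below, again illustrative rather than the actual vLLM code: worker_main and the demo driver are invented names, the spawn start method is assumed so that only the read end of the pipe reaches the worker, and in the real fix the watch would run in a background thread next to the worker's normal loop. The parent holds the write end and never writes; when the parent dies for any reason, including SIGKILL, the read end sees EOF and the worker terminates itself.

    # Sketch only: a one-way pipe whose write end is held by the parent. EOF on
    # the read end means the parent process is gone.
    import multiprocessing
    import os
    import time


    def worker_main(death_pipe):
        # In a real worker this would run in a background thread; here the
        # whole process just blocks on the pipe for simplicity.
        try:
            death_pipe.recv()          # never receives data; blocks until EOF
        except (EOFError, OSError):
            pass                       # write end closed => parent is dead
        print("Parent process exited, terminating worker", flush=True)
        os._exit(1)                    # release hardware resources held here


    if __name__ == "__main__":
        ctx = multiprocessing.get_context("spawn")  # only the read end reaches the child
        reader, writer = ctx.Pipe(duplex=False)
        worker = ctx.Process(target=worker_main, args=(reader,))
        worker.start()
        reader.close()                 # parent keeps only the write end
        time.sleep(1.0)
        os._exit(0)                    # simulate abrupt parent death (like kill -9)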

Test Plan

  1. Start a vLLM instance with --distributed-executor-backend mp.
  2. Verify that /v1/completions works normally and /health returns 200.
  3. Kill the EngineCore process with kill -9 {engine_core_pid}.
  4. The API server process and all of its child processes exit (a scripted sketch of these steps follows below).
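
The steps above can also be scripted roughly as follows. This is only a hedged sketch: ENGINE_CORE_PID is a placeholder that has to be looked up manually (for example with ps --ppid <api_server_pid> and picking the EngineCore child), and the port and model path are taken from the serve command in the Test Result below.

    # Sketch of the manual test plan, using only the standard library.
    import json
    import os
    import signal
    import urllib.request

    BASE = "http://127.0.0.1:6000"
    ENGINE_CORE_PID = None  # placeholder: substitute the real EngineCore PID

    # Step 2: /health should return 200 and /v1/completions should work.
    assert urllib.request.urlopen(f"{BASE}/health").status == 200
    req = urllib.request.Request(
        f"{BASE}/v1/completions",
        data=json.dumps({"model": "/home/models/Qwen3-0.6B",
                         "prompt": "Hello", "max_tokens": 8}).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).status)

    # Step 3: kill the EngineCore process with SIGKILL.
    assert ENGINE_CORE_PID is not None, "set ENGINE_CORE_PID first"
    os.kill(ENGINE_CORE_PID, signal.SIGKILL)

    # Step 4: with this PR, the API server and all of its children should now
    # exit; without it, /health keeps returning 200 and new requests hang.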

Test Result

(venv) (base) root@ubuntu:/home/wuhang/venv# CUDA_VISIBLE_DEVICES=4 vllm serve /home/models/Qwen3-0.6B --trust-remote-code  --gpu-memory-utilization 0.85 --max-model-len 8192 --max-num-seq 30 --host 0.0.0.0 --port 6000 --distributed-executor-backend mp --enforce-eager
INFO 07-28 23:02:00 [__init__.py:235] Automatically detected platform cuda.
INFO 07-28 23:02:03 [api_server.py:1755] vLLM API server version 0.10.0rc2.dev60+g58b89da7e.d20250725
INFO 07-28 23:02:03 [cli_args.py:261] non-default args: {'model_tag': '/home/models/Qwen3-0.6B', 'host': '0.0.0.0', 'port': 6000, 'model': '/home/models/Qwen3-0.6B', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'distributed_executor_backend': 'mp', 'gpu_memory_utilization': 0.85, 'max_num_seqs': 30}
INFO 07-28 23:02:09 [config.py:1604] Using max model len 8192
INFO 07-28 23:02:10 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-28 23:02:15 [__init__.py:235] Automatically detected platform cuda.
INFO 07-28 23:02:17 [core.py:572] Waiting for init message from front-end.
INFO 07-28 23:02:17 [core.py:71] Initializing a V1 LLM engine (v0.10.0rc2.dev60+g58b89da7e.d20250725) with config: model='/home/models/Qwen3-0.6B', speculative_config=None, tokenizer='/home/models/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/models/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-28 23:02:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-28 23:02:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_3ff98987'), local_subscribe_addr='ipc:///tmp/5ea46bd6-0544-4c3a-b6d1-2ac7cd9261ed', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-28 23:02:21 [__init__.py:235] Automatically detected platform cuda.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_322127d8'), local_subscribe_addr='ipc:///tmp/1a429a0e-9cb9-4596-bfcc-b392b94c9295', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [gpu_model_runner.py:1843] Starting to load model /home/models/Qwen3-0.6B...
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.76it/s]
(VllmWorker rank=0 pid=2120280) 
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [default_loader.py:262] Loading weights took 0.38 seconds
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:26 [gpu_model_runner.py:1892] Model loading took 1.1201 GiB and 0.494173 seconds
(VllmWorker rank=0 pid=2120280) /home/wuhang/venv/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
(VllmWorker rank=0 pid=2120280) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
(VllmWorker rank=0 pid=2120280)   warnings.warn(
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:27 [gpu_worker.py:255] Available KV cache memory: 18.83 GiB
INFO 07-28 23:02:27 [kv_cache_utils.py:833] GPU KV cache size: 176,240 tokens
INFO 07-28 23:02:27 [kv_cache_utils.py:837] Maximum concurrency for 8,192 tokens per request: 21.51x
INFO 07-28 23:02:27 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.54 seconds
INFO 07-28 23:02:28 [loggers.py:141] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 11015
WARNING 07-28 23:02:28 [config.py:1528] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-28 23:02:28 [serving_responses.py:89] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [serving_chat.py:122] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:6000
INFO 07-28 23:02:28 [launcher.py:29] Available routes are:
INFO 07-28 23:02:28 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /health, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /load, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /ping, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /ping, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /version, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /pooling, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /classify, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /score, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /invocations, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [2119430]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:46 [multiproc_executor.py:510] Parent process exited, terminating worker
ERROR 07-28 23:02:49 [core_client.py:526] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
ERROR 07-28 23:02:49 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 07-28 23:02:49 [async_llm.py:416] Traceback (most recent call last):
ERROR 07-28 23:02:49 [async_llm.py:416]   File "/home/wuhang/venv/vllm/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 07-28 23:02:49 [async_llm.py:416]     outputs = await engine_core.get_output_async()
ERROR 07-28 23:02:49 [async_llm.py:416]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-28 23:02:49 [async_llm.py:416]   File "/home/wuhang/venv/vllm/vllm/v1/engine/core_client.py", line 793, in get_output_async
ERROR 07-28 23:02:49 [async_llm.py:416]     raise self._format_exception(outputs) from None
ERROR 07-28 23:02:49 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2119430]
(venv) (base) root@ubuntu:/home/wuhang/venv# 

(Optional) Documentation Update


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important bug fixes for graceful handling of unexpected process exits in a multiprocessing environment.

  1. A monitoring thread is added to MPClient to watch over EngineCore processes. If an engine core process dies, the client is shut down, preventing hangs and ensuring the system state is consistent.
  2. A "death pipe" mechanism is implemented between EngineCore and its worker processes. This allows workers to detect if their parent EngineCore process has died, so they can terminate themselves and release resources, preventing orphan processes.

The implementation for both fixes appears robust and well-designed. I've identified one high-severity issue related to the resilience of the new engine core monitor thread, which could fail silently. Otherwise, the changes are clean and address the described issues effectively.

Comment on lines +526 to +541

Severity: high

The monitor_engine_cores function lacks exception handling around the multiprocessing.connection.wait(sentinels) call. If this call is interrupted (e.g., by a signal) or raises an OSError, the monitor thread will terminate silently. This would leave the engine core processes unmonitored, defeating the purpose of this bugfix.

To make the monitoring more robust, this call should be wrapped in a try...except block. In case of an exception, it would be safest to log the error and proceed with shutting down the client.

        def monitor_engine_cores():
            sentinels = [proc.sentinel for proc in engine_processes]
            try:
                died = multiprocessing.connection.wait(sentinels)
            except BaseException as e:
                _self = self_ref()
                if not _self or _self.resources.engine_dead:
                    return
                logger.error("Error in engine core monitor: %s. Shutting down.", e)
                _self.resources.engine_dead = True
                _self.shutdown()
                return

            _self = self_ref()
            if not _self or _self.resources.engine_dead:
                return
            _self.resources.engine_dead = True
            proc_name = next(proc.name for proc in engine_processes
                             if proc.sentinel == died[0])
            logger.error(
                "Engine core proc %s died unexpectedly, "
                "shutting down client.", proc_name)
            _self.shutdown()
            # Note: For MPClient, we don't have a failure callback mechanism
            # like MultiprocExecutor, but we set engine_dead flag which will
            # cause subsequent operations to raise EngineDeadError

@wuhang2014 wuhang2014 marked this pull request as draft July 28, 2025 08:32
@wuhang2014 wuhang2014 marked this pull request as ready for review July 28, 2025 08:53
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@vllm-bot vllm-bot merged commit bccc43c into vllm-project:main Jul 28, 2025
15 checks passed

@DarkLight1337 DarkLight1337 left a comment


Oops sorry I merged this by mistake, feel free to revert if it causes any problems

@wuhang2014

wuhang2014 commented Jul 28, 2025

Oops sorry I merged this by mistake, feel free to revert if it causes any problems

Please @ me if any bugs are found related to this PR.

@wuhang2014 wuhang2014 deleted the engine_monitor branch July 29, 2025 02:28
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
HsChen-sys pushed a commit to HsChen-sys/vllm that referenced this pull request Aug 1, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
@cadedaniel

Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit.

@ren2504413601

ren2504413601 commented Sep 30, 2025

In the Test Plan there is an error in the kill command "Kill CoreEngine process by kill -9 {engine_core_pid}", because kill -9 {engine_core_pid} cannot be caught by try...except. It should be kill {engine_core_pid}. @wuhang2014


@wuhang2014

wuhang2014 commented Oct 16, 2025

In the Test Plan there is an error in the kill command "Kill CoreEngine process by kill -9 {engine_core_pid}", because kill -9 {engine_core_pid} cannot be caught by try...except. It should be kill {engine_core_pid}. @wuhang2014


I think that even without this PR, the core engine behaves properly when it is killed with other signals that can be caught.
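
For context, here is a small standalone illustration (not vLLM code) of the distinction being discussed: SIGTERM and SIGINT can be intercepted by a Python handler, so a dying engine still gets a chance to report ENGINE_CORE_DEAD, whereas SIGKILL cannot be caught at all and can only be detected from the outside, which is what the process-sentinel monitor and death pipe in this PR provide.

    # Catchable vs. uncatchable signals (illustration only).
    import os
    import signal
    import sys
    import time


    def handler(signum, frame):
        # A SIGTERM/SIGINT handler gets a chance to clean up and notify others.
        print(f"caught signal {signum}, cleaning up before exit", flush=True)
        sys.exit(1)


    signal.signal(signal.SIGTERM, handler)   # `kill <pid>` lands here
    signal.signal(signal.SIGINT, handler)    # Ctrl-C lands here
    # signal.signal(signal.SIGKILL, handler) raises OSError: SIGKILL cannot be
    # caught, blocked, or ignored, so `kill -9 <pid>` gives no chance to react
    # and an external monitor is the only way to notice the death.

    print(f"pid={os.getpid()}; try `kill {os.getpid()}` vs `kill -9 {os.getpid()}`")
    while True:
        time.sleep(1)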

@wuhang2014

Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit.

I think I can.
