Conversation

@wuhang2014 wuhang2014 commented Jul 28, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Bug Description

In the V1 architecture, EngineCore runs as a standalone process when --distributed-executor-backend mp is specified. When the EngineCore process is killed with kill -9 {engine_core_pid}, or dies from any other cause that does not trigger an ENGINE_CORE_DEAD RPC message from the EngineCore process to the CoreClient process, the /health API still returns 200 and new requests to the API server hang.

Solution

Similar to the mechanism in the EngineCore process that monitors its worker processes, a monitor thread is created in the core client to watch all of the engine core processes it creates. When an engine core exits unexpectedly, the monitor detects this and finalizes the client's resources.
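
For illustration, a minimal self-contained sketch of this monitoring idea follows (not the actual vLLM code; the helper name start_engine_monitor and the on_engine_death callback are invented here, while the real logic lives in the core client and is shown in the review comment further below). It waits on the child processes' sentinels and fires a callback as soon as any of them exits.

    # Sketch only: watch child processes via their sentinels and report the
    # first one that dies. A daemon thread is used so it never blocks shutdown.
    import multiprocessing
    import multiprocessing.connection
    import threading
    import time


    def start_engine_monitor(procs, on_engine_death):
        """Call on_engine_death(proc) as soon as any process in procs exits."""

        def monitor():
            sentinels = [p.sentinel for p in procs]
            # Blocks until at least one sentinel is ready, i.e. a child exited.
            died = multiprocessing.connection.wait(sentinels)
            dead_proc = next(p for p in procs if p.sentinel == died[0])
            on_engine_death(dead_proc)

        t = threading.Thread(target=monitor, daemon=True, name="engine-monitor")
        t.start()
        return t


    if __name__ == "__main__":
        proc = multiprocessing.Process(target=time.sleep, args=(1,), name="EngineCore_0")
        proc.start()
        start_engine_monitor(
            [proc], lambda p: print(f"{p.name} exited, shutting down client"))
        proc.join()
        time.sleep(0.5)  # give the monitor thread time to report before exiting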

Another bug is also fixed: when the core engine dies unexpectedly, its worker processes become orphans. A pipe is therefore created between each worker process and the engine core process, and the worker uses it to detect whether the parent process has died; when the parent is dead, the worker kills itself to release hardware resources.
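
A similarly minimal sketch of the death-pipe idea is shown below, again illustrative rather than the actual vLLM code: worker_main and the demo driver are invented names, the spawn start method is assumed so that only the read end of the pipe reaches the worker, and in the real fix the watch would run in a background thread next to the worker's normal loop. The parent holds the write end and never writes; when the parent dies for any reason, including SIGKILL, the read end sees EOF and the worker terminates itself.

    # Sketch only: a one-way pipe whose write end is held by the parent. EOF on
    # the read end means the parent process is gone.
    import multiprocessing
    import os
    import time


    def worker_main(death_pipe):
        # In a real worker this would run in a background thread; here the
        # whole process just blocks on the pipe for simplicity.
        try:
            death_pipe.recv()          # never receives data; blocks until EOF
        except (EOFError, OSError):
            pass                       # write end closed => parent is dead
        print("Parent process exited, terminating worker", flush=True)
        os._exit(1)                    # release hardware resources held here


    if __name__ == "__main__":
        ctx = multiprocessing.get_context("spawn")  # only the read end reaches the child
        reader, writer = ctx.Pipe(duplex=False)
        worker = ctx.Process(target=worker_main, args=(reader,))
        worker.start()
        reader.close()                 # parent keeps only the write end
        time.sleep(1.0)
        os._exit(0)                    # simulate abrupt parent death (like kill -9)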

Test Plan

  1. Start a vLLM instance with --distributed-executor-backend mp.
  2. Verify that /v1/completions works normally and /health returns 200.
  3. Kill the EngineCore process with kill -9 {engine_core_pid}.
  4. The API server process and all of its child processes exit (a scripted sketch of these steps follows below).
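
The steps above can also be scripted roughly as follows. This is only a hedged sketch: ENGINE_CORE_PID is a placeholder that has to be looked up manually (for example with ps --ppid <api_server_pid> and picking the EngineCore child), and the port and model path are taken from the serve command in the Test Result below.

    # Sketch of the manual test plan, using only the standard library.
    import json
    import os
    import signal
    import urllib.request

    BASE = "http://127.0.0.1:6000"
    ENGINE_CORE_PID = None  # placeholder: substitute the real EngineCore PID

    # Step 2: /health should return 200 and /v1/completions should work.
    assert urllib.request.urlopen(f"{BASE}/health").status == 200
    req = urllib.request.Request(
        f"{BASE}/v1/completions",
        data=json.dumps({"model": "/home/models/Qwen3-0.6B",
                         "prompt": "Hello", "max_tokens": 8}).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).status)

    # Step 3: kill the EngineCore process with SIGKILL.
    assert ENGINE_CORE_PID is not None, "set ENGINE_CORE_PID first"
    os.kill(ENGINE_CORE_PID, signal.SIGKILL)

    # Step 4: with this PR, the API server and all of its children should now
    # exit; without it, /health keeps returning 200 and new requests hang.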

Test Result

(venv) (base) root@ubuntu:/home/wuhang/venv# CUDA_VISIBLE_DEVICES=4 vllm serve /home/models/Qwen3-0.6B --trust-remote-code  --gpu-memory-utilization 0.85 --max-model-len 8192 --max-num-seq 30 --host 0.0.0.0 --port 6000 --distributed-executor-backend mp --enforce-eager
INFO 07-28 23:02:00 [__init__.py:235] Automatically detected platform cuda.
INFO 07-28 23:02:03 [api_server.py:1755] vLLM API server version 0.10.0rc2.dev60+g58b89da7e.d20250725
INFO 07-28 23:02:03 [cli_args.py:261] non-default args: {'model_tag': '/home/models/Qwen3-0.6B', 'host': '0.0.0.0', 'port': 6000, 'model': '/home/models/Qwen3-0.6B', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'distributed_executor_backend': 'mp', 'gpu_memory_utilization': 0.85, 'max_num_seqs': 30}
INFO 07-28 23:02:09 [config.py:1604] Using max model len 8192
INFO 07-28 23:02:10 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-28 23:02:15 [__init__.py:235] Automatically detected platform cuda.
INFO 07-28 23:02:17 [core.py:572] Waiting for init message from front-end.
INFO 07-28 23:02:17 [core.py:71] Initializing a V1 LLM engine (v0.10.0rc2.dev60+g58b89da7e.d20250725) with config: model='/home/models/Qwen3-0.6B', speculative_config=None, tokenizer='/home/models/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/models/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-28 23:02:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-28 23:02:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 16777216, 10, 'psm_3ff98987'), local_subscribe_addr='ipc:///tmp/5ea46bd6-0544-4c3a-b6d1-2ac7cd9261ed', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-28 23:02:21 [__init__.py:235] Automatically detected platform cuda.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_322127d8'), local_subscribe_addr='ipc:///tmp/1a429a0e-9cb9-4596-bfcc-b392b94c9295', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [gpu_model_runner.py:1843] Starting to load model /home/models/Qwen3-0.6B...
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [gpu_model_runner.py:1875] Loading model from scratch...
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.76it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.76it/s]
(VllmWorker rank=0 pid=2120280) 
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:25 [default_loader.py:262] Loading weights took 0.38 seconds
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:26 [gpu_model_runner.py:1892] Model loading took 1.1201 GiB and 0.494173 seconds
(VllmWorker rank=0 pid=2120280) /home/wuhang/venv/.venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
(VllmWorker rank=0 pid=2120280) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
(VllmWorker rank=0 pid=2120280)   warnings.warn(
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:27 [gpu_worker.py:255] Available KV cache memory: 18.83 GiB
INFO 07-28 23:02:27 [kv_cache_utils.py:833] GPU KV cache size: 176,240 tokens
INFO 07-28 23:02:27 [kv_cache_utils.py:837] Maximum concurrency for 8,192 tokens per request: 21.51x
INFO 07-28 23:02:27 [core.py:193] init engine (profile, create kv cache, warmup model) took 1.54 seconds
INFO 07-28 23:02:28 [loggers.py:141] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 11015
WARNING 07-28 23:02:28 [config.py:1528] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 07-28 23:02:28 [serving_responses.py:89] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [serving_chat.py:122] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 07-28 23:02:28 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:6000
INFO 07-28 23:02:28 [launcher.py:29] Available routes are:
INFO 07-28 23:02:28 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 07-28 23:02:28 [launcher.py:37] Route: /health, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /load, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /ping, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /ping, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /version, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /pooling, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /classify, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /score, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /invocations, Methods: POST
INFO 07-28 23:02:28 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [2119430]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
(VllmWorker rank=0 pid=2120280) INFO 07-28 23:02:46 [multiproc_executor.py:510] Parent process exited, terminating worker
ERROR 07-28 23:02:49 [core_client.py:526] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
ERROR 07-28 23:02:49 [async_llm.py:416] AsyncLLM output_handler failed.
ERROR 07-28 23:02:49 [async_llm.py:416] Traceback (most recent call last):
ERROR 07-28 23:02:49 [async_llm.py:416]   File "/home/wuhang/venv/vllm/vllm/v1/engine/async_llm.py", line 375, in output_handler
ERROR 07-28 23:02:49 [async_llm.py:416]     outputs = await engine_core.get_output_async()
ERROR 07-28 23:02:49 [async_llm.py:416]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-28 23:02:49 [async_llm.py:416]   File "/home/wuhang/venv/vllm/vllm/v1/engine/core_client.py", line 793, in get_output_async
ERROR 07-28 23:02:49 [async_llm.py:416]     raise self._format_exception(outputs) from None
ERROR 07-28 23:02:49 [async_llm.py:416] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2119430]
(venv) (base) root@ubuntu:/home/wuhang/venv# 

(Optional) Documentation Update


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important bug fixes for graceful handling of unexpected process exits in a multiprocessing environment.

  1. A monitoring thread is added to MPClient to watch over EngineCore processes. If an engine core process dies, the client is shut down, preventing hangs and ensuring the system state is consistent.
  2. A "death pipe" mechanism is implemented between EngineCore and its worker processes. This allows workers to detect if their parent EngineCore process has died, so they can terminate themselves and release resources, preventing orphan processes.

The implementation for both fixes appears robust and well-designed. I've identified one high-severity issue related to the resilience of the new engine core monitor thread, which could fail silently. Otherwise, the changes are clean and address the described issues effectively.

Comment on lines +526 to +541

Severity: high

The monitor_engine_cores function lacks exception handling around the multiprocessing.connection.wait(sentinels) call. If this call is interrupted (e.g., by a signal) or raises an OSError, the monitor thread will terminate silently. This would leave the engine core processes unmonitored, defeating the purpose of this bugfix.

To make the monitoring more robust, this call should be wrapped in a try...except block. In case of an exception, it would be safest to log the error and proceed with shutting down the client.

        def monitor_engine_cores():
            sentinels = [proc.sentinel for proc in engine_processes]
            try:
                died = multiprocessing.connection.wait(sentinels)
            except BaseException as e:
                _self = self_ref()
                if not _self or _self.resources.engine_dead:
                    return
                logger.error("Error in engine core monitor: %s. Shutting down.", e)
                _self.resources.engine_dead = True
                _self.shutdown()
                return

            _self = self_ref()
            if not _self or _self.resources.engine_dead:
                return
            _self.resources.engine_dead = True
            proc_name = next(proc.name for proc in engine_processes
                             if proc.sentinel == died[0])
            logger.error(
                "Engine core proc %s died unexpectedly, "
                "shutting down client.", proc_name)
            _self.shutdown()
            # Note: For MPClient, we don't have a failure callback mechanism
            # like MultiprocExecutor, but we set engine_dead flag which will
            # cause subsequent operations to raise EngineDeadError

@wuhang2014 wuhang2014 marked this pull request as draft July 28, 2025 08:32
@wuhang2014 wuhang2014 marked this pull request as ready for review July 28, 2025 08:53
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@vllm-bot vllm-bot merged commit bccc43c into vllm-project:main Jul 28, 2025
15 checks passed

@DarkLight1337 DarkLight1337 left a comment


Oops sorry I merged this by mistake, feel free to revert if it causes any problems

@wuhang2014

wuhang2014 commented Jul 28, 2025

Oops sorry I merged this by mistake, feel free to revert if it causes any problems

Please @ me if any bugs are found related to this PR.

@wuhang2014 wuhang2014 deleted the engine_monitor branch July 29, 2025 02:28
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
HsChen-sys pushed a commit to HsChen-sys/vllm that referenced this pull request Aug 1, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
@cadedaniel

Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit.

@ren2504413601

ren2504413601 commented Sep 30, 2025

In the Test Plan there is an error in the kill command "Kill CoreEngine process by kill -9 {engine_core_pid}", because kill -9 {engine_core_pid} cannot be caught by try...except. It should be kill {engine_core_pid}. @wuhang2014


@wuhang2014

wuhang2014 commented Oct 16, 2025

In the Test Plan there is an error in the kill command "Kill CoreEngine process by kill -9 {engine_core_pid}", because kill -9 {engine_core_pid} cannot be caught by try...except. It should be kill {engine_core_pid}. @wuhang2014


I think that even without this PR, the core engine behaves properly when it is killed with other signals that can be caught.
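
For context, here is a small standalone illustration (not vLLM code) of the distinction being discussed: SIGTERM and SIGINT can be intercepted by a Python handler, so a dying engine still gets a chance to report ENGINE_CORE_DEAD, whereas SIGKILL cannot be caught at all and can only be detected from the outside, which is what the process-sentinel monitor and death pipe in this PR provide.

    # Catchable vs. uncatchable signals (illustration only).
    import os
    import signal
    import sys
    import time


    def handler(signum, frame):
        # A SIGTERM/SIGINT handler gets a chance to clean up and notify others.
        print(f"caught signal {signum}, cleaning up before exit", flush=True)
        sys.exit(1)


    signal.signal(signal.SIGTERM, handler)   # `kill <pid>` lands here
    signal.signal(signal.SIGINT, handler)    # Ctrl-C lands here
    # signal.signal(signal.SIGKILL, handler) raises OSError: SIGKILL cannot be
    # caught, blocked, or ignored, so `kill -9 <pid>` gives no chance to react
    # and an external monitor is the only way to notice the death.

    print(f"pid={os.getpid()}; try `kill {os.getpid()}` vs `kill -9 {os.getpid()}`")
    while True:
        time.sleep(1)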

@wuhang2014

Thanks for the PR. Could we add a test for this PR? Otherwise, this behavior cannot be relied upon by downstream users as it can break in any commit.

I think I can.
