[P/D] Add a shutdown method to the Connector API #22699

chaunceyjiang · 2025-08-12T03:50:27Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Some P/D (prefill/decode) connectors need to perform cleanup, such as closing connections, when the decode (D) instance shuts down. This PR adds a shutdown method to the Connector API for that purpose.
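A minimal sketch of the idea, not the actual vLLM code: a base class exposes a no-op `shutdown()` hook, and a connector that owns connections overrides it. The names `ToyConnectorBase`, `ToySocketConnector`, and `FakeConnection` are hypothetical and exist only for illustration.

```python
class ToyConnectorBase:
    """Hypothetical stand-in for a connector base class."""

    def shutdown(self) -> None:
        """Release resources held by the connector. Default: do nothing."""
        return None


class FakeConnection:
    """Hypothetical connection object that counts close() calls."""

    def __init__(self) -> None:
        self.close_calls = 0

    def close(self) -> None:
        self.close_calls += 1


class ToySocketConnector(ToyConnectorBase):
    """Hypothetical connector that owns connections needing cleanup."""

    def __init__(self, connections) -> None:
        self._connections = list(connections)
        self._closed = False

    def shutdown(self) -> None:
        # Guard against double shutdown so the hook is safe to call twice.
        if self._closed:
            return
        self._closed = True
        for conn in self._connections:
            conn.close()
        self._connections.clear()
```

Because the base implementation does nothing, connectors without external resources would not have to override the hook in this sketch.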

Test Plan

Test Result

(Optional) Documentation Update

gemini-code-assist

Code Review

This pull request introduces a shutdown method to the KVConnector API, aimed at ensuring proper resource cleanup during shutdown. The implementation registers this new method with Python's atexit module to be called on process exit. While this addresses the need for cleanup, relying on atexit can be unreliable in some scenarios and may cause issues in complex applications. My review highlights this potential issue and suggests integrating the shutdown logic into vLLM's existing explicit shutdown sequence for better robustness.
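To illustrate the concern in general Python terms (this is not vLLM code): handlers registered with `atexit` are skipped when the process exits through `os._exit()` or is killed by a signal that Python does not handle, so an explicit `try`/`finally` path is more predictable. In the sketch below, `ToyResource` and `serve` are hypothetical names.

```python
import atexit


class ToyResource:
    """Hypothetical resource (for example, a connector) that needs cleanup."""

    def __init__(self) -> None:
        self.closed = False

    def shutdown(self) -> None:
        # Idempotent, so calling it from both paths is harmless.
        self.closed = True


def serve(resource: ToyResource, work) -> None:
    """Run `work` and always release `resource` on the explicit path."""
    # Best-effort fallback only: atexit handlers are skipped on os._exit()
    # or when the process dies from a signal Python does not handle.
    atexit.register(resource.shutdown)
    try:
        work()
    finally:
        # Explicit, deterministic cleanup, even if work() raises.
        resource.shutdown()
        atexit.unregister(resource.shutdown)
```

Here the explicit call does the real work, and the `atexit` registration only covers the window in which `work()` is running.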

vllm/distributed/kv_transfer/kv_transfer_state.py

github-actions · 2025-08-12T04:04:43Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

vllm/distributed/kv_transfer/kv_transfer_state.py

chaunceyjiang · 2025-08-13T12:51:17Z

/cc @njhill PTAL

KuntaiDu

LGTM, but also we need to let @njhill take a look.

vllm/worker/worker_base.py

chaunceyjiang · 2025-09-01T08:01:41Z

@NickLucche PTAL.

NickLucche

looking cleaner thanks!

panpan0000 · 2025-09-02T13:42:30Z

@KuntaiDu will different connectors like LMCache need to implement their own shutdown()?

njhill

Thanks @chaunceyjiang, please see inline comments.

In addition, I think we also need to call the executor shutdown method from EngineCore.shutdown() in core.py.
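The propagation being asked for can be sketched as a chain of `shutdown()` calls from the engine core down to the connector. This is only an illustration under assumed names (`ToyEngineCore`, `ToyExecutor`, `ToyWorker`, `ToyConnector`); it does not reproduce the real classes in core.py or the executors.

```python
class ToyConnector:
    """Hypothetical connector that records when it is shut down."""

    def __init__(self, log: list) -> None:
        self.log = log

    def shutdown(self) -> None:
        self.log.append("connector")


class ToyWorker:
    """Hypothetical worker that may or may not own a connector."""

    def __init__(self, connector, log: list) -> None:
        self.connector = connector
        self.log = log

    def shutdown(self) -> None:
        self.log.append("worker")
        if self.connector is not None:
            self.connector.shutdown()


class ToyExecutor:
    """Hypothetical executor that forwards shutdown to every worker."""

    def __init__(self, workers, log: list) -> None:
        self.workers = list(workers)
        self.log = log

    def shutdown(self) -> None:
        self.log.append("executor")
        for worker in self.workers:
            worker.shutdown()


class ToyEngineCore:
    """Hypothetical engine core: the top of the shutdown chain."""

    def __init__(self, executor: ToyExecutor, log: list) -> None:
        self.executor = executor
        self.log = log

    def shutdown(self) -> None:
        self.log.append("engine_core")
        # Without this call, nothing below the engine core is cleaned up.
        self.executor.shutdown()
```

If any layer omits the forwarding call, the layers beneath it never see the shutdown, which is the gap the comment points out.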

vllm/distributed/kv_transfer/kv_connector/v1/base.py

vllm/worker/worker_base.py

vllm/executor/uniproc_executor.py

njhill · 2025-09-02T21:32:34Z

> @KuntaiDu will different connectors like LMCache need to implement their own shutdown()?

@panpan0000 they don't need to, but they can if it makes sense.

njhill · 2025-09-02T22:15:58Z

@chaunceyjiang another missing link is that the multiproc worker proc should call shutdown() in its handling of death_pipe.

Instead of calling os.kill there, it might be better to put a sentinel message on the input queue and use a shutdown event in the dequeue() method of the worker busy loop. The death_pipe logic can trigger that path instead, and the loop can then call self.shutdown() before it exits.
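One possible reading of this suggestion, sketched with the standard library only: a monitor calls `request_shutdown()`, which sets an event and enqueues a sentinel, and the busy loop runs `shutdown()` on its way out. `ToyWorkerLoop`, `request_shutdown`, and `_SHUTDOWN` are hypothetical names, and the real worker's queue and death_pipe handling may differ.

```python
import queue
import threading

_SHUTDOWN = object()  # hypothetical sentinel marking "stop the loop"


class ToyWorkerLoop:
    """Hypothetical worker busy loop with a cooperative shutdown path."""

    def __init__(self) -> None:
        self.input_queue: queue.Queue = queue.Queue()
        self.shutdown_event = threading.Event()
        self.handled: list = []
        self.shutdown_called = False

    def request_shutdown(self) -> None:
        """What a death_pipe monitor could call instead of os.kill."""
        self.shutdown_event.set()
        self.input_queue.put(_SHUTDOWN)

    def dequeue(self, timeout: float = 0.1):
        """Return the next work item, or None once shutdown is requested."""
        while True:
            try:
                item = self.input_queue.get(timeout=timeout)
            except queue.Empty:
                # Fallback in case the sentinel never reached the queue.
                if self.shutdown_event.is_set():
                    return None
                continue
            if item is _SHUTDOWN:
                return None
            return item

    def shutdown(self) -> None:
        # Real cleanup (for example, connector.shutdown()) would go here.
        self.shutdown_called = True

    def busy_loop(self) -> None:
        try:
            while True:
                item = self.dequeue()
                if item is None:
                    break
                self.handled.append(item)
        finally:
            # Cleanup runs whether the loop ends normally or by exception.
            self.shutdown()
```

In this sketch, items queued before the sentinel are still processed in FIFO order, and `None` is reserved as the stop signal, so it cannot be used as a work item.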

vllm/v1/executor/multiproc_executor.py

njhill

Thanks @chaunceyjiang

chaunceyjiang · 2025-09-08T02:27:11Z


packages/vllm/distributed/device_communicators/cuda_communicator.py", line 52, in __init__
--
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]     self.pynccl_comm = PyNcclCommunicator(
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]                        ^^^^^^^^^^^^^^^^^^^
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 106, in __init__
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]     self.all_reduce(data)
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 127, in all_reduce
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]     self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()),
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 314, in ncclAllReduce
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]     self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588]     raise RuntimeError(f"NCCL error: {error_str}")
  | [2025-09-07T21:33:01Z] (EngineCore_0 pid=278) (VllmWorker pid=299) ERROR 09-07 14:33:01 [multiproc_executor.py:588] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

Hi @DarkLight1337, could you please re-run the e2e test?

gemini-code-assist bot reviewed Aug 12, 2025

vllm/distributed/kv_transfer/kv_transfer_state.py (outdated)

chaunceyjiang requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners August 12, 2025 08:34

mergify bot added the v1 label Aug 12, 2025

chaunceyjiang force-pushed the shutdown branch from ed8426a to 5790ef1 August 12, 2025 08:35

chaunceyjiang commented Aug 12, 2025

vllm/distributed/kv_transfer/kv_transfer_state.py (outdated)

chaunceyjiang force-pushed the shutdown branch from ddb1152 to 920ba0f August 13, 2025 08:07

chaunceyjiang requested review from youkaichao and zhuohan123 as code owners August 13, 2025 08:54

chaunceyjiang force-pushed the shutdown branch from fba851e to 25974c6 August 19, 2025 06:41

KuntaiDu approved these changes Aug 20, 2025

NickLucche requested changes Aug 29, 2025

vllm/worker/worker_base.py (outdated)

chaunceyjiang force-pushed the shutdown branch from 25974c6 to 344e254 September 1, 2025 07:21

mergify bot added the tpu label Sep 1, 2025

chaunceyjiang force-pushed the shutdown branch from 344e254 to 4624fd0 September 1, 2025 07:24

chaunceyjiang requested a review from NickLucche September 1, 2025 07:24

chaunceyjiang force-pushed the shutdown branch from 2712f6f to 333c2e2 September 1, 2025 08:05

NickLucche approved these changes Sep 1, 2025

njhill reviewed Sep 2, 2025

vllm/distributed/kv_transfer/kv_connector/v1/base.py (outdated)

vllm/worker/worker_base.py (outdated)

vllm/worker/worker_base.py (outdated)

vllm/executor/uniproc_executor.py (outdated)

chaunceyjiang force-pushed the shutdown branch from 505db0f to ce02dcc September 3, 2025 14:41

chaunceyjiang added 8 commits September 7, 2025 02:44

[P/D] Add a shutdown method to the Connector API

9609d8f

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

386d241

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

8db3e70

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

cb060b7

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

84c0144

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

7d139b0

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

cfbba8e

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[P/D] Add a shutdown method to the Connector API

44b8b43

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang force-pushed the shutdown branch from 49c605e to 44b8b43 September 7, 2025 02:47

chaunceyjiang requested a review from 22quinn as a code owner September 7, 2025 02:47

mergify bot removed the needs-rebase label Sep 7, 2025

[P/D] Add a shutdown method to the Connector API

cb60f67

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang commented Sep 7, 2025

vllm/v1/executor/multiproc_executor.py

njhill approved these changes Sep 7, 2025

njhill added the ready label Sep 7, 2025

vllm-bot merged commit 61aa4b2 into vllm-project:main Sep 8, 2025
48 of 51 checks passed

chaunceyjiang mentioned this pull request Sep 8, 2025

[P/D] MultiConnector supports shutdown #24425

Merged

eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025

[P/D] Add a shutdown method to the Connector API (vllm-project#22699)

02d1702

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

njhill mentioned this pull request Sep 11, 2025

[BugFix] Fix pipeline parallel #24621

Merged

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025

[P/D] Add a shutdown method to the Connector API (vllm-project#22699)

7fd342d

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[P/D] Add a shutdown method to the Connector API (vllm-project#22699)

31027ab

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

[P/D] Add a shutdown method to the Connector API (vllm-project#22699)

4619f7d

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

[P/D] Add a shutdown method to the Connector API (vllm-project#22699)

b503e64

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>


6 participants
