[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2174

underfituu · 2025-08-02T02:15:21Z

What this PR does / why we need it?

This PR addresses a critical issue where Node D (Device) failures cause Node P (Processor) to hang due to inability to release KV cache.

Trigger Scenarios:

Node D fails mid-inference (e.g., network disconnection)
Node D rejects requests at a certain stage (e.g., via API server)
Load-test script termination causes Node P or D to abort queued requests

Root Cause Analysis:

Currently, Node D sends a "KV cache pull complete, release approved" message to Node P
This message is transmitted via the worker connector. If PD connection breaks or requests are rejected upstream, Node D cannot send the message
Node P will never release KV cache without receiving this message

Solution:
Following VLLM community's approach (NIXL connector timeout mechanism), we're implementing:

A timeout mechanism with comprehensive warnings
Updated README documentation
Reference: VLLM's optimization PR #20139

Does this PR introduce any user-facing change?

None

How was this patch tested?

None

vLLM version: v0.10.2
vLLM main: vllm-project/vllm@9607d5e

github-actions · 2025-08-02T02:18:05Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

A PR should do only one thing, smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

codecov · 2025-08-02T03:11:37Z

Codecov Report

❌ Patch coverage is 20.00000% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.96%. Comparing base (6e00aed) to head (32838d3).
⚠️ Report is 11 commits behind head on main.

Files with missing lines	Patch %	Lines
..._ascend/distributed/llmdatadist_c_mgr_connector.py	20.00%	44 Missing ⚠️

❌ Your patch check has failed because the patch coverage (20.00%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2174      +/-   ##
==========================================
- Coverage   76.17%   75.96%   -0.22%     
==========================================
  Files         112      112              
  Lines       12459    12506      +47     
==========================================
+ Hits         9491     9500       +9     
- Misses       2968     3006      +38

Flag	Coverage Δ
unittests	`75.96% <20.00%> (-0.22%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-08-04T09:29:33Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

…tion (#2085) ### What this PR does / why we need it? This PR addresses a critical issue where Node D (Device) failures cause Node P (Processor) to hang due to inability to release KV cache. **Trigger Scenarios:** 1. Node D fails mid-inference (e.g., network disconnection) 2. Node D rejects requests at a certain stage (e.g., via API server) 3. Load-test script termination causes Node P or D to abort queued requests **Root Cause Analysis:** 1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P 2. This message is transmitted via the worker connector. If PD connection breaks or requests are rejected upstream, Node D cannot send the message 3. Node P will never release KV cache without receiving this message **Solution:** Following VLLM community's approach (NIXL connector timeout mechanism), we're implementing: - A timeout mechanism with comprehensive warnings - Updated README documentation - Reference: VLLM's optimization PR [#20139](vllm-project/vllm#20139) **Note:** The full disaster recovery solution is still in design. This PR will be merged into v091-dev branch simply but will evolve in main ([PR #2174](#2174)). ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: underfituu <hzhucong@163.com>

weijinqian0 · 2025-09-15T01:36:33Z

approved

wangxiyuan · 2025-09-18T06:23:39Z

please rebase and make CI happy

Signed-off-by: underfituu <hzhucong@163.com>

…lm-project#2174) ### What this PR does / why we need it? This PR addresses a critical issue where Node D (Device) failures cause Node P (Processor) to hang due to inability to release KV cache. **Trigger Scenarios:** 1. Node D fails mid-inference (e.g., network disconnection) 2. Node D rejects requests at a certain stage (e.g., via API server) 3. Load-test script termination causes Node P or D to abort queued requests **Root Cause Analysis:** 1. Currently, Node D sends a "KV cache pull complete, release approved" message to Node P 2. This message is transmitted via the worker connector. If PD connection breaks or requests are rejected upstream, Node D cannot send the message 3. Node P will never release KV cache without receiving this message **Solution:** Following VLLM community's approach (NIXL connector timeout mechanism), we're implementing: - A timeout mechanism with comprehensive warnings - Updated README documentation - Reference: VLLM's optimization PR [#20139](vllm-project/vllm#20139) ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? None - vLLM version: v0.10.2 - vLLM main: vllm-project/vllm@9607d5e --------- Signed-off-by: underfituu <hzhucong@163.com>

underfituu changed the title ~~auto_clear_kv_cache~~ [Bugfix][PD] Auto-clear producer KV cache if no pull notification #2085 Aug 2, 2025

github-actions bot added the module:core label Aug 2, 2025

underfituu mentioned this pull request Aug 2, 2025

[v0.9.1][Bugfix][PD] Auto-clear producer KV cache if no pull notification #2085

Merged

underfituu changed the title ~~[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2085~~ [Bugfix][PD] Auto-clear producer KV cache if no pull notification Aug 2, 2025

github-actions bot added the merge-conflicts label Aug 4, 2025

rebase main

79c1835

Signed-off-by: underfituu <hzhucong@163.com>

liziyu179 force-pushed the auto_clear_kv_cache_main branch from 32838d3 to 79c1835 Compare September 22, 2025 05:21

github-actions bot removed the merge-conflicts label Sep 22, 2025

underfituu added 3 commits September 22, 2025 13:31

auto_clear_cache_for_prefill_node

60d9421

Signed-off-by: underfituu <hzhucong@163.com>

fix envs_ascend

824006a

Signed-off-by: underfituu <hzhucong@163.com>

delete ascend env

03cc38a

Signed-off-by: underfituu <hzhucong@163.com>

wangxiyuan approved these changes Sep 22, 2025

View reviewed changes

wangxiyuan added ready read for review ready-for-test start test by label for PR labels Sep 22, 2025

wangxiyuan merged commit 8dd53c8 into vllm-project:main Sep 23, 2025
50 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2174

[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2174

Uh oh!

underfituu commented Aug 2, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Aug 2, 2025

Uh oh!

codecov bot commented Aug 2, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Aug 4, 2025

Uh oh!

weijinqian0 commented Sep 15, 2025

Uh oh!

wangxiyuan commented Sep 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2174

[Bugfix][PD] Auto-clear producer KV cache if no pull notification #2174

Uh oh!

Conversation

underfituu commented Aug 2, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

github-actions bot commented Aug 2, 2025

Uh oh!

codecov bot commented Aug 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Aug 4, 2025

Uh oh!

weijinqian0 commented Sep 15, 2025

Uh oh!

wangxiyuan commented Sep 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

underfituu commented Aug 2, 2025 •

edited by github-actions bot

Loading

codecov bot commented Aug 2, 2025 •

edited

Loading