[Bugfix] Fix kvpool precision synchronization #4574
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a critical race condition related to KV cache saving. By moving the trigger for the save operation from execute_model to sample_tokens, it ensures that the KV cache is fully computed before being saved, preventing data corruption. Additionally, redundant and ineffective synchronization calls have been removed from the KV transfer threads, improving code clarity. The changes are correct and significantly improve the robustness of the KV pooling mechanism.
    # tokens on the CPU, so they are run after bookkeeping.
    propose_draft_token_ids(valid_sampled_token_ids)

    self.maybe_wait_for_kv_save()
Moving self.maybe_wait_for_kv_save() to this location from execute_model is a critical fix for a race condition. Previously, the KV cache save operation could be triggered before the model's forward pass had completed, potentially leading to corrupted data being saved. By placing it here, after sampling operations that implicitly synchronize the device, we ensure the KV cache is fully populated and stable before initiating the save.
A minor suggestion for future improvement: the method name maybe_wait_for_kv_save is misleading as it appears to trigger an asynchronous save rather than waiting. Renaming it to something like trigger_kv_save_if_needed would improve code clarity.
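To make the ordering concrete, here is a toy sketch of the reordering in plain Python (no vLLM or NPU dependencies; the method names mirror the discussion above, but the bodies are hypothetical stand-ins for illustration only):

```python
class MockRunner:
    """Toy model runner illustrating the ordering fix (not vLLM code)."""

    def __init__(self):
        self.events = []

    def execute_model(self):
        # On the device this launch is asynchronous: when it returns,
        # the KV cache may not be fully written yet. Triggering the
        # save here (the old placement) races with the forward pass.
        self.events.append("forward_launched")

    def sample_tokens(self):
        # Sampling copies token ids to the CPU, which implicitly waits
        # for the forward pass to finish on the device.
        self.events.append("forward_complete")
        self.events.append("sampled")
        # New placement: trigger the KV save only after sampling, so
        # the cache is guaranteed to be fully populated.
        self.maybe_wait_for_kv_save()

    def maybe_wait_for_kv_save(self):
        self.events.append("kv_save_triggered")


runner = MockRunner()
runner.execute_model()
runner.sample_tokens()
print(runner.events)
```

The save trigger now always appears after the implicit device synchronization performed by sampling, which is the invariant this PR establishes.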
LGTM
    addr_list_tp = addr_list[self.tp_rank % self.put_step::self.put_step]
    size_list_tp = size_list[self.tp_rank % self.put_step::self.put_step]
    if key_list_tp:
        torch.npu.current_stream().synchronize()
The removal of torch.npu.current_stream().synchronize() is correct. This call was ineffective for synchronizing with the main model execution stream where the KV cache is produced, as it only synchronizes operations within the current thread's stream. Since there were no preceding NPU operations on this stream, the call was a no-op. The actual fix for the synchronization race condition is handled elsewhere by moving when the save operation is triggered. Removing this redundant synchronize() call cleans up the code.
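The per-thread no-op can be illustrated with an ordinary-threading analogy (plain Python, not NPU code; "synchronizing an empty stream" is modeled as an immediate return, and the names are hypothetical):

```python
import threading
import time

# Toy analogy for streams: the forward pass runs asynchronously on its
# own stream (here: a background thread), while the KV-transfer thread
# "synchronizes" only its own stream, which has no pending work.

kv_cache_ready = threading.Event()


def forward_pass():
    time.sleep(0.05)  # simulate a slow kernel filling the KV cache
    kv_cache_ready.set()


model_stream = threading.Thread(target=forward_pass)
model_stream.start()


def sync_empty_transfer_stream():
    # Analogue of torch.npu.current_stream().synchronize() in the
    # transfer thread: the thread's stream has no enqueued work, so the
    # call returns immediately and orders nothing against the model stream.
    pass


sync_empty_transfer_stream()
ready_after_noop_sync = kv_cache_ready.is_set()  # almost certainly False

model_stream.join()  # a real cross-stream wait; now the cache is ready
print(ready_after_noop_sync, kv_cache_ready.is_set())
```

This is why the real fix is to move the save trigger after an operation that actually waits on the model stream, rather than relying on a synchronize call in the transfer thread.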
please fix the merge conflict

This pull request has conflicts, please resolve those before we can evaluate the pull request.
Co-authored-by: baxingpiaochong <771405853@qq.com>
Signed-off-by: LCAIZJ <leichao139636@163.com>
It’s working now.
Signed-off-by: LCAIZJ <leichao139636@163.com>
What this PR does / why we need it?
Fix kvpool precision synchronization
Issue #4412