[Bugfix] Fix kvpool precision synchronization #4574
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a critical race condition related to KV cache saving. By moving the trigger for the save operation from execute_model to sample_tokens, it ensures that the KV cache is fully computed before being saved, preventing data corruption. Additionally, redundant and ineffective synchronization calls have been removed from the KV transfer threads, improving code clarity. The changes are correct and significantly improve the robustness of the KV pooling mechanism.
    # tokens on the CPU, so they are run after bookkeeping.
    propose_draft_token_ids(valid_sampled_token_ids)

    self.maybe_wait_for_kv_save()
Moving self.maybe_wait_for_kv_save() to this location from execute_model is a critical fix for a race condition. Previously, the KV cache save operation could be triggered before the model's forward pass had completed, potentially leading to corrupted data being saved. By placing it here, after sampling operations that implicitly synchronize the device, we ensure the KV cache is fully populated and stable before initiating the save.
A minor suggestion for future improvement: the method name maybe_wait_for_kv_save is misleading as it appears to trigger an asynchronous save rather than waiting. Renaming it to something like trigger_kv_save_if_needed would improve code clarity.
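To make the ordering concrete, here is a toy sketch of the reordering in plain Python (no vLLM or NPU dependencies; the method names mirror the discussion above, but the bodies are hypothetical stand-ins for illustration only):

```python
class MockRunner:
    """Toy model runner illustrating the ordering fix (not vLLM code)."""

    def __init__(self):
        self.events = []

    def execute_model(self):
        # On the device this launch is asynchronous: when it returns,
        # the KV cache may not be fully written yet. Triggering the
        # save here (the old placement) races with the forward pass.
        self.events.append("forward_launched")

    def sample_tokens(self):
        # Sampling copies token ids to the CPU, which implicitly waits
        # for the forward pass to finish on the device.
        self.events.append("forward_complete")
        self.events.append("sampled")
        # New placement: trigger the KV save only after sampling, so
        # the cache is guaranteed to be fully populated.
        self.maybe_wait_for_kv_save()

    def maybe_wait_for_kv_save(self):
        self.events.append("kv_save_triggered")


runner = MockRunner()
runner.execute_model()
runner.sample_tokens()
print(runner.events)
```

The save trigger now always appears after the implicit device synchronization performed by sampling, which is the invariant this PR establishes.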
LGTM
    addr_list_tp = addr_list[self.tp_rank % self.put_step::self.put_step]
    size_list_tp = size_list[self.tp_rank % self.put_step::self.put_step]
    if key_list_tp:
        torch.npu.current_stream().synchronize()
The removal of torch.npu.current_stream().synchronize() is correct. This call was ineffective for synchronizing with the main model execution stream where the KV cache is produced, as it only synchronizes operations within the current thread's stream. Since there were no preceding NPU operations on this stream, the call was a no-op. The actual fix for the synchronization race condition is handled elsewhere by moving when the save operation is triggered. Removing this redundant synchronize() call cleans up the code.
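The per-thread no-op can be illustrated with an ordinary-threading analogy (plain Python, not NPU code; "synchronizing an empty stream" is modeled as an immediate return, and the names are hypothetical):

```python
import threading
import time

# Toy analogy for streams: the forward pass runs asynchronously on its
# own stream (here: a background thread), while the KV-transfer thread
# "synchronizes" only its own stream, which has no pending work.

kv_cache_ready = threading.Event()


def forward_pass():
    time.sleep(0.05)  # simulate a slow kernel filling the KV cache
    kv_cache_ready.set()


model_stream = threading.Thread(target=forward_pass)
model_stream.start()


def sync_empty_transfer_stream():
    # Analogue of torch.npu.current_stream().synchronize() in the
    # transfer thread: the thread's stream has no enqueued work, so the
    # call returns immediately and orders nothing against the model stream.
    pass


sync_empty_transfer_stream()
ready_after_noop_sync = kv_cache_ready.is_set()  # almost certainly False

model_stream.join()  # a real cross-stream wait; now the cache is ready
print(ready_after_noop_sync, kv_cache_ready.is_set())
```

This is why the real fix is to move the save trigger after an operation that actually waits on the model stream, rather than relying on a synchronize call in the transfer thread.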
please fix the merge conflict

This pull request has conflicts, please resolve those before we can evaluate the pull request.
Co-authored-by: baxingpiaochong <771405853@qq.com>
Signed-off-by: LCAIZJ <leichao139636@163.com>
It’s working now.
Signed-off-by: LCAIZJ <leichao139636@163.com>
What this PR does / why we need it?
Fix kvpool precision synchronization
Issue #4412