-
-
Notifications
You must be signed in to change notification settings - Fork 10.7k
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager #18001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager #18001
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…mputed_tokens Signed-off-by: Chen Zhang <zhangch99@outlook.com>
The failing test looks related to KV cache |
@WoosukKwon Can you double-check the fix for NIXL compatibility? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the fix
vllm-project#18001 changed the behaviour subtly and broke some multi-connector cases. This change ensures we don't call the connector get_num_new_matched_tokens method a second time for a given request after an async load has completed. Signed-off-by: Nick Hill <nhill@redhat.com>
vllm-project#18001 changed the behaviour subtly and broke some multi-connector cases. This change ensures we don't call the connector get_num_new_matched_tokens method a second time for a given request after an async load has completed. Signed-off-by: Nick Hill <nhill@redhat.com>
* [BugFix] Fix handling of num_computed_tokens with connector vllm-project#18001 changed the behaviour subtly and broke some multi-connector cases. This change ensures we don't call the connector get_num_new_matched_tokens method a second time for a given request after an async load has completed. Signed-off-by: Nick Hill <nhill@redhat.com> * fix linting Signed-off-by: Nick Hill <nhill@redhat.com> * handle full cache hit on P/D decode worker case Signed-off-by: Nick Hill <nhill@redhat.com> * fix comment wording Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> --------- Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
…vllm-project#18001) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
For hybrid allocator, the logic to get num_computed_tokens from a KVCacheBlocks will be complex. To avoid computing that value in allocate_slots, we pass it from the return value of get_computed_blocks to the argument of allocate_slots