[Core] Use tuple for kv cache group block ids #19175
Conversation
Hello @njhill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR primarily focuses on changing the data structure used for representing KV cache group block IDs from a list of lists (list[list[int]]) to a tuple of lists (tuple[list[int]]). The author, @njhill, explains that a tuple is a more appropriate choice because the number of KV cache groups is fixed, and tuples offer better performance compared to lists for this use case. This change required updates across various core components, including the KV cache manager, scheduler, and worker runners. As a beneficial side effect, this refactor also surfaced and fixed a bug in the TPU model runner's handling of block IDs, bringing its logic in line with the GPU model runner. Additionally, an assertion related to KV connectors only supporting a single KV cache group has been moved to the scheduler's constructor for earlier validation.
Highlights
- Data Structure Change: The representation of KV cache group block IDs has been changed from `list[list[int]]` to `tuple[list[int]]` for improved clarity and potential performance benefits, reflecting the fixed number of KV cache groups.
- Bug Fix: A bug in `vllm/v1/worker/tpu_model_runner.py` related to appending new block IDs was identified and fixed as part of this change, ensuring correct block ID handling similar to the GPU runner.
- Code Structure Improvement: The assertion checking for a single KV cache group when using KV connectors has been moved from internal methods to the `Scheduler` constructor (`vllm/v1/core/sched/scheduler.py`) for earlier validation.
- Type Hint and Logic Updates: Corresponding type hints and logic across multiple files (`kv_cache_manager.py`, `output.py`, `scheduler.py`, `block_table.py`, `gpu_input_batch.py`, `gpu_model_runner.py`, `tpu_model_runner.py`) have been updated to accommodate the new tuple structure.
Changelog
- tests/v1/core/test_prefix_caching.py
  - Updated various assertions in test functions (`test_prefill`, `test_prefill_plp`, `test_decode`, `test_evict`, `test_mm_prefix_caching`, `test_cache_key_salting`, `test_reset_prefix_cache`) to expect the `get_block_ids()` method to return a tuple containing a list, instead of a list containing a list.
- vllm/v1/core/kv_cache_manager.py
  - Changed the return type hint of `KVCacheBlocks.get_block_ids` from `list[list[int]]` to `tuple[list[int]]` (line 36).
  - Updated the return value of `KVCacheBlocks.get_block_ids` to return a tuple containing the list of block IDs (line 45); a minimal sketch of this shape change follows the changelog.
  - Changed the return type hint of `KVCacheManager.get_block_ids` from `list[list[int]]` to `tuple[list[int]]` (line 369).
- vllm/v1/core/sched/output.py
  - Updated the type hint for `block_ids` in `NewRequestData` from `list[list[int]]` to `tuple[list[int]]` (line 30).
  - Updated the type hint for the `block_ids` parameter in `NewRequestData.from_request` from `list[list[int]]` to `tuple[list[int]]` (line 38).
  - Updated the type hint for `new_block_ids` in `CachedRequestData` from `list[list[int]]` to `tuple[list[int]]` (line 89).
  - Updated the type hint for the `new_block_ids` parameter in `CachedRequestData.from_request` from `list[list[int]]` to `tuple[list[int]]` (line 98).
- vllm/v1/core/sched/scheduler.py
  - Moved the assertion checking for a single KV cache group when a KV connector is configured to the `Scheduler.__init__` method (lines 79-81).
  - Changed the type hint for `req_to_new_block_ids` from `dict[str, list[list[int]]]` to `dict[str, tuple[list[int]]]` (line 183).
  - Fixed a typo in a comment: `prifix cached tokens` changed to `prefix cached tokens` (line 473).
  - Changed the type hint for the `new_block_ids` parameter in `_make_cached_request_data` from `list[list[int]]` to `tuple[list[int]]` (line 590).
  - Updated the logic in `_connector_finished` to unpack the single-element tuple returned by `get_block_ids` (line 991).
  - Updated the logic in `_update_waiting_for_remote_kv` to unpack the single-element tuple returned by `get_block_ids` (line 1011).
- vllm/v1/worker/block_table.py
  - Changed the type hint for the `block_ids` parameter in `BlockTable.append_row` from `list[list[int]]` to `tuple[list[int]]` (line 115).
  - Changed the type hint for the `block_ids` parameter in `BlockTable.add_row` from `list[list[int]]` to `tuple[list[int]]` (line 119).
- vllm/v1/worker/gpu_input_batch.py
  - Updated the type hint for `block_ids` in `CachedRequestState` from `list[list[int]]` to `tuple[list[int]]` (line 33).
- vllm/v1/worker/gpu_model_runner.py
  - Updated the loop for appending new block IDs in `_update_states` to use `zip` with `strict=True` to correctly handle the tuple structure (lines 463-466).
- vllm/v1/worker/tpu_model_runner.py
  - Updated the loop for appending new block IDs in `_update_states` to use `zip` with `strict=True` to correctly handle the tuple structure, fixing a bug where it previously expected a list of lists (lines 410-413).
List can change its size,
Tuple fixed, a constant view,
Choose the right container.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request is a solid improvement. The change to use tuple for KV cache group block IDs is well-justified for immutability and potential efficiency gains. Key highlights of this PR include:
- Correctness: A significant bug in `tpu_model_runner.py` related to extending block IDs for multiple cache groups has been fixed. The corresponding logic in `gpu_model_runner.py` has also been updated for consistency and to correctly handle the new tuple type.
- Robustness: An assertion regarding single KV cache group support for connectors has been moved to the `Scheduler` constructor for earlier failure detection. Additionally, a new assertion (`assert self.connector is not None`) has been added in `_update_waiting_for_remote_kv` to ensure `self.connector` is initialized, enhancing robustness.
- Clarity & Consistency: The codebase has been updated consistently to reflect the new `tuple[list[int]]` type, including type hints and test assertions.
- A minor typo was also corrected.
Overall, these changes improve code quality, correctness, and maintainability. The reasoning provided in the PR description is clear and the implementation appears sound.
Summary of Findings
- Bug Fix in TPU Model Runner: A critical bug in `tpu_model_runner.py` that incorrectly handled the extension of block IDs for multiple KV cache groups was fixed. The new implementation correctly iterates over and extends the inner lists.
- Improved Assertion Placement: The assertion for KV connector compatibility with single KV cache groups was moved to the `Scheduler` constructor, allowing for earlier detection of invalid configurations.
- Enhanced Robustness in Scheduler: A new assertion was added in `_update_waiting_for_remote_kv` to ensure `self.connector` is not `None`, improving code robustness.
- Refactoring in GPU Model Runner: The logic for updating block IDs in `gpu_model_runner.py` was refactored to use `zip` with `strict=True`, making it cleaner and safer for the new tuple-based type.
- Typo Correction (Low Severity - Not Commented): A typo 'prifix' was corrected to 'prefix' in `vllm/v1/core/sched/scheduler.py`.
- Type Hint and Test Updates (Low Severity - Not Commented): Type hints and test assertions were consistently updated across multiple files to reflect the change from `list[list[int]]` to `tuple[list[int]]`.
- Tuple Unpacking Adaptation (Low Severity - Not Commented): Code in `scheduler.py` was updated to use tuple unpacking `(block_ids, ) = ...` to correctly handle the new return type of `get_block_ids`; a minimal sketch of this pattern follows this list.
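For illustration, a minimal self-contained sketch of the unpacking pattern noted above, using a stand-in for `get_block_ids` and made-up block IDs (not the actual scheduler code):

```python
def get_block_ids() -> tuple[list[int]]:
    # Stand-in for KVCacheBlocks.get_block_ids(): a single-element tuple,
    # since KV connectors currently assume one KV cache group.
    return ([10, 11, 12], )

# Unpack the single-element tuple; this raises if the shape ever changes.
(block_ids, ) = get_block_ids()
assert block_ids == [10, 11, 12]
```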
Merge Readiness
This pull request is in excellent shape. The core change to use tuples is well-implemented, and the associated bug fix in the TPU runner is a significant improvement. The refactoring and robustness enhancements are also valuable. I believe these changes are ready for merging after addressing any minor points from the review comments, if any. As an AI, I am not authorized to approve pull requests, so please ensure other reviewers take a look before merging.
vllm/v1/worker/tpu_model_runner.py
This is an excellent catch and a critical fix! The previous req_state.block_ids.extend(req_data.new_block_ids) would indeed have behaved incorrectly if req_state.block_ids was a list of lists (representing multiple cache groups), as it would append the inner lists of req_data.new_block_ids to the outer list of req_state.block_ids, rather than extending each corresponding inner list. With the change to tuple[list[int]], the old code would have raised an AttributeError anyway since tuples don't have an extend method.
The new loop using zip correctly iterates through the corresponding inner lists and extends them. Using strict=True is also a good practice here. This significantly improves the correctness for multi-group KV cache scenarios on TPUs.
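To make the difference concrete, here is a small self-contained sketch of the old and new update patterns with made-up block IDs for two KV cache groups (only the `block_ids`/`new_block_ids` names come from the comment above; everything else is illustrative):

```python
# Per-group block IDs for two KV cache groups (values are made up).
block_ids = [[1, 2], [3, 4]]
new_block_ids = ([5], [6])

# Old (buggy) pattern: appends the inner lists to the outer container
# instead of extending each group's list -> [[1, 2], [3, 4], [5], [6]].
buggy = [list(group) for group in block_ids]
buggy.extend(new_block_ids)
assert buggy == [[1, 2], [3, 4], [5], [6]]

# Pattern used in this PR: extend each group's list pairwise;
# strict=True raises if the group counts ever disagree.
for group_blocks, new_group_blocks in zip(block_ids, new_block_ids, strict=True):
    group_blocks.extend(new_group_blocks)
assert block_ids == [[1, 2, 5], [3, 4, 6]]
```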
vllm/v1/core/sched/scheduler.py
Moving the assertion len(self.kv_cache_config.kv_cache_groups) == 1 to the constructor is a good improvement. It ensures that an invalid configuration with KV connectors and multiple cache groups is caught early during scheduler initialization, rather than potentially failing later during runtime. This enhances the robustness of the system. Well done!
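As a rough sketch of this early-validation pattern (class and attribute names follow the discussion; the real constructor takes many more arguments and is not reproduced here):

```python
class SchedulerSketch:
    def __init__(self, kv_cache_config, connector=None):
        self.kv_cache_config = kv_cache_config
        self.connector = connector
        if self.connector is not None:
            # KV connectors currently support only a single KV cache group,
            # so reject an invalid configuration at construction time rather
            # than failing later inside a scheduling step.
            assert len(self.kv_cache_config.kv_cache_groups) == 1, (
                "KV connectors require a single KV cache group")
```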
vllm/v1/core/sched/scheduler.py
Adding assert self.connector is not None here is a good defensive measure. Could you confirm if there were any scenarios observed or anticipated where _update_waiting_for_remote_kv might be called with self.connector being None? Even if not, this assertion improves the code's robustness by making assumptions explicit.
vllm/v1/worker/gpu_model_runner.py
This change to use zip for iterating over req_state.block_ids and req_data.new_block_ids is a nice improvement for handling the new tuple[list[int]] type. It's more Pythonic and the use of strict=True is a good safeguard to ensure both iterables have the same length (i.e., same number of KV cache groups). This makes the code cleaner and safer.
Agreed with you! But would you mind doing this refactor after #17996? It will be difficult for me to rebase otherwise.
@heheda12345 sure
@heheda12345 I've opened a separate PR with just the bugfix part, which can hopefully be merged sooner: #19228
This pull request has merge conflicts that must be resolved before it can be merged.
IMO tuple is a better fit than list here since there will be a fixed number of elements corresponding to the number of kv cache groups. It's also more efficient than using a list performance-wise. As part of this refactoring I've included a few other adjacent code simplifications. Signed-off-by: Nick Hill <nhill@redhat.com>
@heheda12345 I've now rebased this, and also included some other code simplifications. Please take a look at your convenience, thanks!
```python
# Merge the hit blocks of full attention and other attention.
hit_blocks = hit_blocks_other_attn
for group_id, blocks in enumerate(hit_blocks_full_attn):
    # NOTE: there is only one full attention group in most cases. So
    # the time complexity of insert is fine.
    hit_blocks.insert(group_id, blocks)
```
@heheda12345 I'm curious about why insert was used here to insert the full attn hit blocks at the beginning of the other attn list rather than just appending/extending the full attn blocks list with the other attn list?
@heheda12345 clarified that this is actually a bug and it was intended to be `for group_id, blocks in zip(self.full_attention_group_ids, hit_blocks_full_attn):`, which now makes sense to me!
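A small standalone sketch of the intended merge (group IDs and block values are made up; `full_attention_group_ids` holds the positions of the full-attention groups among all KV cache groups):

```python
# Suppose group 1 is full attention and groups 0 and 2 use other attention.
full_attention_group_ids = [1]
hit_blocks_full_attn = [["f0", "f1"]]     # one list per full-attention group
hit_blocks_other_attn = [["o0"], ["o2"]]  # one list per other-attention group

# Intended merge: place each full-attention hit list at its group index so
# the merged list follows the global KV cache group ordering.
hit_blocks = hit_blocks_other_attn
for group_id, blocks in zip(full_attention_group_ids, hit_blocks_full_attn):
    hit_blocks.insert(group_id, blocks)

assert hit_blocks == [["o0"], ["f0", "f1"], ["o2"]]
```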
Thank you very much for the huge effort of simplifying the hybrid allocator code.
vllm/v1/core/kv_cache_coordinator.py
```diff
-            # the time complexity of insert is fine.
-            hit_blocks.insert(group_id, blocks)
-        return hit_blocks, hit_length
+        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
```
Great simplification. But I think it is a coincidence that the full_attn_group_ids are smaller than the other_attn_group_ids, and I'd prefer to make it a little more general. Does it make sense to you?
```diff
-        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
+        if self.full_attn_first:
+            hit_blocks = hit_blocks_full_attn + hit_blocks_other_attn
+        else:
+            hit_blocks = hit_blocks_other_attn + hit_blocks_full_attn
+        return hit_blocks, hit_length
```
And also add a check in verify_and_split_kv_cache_groups:
```python
if max(self.full_attn_group_ids) < min(self.other_attn_group_ids):
    self.full_attn_first = True
elif max(self.other_attn_group_ids) < min(self.full_attn_group_ids):
    self.full_attn_first = False
else:
    raise ValueError(
        "HybridKVCacheCoordinator assumes the full "
        "attention group ids and other attention group ids "
        "do not interleave, either full attention group ids "
        "are before other attention group ids or vice versa. "
        "This is for simplifying merging hit_blocks_full_attn and "
        "hit_blocks_other_attn to hit_blocks.")
```
@heheda12345 sure, I can make these changes. Just to make sure I understand correctly - the existing logic also doesn't handle this case of the full attn group ids being larger?
Would it be simpler to just enforce this ordering, i.e. that full attention group ids always come before other attention groups, rather than having to add the conditional logic you're describing above?
> the existing logic also doesn't handle this case of the full attn group ids being larger?
By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.
> Would it be simpler to just enforce this ordering
I prefer not to enforce the order. The order is generated here
vllm/v1/core/kv_cache_utils.py, line 827 in eaa2e51:

```python
for layers in same_type_layers.values():
```
By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.
Actually no I meant your logic that inserts the full attn blocks in front of the other attn blocks... it wasn't obvious to me that the two orders were supported, but I may just be missing something.
> I prefer not to enforce the order.
Sure, I'll update.
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
LGTM. Thank you very much.
```python
            self.kv_cache_manager.req_to_block_hashes[request.request_id],
            num_computed_tokens,
        )
        self.kv_cache_manager.cache_blocks(request, num_computed_tokens)
```
The cache_blocks method should only be called when self.kv_cache_manager.enable_caching = True.
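A minimal sketch of the suggested guard, written as a hypothetical helper (the names mirror the excerpt above; the surrounding scheduler code is omitted):

```python
def maybe_cache_blocks(kv_cache_manager, request, num_computed_tokens):
    # Hypothetical helper: only touch the prefix cache when caching is
    # enabled, mirroring the guard suggested in this comment.
    if kv_cache_manager.enable_caching:
        kv_cache_manager.cache_blocks(request, num_computed_tokens)
```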
Thanks @hidva. To clarify, this comment isn't about the specific changes in this PR but rather an existing bug in the kv connector related changes here that you noticed that should be fixed?
I'll open a PR for that soon, or feel free to do so too!
> To clarify, this comment isn't about the specific changes in this PR but rather an existing bug in the kv connector related changes here that you noticed that should be fixed?
Yes, I chose the PR corresponding to the most recent change on this line for convenience. Sorry about that.