[Core] Use tuple for kv cache group block ids #19175
Conversation
Hello @njhill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR primarily focuses on changing the data structure used for representing KV cache group block IDs from a list of lists (list[list[int]]) to a tuple of lists (tuple[list[int]]). The author, @njhill, explains that a tuple is a more appropriate choice because the number of KV cache groups is fixed, and tuples offer better performance compared to lists for this use case. This change required updates across various core components, including the KV cache manager, scheduler, and worker runners. As a beneficial side effect, this refactor also surfaced and fixed a bug in the TPU model runner's handling of block IDs, bringing its logic in line with the GPU model runner. Additionally, an assertion related to KV connectors only supporting a single KV cache group has been moved to the scheduler's constructor for earlier validation.
Highlights
- Data Structure Change: The representation of KV cache group block IDs has been changed from `list[list[int]]` to `tuple[list[int]]` for improved clarity and potential performance benefits, reflecting the fixed number of KV cache groups.
- Bug Fix: A bug in `vllm/v1/worker/tpu_model_runner.py` related to appending new block IDs was identified and fixed as part of this change, ensuring correct block ID handling similar to the GPU runner.
- Code Structure Improvement: The assertion checking for a single KV cache group when using KV connectors has been moved from internal methods to the `Scheduler` constructor (`vllm/v1/core/sched/scheduler.py`) for earlier validation.
- Type Hint and Logic Updates: Corresponding type hints and logic across multiple files (`kv_cache_manager.py`, `output.py`, `scheduler.py`, `block_table.py`, `gpu_input_batch.py`, `gpu_model_runner.py`, `tpu_model_runner.py`) have been updated to accommodate the new tuple structure.
Changelog
- tests/v1/core/test_prefix_caching.py
  - Updated various assertions in test functions (`test_prefill`, `test_prefill_plp`, `test_decode`, `test_evict`, `test_mm_prefix_caching`, `test_cache_key_salting`, `test_reset_prefix_cache`) to expect the `get_block_ids()` method to return a tuple containing a list, instead of a list containing a list.
- vllm/v1/core/kv_cache_manager.py
  - Changed the return type hint of `KVCacheBlocks.get_block_ids` from `list[list[int]]` to `tuple[list[int]]` (line 36).
  - Updated the return value of `KVCacheBlocks.get_block_ids` to return a tuple containing the list of block IDs (line 45); a minimal sketch of this shape change follows the changelog.
  - Changed the return type hint of `KVCacheManager.get_block_ids` from `list[list[int]]` to `tuple[list[int]]` (line 369).
- vllm/v1/core/sched/output.py
  - Updated the type hint for `block_ids` in `NewRequestData` from `list[list[int]]` to `tuple[list[int]]` (line 30).
  - Updated the type hint for the `block_ids` parameter in `NewRequestData.from_request` from `list[list[int]]` to `tuple[list[int]]` (line 38).
  - Updated the type hint for `new_block_ids` in `CachedRequestData` from `list[list[int]]` to `tuple[list[int]]` (line 89).
  - Updated the type hint for the `new_block_ids` parameter in `CachedRequestData.from_request` from `list[list[int]]` to `tuple[list[int]]` (line 98).
- vllm/v1/core/sched/scheduler.py
  - Moved the assertion checking for a single KV cache group when a KV connector is configured to the `Scheduler.__init__` method (lines 79-81).
  - Changed the type hint for `req_to_new_block_ids` from `dict[str, list[list[int]]]` to `dict[str, tuple[list[int]]]` (line 183).
  - Fixed a typo in a comment: `prifix cached tokens` changed to `prefix cached tokens` (line 473).
  - Changed the type hint for the `new_block_ids` parameter in `_make_cached_request_data` from `list[list[int]]` to `tuple[list[int]]` (line 590).
  - Updated the logic in `_connector_finished` to unpack the single-element tuple returned by `get_block_ids` (line 991).
  - Updated the logic in `_update_waiting_for_remote_kv` to unpack the single-element tuple returned by `get_block_ids` (line 1011).
- vllm/v1/worker/block_table.py
  - Changed the type hint for the `block_ids` parameter in `BlockTable.append_row` from `list[list[int]]` to `tuple[list[int]]` (line 115).
  - Changed the type hint for the `block_ids` parameter in `BlockTable.add_row` from `list[list[int]]` to `tuple[list[int]]` (line 119).
- vllm/v1/worker/gpu_input_batch.py
  - Updated the type hint for `block_ids` in `CachedRequestState` from `list[list[int]]` to `tuple[list[int]]` (line 33).
- vllm/v1/worker/gpu_model_runner.py
  - Updated the loop for appending new block IDs in `_update_states` to use `zip` with `strict=True` to correctly handle the tuple structure (lines 463-466).
- vllm/v1/worker/tpu_model_runner.py
  - Updated the loop for appending new block IDs in `_update_states` to use `zip` with `strict=True` to correctly handle the tuple structure, fixing a bug where it previously expected a list of lists (lines 410-413).
List can change its size,
Tuple fixed, a constant view,
Choose the right container.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request is a solid improvement. The change to use tuple for KV cache group block IDs is well-justified for immutability and potential efficiency gains. Key highlights of this PR include:
- Correctness: A significant bug in `tpu_model_runner.py` related to extending block IDs for multiple cache groups has been fixed. The corresponding logic in `gpu_model_runner.py` has also been updated for consistency and to correctly handle the new tuple type.
- Robustness: An assertion regarding single KV cache group support for connectors has been moved to the `Scheduler` constructor for earlier failure detection. Additionally, a new assertion (`assert self.connector is not None`) has been added in `_update_waiting_for_remote_kv` to ensure `self.connector` is initialized, enhancing robustness.
- Clarity & Consistency: The codebase has been updated consistently to reflect the new `tuple[list[int]]` type, including type hints and test assertions.
- A minor typo was also corrected.
Overall, these changes improve code quality, correctness, and maintainability. The reasoning provided in the PR description is clear and the implementation appears sound.
Summary of Findings
- Bug Fix in TPU Model Runner: A critical bug in `tpu_model_runner.py` that incorrectly handled the extension of block IDs for multiple KV cache groups was fixed. The new implementation correctly iterates over and extends the inner lists.
- Improved Assertion Placement: The assertion for KV connector compatibility with single KV cache groups was moved to the `Scheduler` constructor, allowing for earlier detection of invalid configurations.
- Enhanced Robustness in Scheduler: A new assertion was added in `_update_waiting_for_remote_kv` to ensure `self.connector` is not `None`, improving code robustness.
- Refactoring in GPU Model Runner: The logic for updating block IDs in `gpu_model_runner.py` was refactored to use `zip` with `strict=True`, making it cleaner and safer for the new tuple-based type.
- Typo Correction (Low Severity - Not Commented): A typo 'prifix' was corrected to 'prefix' in `vllm/v1/core/sched/scheduler.py`.
- Type Hint and Test Updates (Low Severity - Not Commented): Type hints and test assertions were consistently updated across multiple files to reflect the change from `list[list[int]]` to `tuple[list[int]]`.
- Tuple Unpacking Adaptation (Low Severity - Not Commented): Code in `scheduler.py` was updated to use tuple unpacking `(block_ids, ) = ...` to correctly handle the new return type of `get_block_ids`; a minimal sketch of this pattern follows this list.
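For illustration, a minimal self-contained sketch of the unpacking pattern noted above, using a stand-in for `get_block_ids` and made-up block IDs (not the actual scheduler code):

```python
def get_block_ids() -> tuple[list[int]]:
    # Stand-in for KVCacheBlocks.get_block_ids(): a single-element tuple,
    # since KV connectors currently assume one KV cache group.
    return ([10, 11, 12], )

# Unpack the single-element tuple; this raises if the shape ever changes.
(block_ids, ) = get_block_ids()
assert block_ids == [10, 11, 12]
```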
Merge Readiness
This pull request is in excellent shape. The core change to use tuples is well-implemented, and the associated bug fix in the TPU runner is a significant improvement. The refactoring and robustness enhancements are also valuable. I believe these changes are ready for merging after addressing any minor points from the review comments, if any. As an AI, I am not authorized to approve pull requests, so please ensure other reviewers take a look before merging.
vllm/v1/worker/tpu_model_runner.py
This is an excellent catch and a critical fix! The previous req_state.block_ids.extend(req_data.new_block_ids) would indeed have behaved incorrectly if req_state.block_ids was a list of lists (representing multiple cache groups), as it would append the inner lists of req_data.new_block_ids to the outer list of req_state.block_ids, rather than extending each corresponding inner list. With the change to tuple[list[int]], the old code would have raised an AttributeError anyway since tuples don't have an extend method.
The new loop using zip correctly iterates through the corresponding inner lists and extends them. Using strict=True is also a good practice here. This significantly improves the correctness for multi-group KV cache scenarios on TPUs.
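To make the difference concrete, here is a small self-contained sketch of the old and new update patterns with made-up block IDs for two KV cache groups (only the `block_ids`/`new_block_ids` names come from the comment above; everything else is illustrative):

```python
# Per-group block IDs for two KV cache groups (values are made up).
block_ids = [[1, 2], [3, 4]]
new_block_ids = ([5], [6])

# Old (buggy) pattern: appends the inner lists to the outer container
# instead of extending each group's list -> [[1, 2], [3, 4], [5], [6]].
buggy = [list(group) for group in block_ids]
buggy.extend(new_block_ids)
assert buggy == [[1, 2], [3, 4], [5], [6]]

# Pattern used in this PR: extend each group's list pairwise;
# strict=True raises if the group counts ever disagree.
for group_blocks, new_group_blocks in zip(block_ids, new_block_ids, strict=True):
    group_blocks.extend(new_group_blocks)
assert block_ids == [[1, 2, 5], [3, 4, 6]]
```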
vllm/v1/core/sched/scheduler.py
Moving the assertion len(self.kv_cache_config.kv_cache_groups) == 1 to the constructor is a good improvement. It ensures that an invalid configuration with KV connectors and multiple cache groups is caught early during scheduler initialization, rather than potentially failing later during runtime. This enhances the robustness of the system. Well done!
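As a rough sketch of this early-validation pattern (class and attribute names follow the discussion; the real constructor takes many more arguments and is not reproduced here):

```python
class SchedulerSketch:
    def __init__(self, kv_cache_config, connector=None):
        self.kv_cache_config = kv_cache_config
        self.connector = connector
        if self.connector is not None:
            # KV connectors currently support only a single KV cache group,
            # so reject an invalid configuration at construction time rather
            # than failing later inside a scheduling step.
            assert len(self.kv_cache_config.kv_cache_groups) == 1, (
                "KV connectors require a single KV cache group")
```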
vllm/v1/core/sched/scheduler.py
Adding assert self.connector is not None here is a good defensive measure. Could you confirm if there were any scenarios observed or anticipated where _update_waiting_for_remote_kv might be called with self.connector being None? Even if not, this assertion improves the code's robustness by making assumptions explicit.
vllm/v1/worker/gpu_model_runner.py
This change to use zip for iterating over req_state.block_ids and req_data.new_block_ids is a nice improvement for handling the new tuple[list[int]] type. It's more Pythonic and the use of strict=True is a good safeguard to ensure both iterables have the same length (i.e., same number of KV cache groups). This makes the code cleaner and safer.
Agreed with you! But would you mind doing this refactor after #17996? It will be difficult for me to rebase otherwise.
@heheda12345 sure
@heheda12345 I've opened a separate PR with just the bugfix part, which can hopefully be merged sooner: #19228
This pull request has merge conflicts that must be resolved before it can be merged.
IMO tuple is a better fit than list here since there will be a fixed number of elements corresponding to the number of kv cache groups. It's also more efficient than using a list performance-wise. As part of this refactoring I've included a few other adjacent code simplifications. Signed-off-by: Nick Hill <nhill@redhat.com>
@heheda12345 I've now rebased this, and also included some other code simplifications. Please take a look at your convenience, thanks!
```python
# Merge the hit blocks of full attention and other attention.
hit_blocks = hit_blocks_other_attn
for group_id, blocks in enumerate(hit_blocks_full_attn):
    # NOTE: there is only one full attention group in most cases. So
    # the time complexity of insert is fine.
    hit_blocks.insert(group_id, blocks)
```
@heheda12345 I'm curious about why insert was used here to insert the full attn hit blocks at the beginning of the other attn list rather than just appending/extending the full attn blocks list with the other attn list?
@heheda12345 clarified that this is actually a bug and it was intended to be `for group_id, blocks in zip(self.full_attention_group_ids, hit_blocks_full_attn):`, which now makes sense to me!
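A small standalone sketch of the intended merge (group IDs and block values are made up; `full_attention_group_ids` holds the positions of the full-attention groups among all KV cache groups):

```python
# Suppose group 1 is full attention and groups 0 and 2 use other attention.
full_attention_group_ids = [1]
hit_blocks_full_attn = [["f0", "f1"]]     # one list per full-attention group
hit_blocks_other_attn = [["o0"], ["o2"]]  # one list per other-attention group

# Intended merge: place each full-attention hit list at its group index so
# the merged list follows the global KV cache group ordering.
hit_blocks = hit_blocks_other_attn
for group_id, blocks in zip(full_attention_group_ids, hit_blocks_full_attn):
    hit_blocks.insert(group_id, blocks)

assert hit_blocks == [["o0"], ["f0", "f1"], ["o2"]]
```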
Thank you very much for the huge effort of simplifying the hybrid allocator code.
vllm/v1/core/kv_cache_coordinator.py
```diff
-            # the time complexity of insert is fine.
-            hit_blocks.insert(group_id, blocks)
-        return hit_blocks, hit_length
+        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
```
Great simplification. But I think it is a coincidence that the full_attn_group_ids are smaller than the other_attn_group_ids, and I'd prefer to make it a little more general. Does it make sense to you?
```diff
-        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
+        if self.full_attn_first:
+            hit_blocks = hit_blocks_full_attn + hit_blocks_other_attn
+        else:
+            hit_blocks = hit_blocks_other_attn + hit_blocks_full_attn
+        return hit_blocks, hit_length
```
And also add a check in verify_and_split_kv_cache_groups:
```python
if max(self.full_attn_group_ids) < min(self.other_attn_group_ids):
    self.full_attn_first = True
elif max(self.other_attn_group_ids) < min(self.full_attn_group_ids):
    self.full_attn_first = False
else:
    raise ValueError(
        "HybridKVCacheCoordinator assumes the full "
        "attention group ids and other attention group ids "
        "do not interleave, either full attention group ids "
        "are before other attention group ids or vice versa. "
        "This is for simplifying merging hit_blocks_full_attn and "
        "hit_blocks_other_attn to hit_blocks.")
```
@heheda12345 sure, I can make these changes. Just to make sure I understand correctly - the existing logic also doesn't handle this case of the full attn group ids being larger?
Would it be simpler to just enforce this ordering, i.e. that full attention group ids always come before other attention groups, rather than having to add the conditional logic you're describing above?
> the existing logic also doesn't handle this case of the full attn group ids being larger?
By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.
> Would it be simpler to just enforce this ordering
I prefer not to enforce the order. The order is generated here
vllm/v1/core/kv_cache_utils.py, line 827 in eaa2e51:

```python
for layers in same_type_layers.values():
```
By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.
Actually no I meant your logic that inserts the full attn blocks in front of the other attn blocks... it wasn't obvious to me that the two orders were supported, but I may just be missing something.
> I prefer not to enforce the order.
Sure, I'll update.
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
LGTM. Thank you very much.
```python
            self.kv_cache_manager.req_to_block_hashes[request.request_id],
            num_computed_tokens,
        )
        self.kv_cache_manager.cache_blocks(request, num_computed_tokens)
```
The cache_blocks method should only be called when self.kv_cache_manager.enable_caching = True.
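A minimal sketch of the suggested guard, written as a hypothetical helper (the names mirror the excerpt above; the surrounding scheduler code is omitted):

```python
def maybe_cache_blocks(kv_cache_manager, request, num_computed_tokens):
    # Hypothetical helper: only touch the prefix cache when caching is
    # enabled, mirroring the guard suggested in this comment.
    if kv_cache_manager.enable_caching:
        kv_cache_manager.cache_blocks(request, num_computed_tokens)
```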
Thanks @hidva. To clarify, this comment isn't about the specific changes in this PR but rather an existing bug in the kv connector related changes here that you noticed that should be fixed?
I'll open a PR for that soon, or feel free to do so too!
> To clarify, this comment isn't about the specific changes in this PR but rather an existing bug in the kv connector related changes here that you noticed that should be fixed?
Yes, I chose the PR corresponding to the most recent change on this line for convenience. Sorry about that.