
Conversation

@njhill (Member) commented Jun 5, 2025

IMO a tuple is a better fit than a list here, since there will be a fixed number of elements corresponding to the number of KV cache groups. It's also more efficient than a list.

As part of this refactoring I've included a few other adjacent code simplifications.
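
For illustration, a minimal sketch of the shape change (the group count and block IDs below are hypothetical):

    # Before: a mutable list of per-group block ID lists.
    block_ids = [[0, 1, 2], [7, 8]]   # list[list[int]]

    # After: the same per-group lists in a fixed-size tuple, one entry per
    # KV cache group (two hypothetical groups shown here).
    block_ids = ([0, 1, 2], [7, 8])   # tuple of per-group list[int]

    # Inner lists remain mutable, so blocks can still be appended per group,
    # but the number of groups can no longer change by accident.
    block_ids[0].append(3)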

@njhill requested a review from heheda12345 June 5, 2025 02:52
@mergify bot added the v1 and tpu (Related to Google TPUs) labels Jun 5, 2025
@gemini-code-assist (bot) left a comment

Hello @njhill, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR primarily focuses on changing the data structure used for representing KV cache group block IDs from a list of lists (list[list[int]]) to a tuple of lists (tuple[list[int]]). The author, @njhill, explains that a tuple is a more appropriate choice because the number of KV cache groups is fixed, and tuples offer better performance compared to lists for this use case. This change required updates across various core components, including the KV cache manager, scheduler, and worker runners. As a beneficial side effect, this refactor also surfaced and fixed a bug in the TPU model runner's handling of block IDs, bringing its logic in line with the GPU model runner. Additionally, an assertion related to KV connectors only supporting a single KV cache group has been moved to the scheduler's constructor for earlier validation.

Highlights

  • Data Structure Change: The representation of KV cache group block IDs has been changed from list[list[int]] to tuple[list[int]] for improved clarity and potential performance benefits, reflecting the fixed number of KV cache groups.
  • Bug Fix: A bug in vllm/v1/worker/tpu_model_runner.py related to appending new block IDs was identified and fixed as part of this change, ensuring correct block ID handling similar to the GPU runner.
  • Code Structure Improvement: The assertion checking for a single KV cache group when using KV connectors has been moved from internal methods to the Scheduler constructor (vllm/v1/core/sched/scheduler.py) for earlier validation.
  • Type Hint and Logic Updates: Corresponding type hints and logic across multiple files (kv_cache_manager.py, output.py, scheduler.py, block_table.py, gpu_input_batch.py, gpu_model_runner.py, tpu_model_runner.py) have been updated to accommodate the new tuple structure.

Changelog

  • tests/v1/core/test_prefix_caching.py
    • Updated various assertions in test functions (test_prefill, test_prefill_plp, test_decode, test_evict, test_mm_prefix_caching, test_cache_key_salting, test_reset_prefix_cache) to expect the get_block_ids() method to return a tuple containing a list, instead of a list containing a list.
  • vllm/v1/core/kv_cache_manager.py
    • Changed the return type hint of KVCacheBlocks.get_block_ids from list[list[int]] to tuple[list[int]] (line 36).
    • Updated the return value of KVCacheBlocks.get_block_ids to return a tuple containing the list of block IDs (line 45); a simplified sketch of this pattern follows the changelog.
    • Changed the return type hint of KVCacheManager.get_block_ids from list[list[int]] to tuple[list[int]] (line 369).
  • vllm/v1/core/sched/output.py
    • Updated the type hint for block_ids in NewRequestData from list[list[int]] to tuple[list[int]] (line 30).
    • Updated the type hint for the block_ids parameter in NewRequestData.from_request from list[list[int]] to tuple[list[int]] (line 38).
    • Updated the type hint for new_block_ids in CachedRequestData from list[list[int]] to tuple[list[int]] (line 89).
    • Updated the type hint for the new_block_ids parameter in CachedRequestData.from_request from list[list[int]] to tuple[list[int]] (line 98).
  • vllm/v1/core/sched/scheduler.py
    • Moved the assertion checking for a single KV cache group when a KV connector is configured to the Scheduler.__init__ method (lines 79-81).
    • Changed the type hint for req_to_new_block_ids from dict[str, list[list[int]]] to dict[str, tuple[list[int]]] (line 183).
    • Fixed a typo in a comment: prifix cached tokens changed to prefix cached tokens (line 473).
    • Changed the type hint for the new_block_ids parameter in _make_cached_request_data from list[list[int]] to tuple[list[int]] (line 590).
    • Updated the logic in _connector_finished to unpack the single element tuple returned by get_block_ids (line 991).
    • Updated the logic in _update_waiting_for_remote_kv to unpack the single element tuple returned by get_block_ids (line 1011).
  • vllm/v1/worker/block_table.py
    • Changed the type hint for the block_ids parameter in BlockTable.append_row from list[list[int]] to tuple[list[int]] (line 115).
    • Changed the type hint for the block_ids parameter in BlockTable.add_row from list[list[int]] to tuple[list[int]] (line 119).
  • vllm/v1/worker/gpu_input_batch.py
    • Updated the type hint for block_ids in CachedRequestState from list[list[int]] to tuple[list[int]] (line 33).
  • vllm/v1/worker/gpu_model_runner.py
    • Updated the loop for appending new block IDs in _update_states to use zip with strict=True to correctly handle the tuple structure (lines 463-466).
  • vllm/v1/worker/tpu_model_runner.py
    • Updated the loop for appending new block IDs in _update_states to use zip with strict=True to correctly handle the tuple structure, fixing a bug where it previously expected a list of lists (lines 410-413).
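
A simplified sketch of the get_block_ids pattern referenced above (the classes below are stand-ins, not the actual vLLM implementation, and assume a single KV cache group):

    from dataclasses import dataclass

    @dataclass
    class KVCacheBlock:
        block_id: int

    @dataclass
    class FakeKVCacheBlocks:
        # Simplified stand-in for KVCacheBlocks with one KV cache group.
        blocks: list[KVCacheBlock]

        def get_block_ids(self) -> tuple[list[int]]:
            # One-element tuple: one list of block IDs per KV cache group.
            return ([blk.block_id for blk in self.blocks],)

    blocks = FakeKVCacheBlocks([KVCacheBlock(3), KVCacheBlock(7)])

    # Callers that assume a single group unpack the tuple explicitly, so a
    # second group would fail loudly instead of being silently ignored.
    (block_ids, ) = blocks.get_block_ids()
    assert block_ids == [3, 7]
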
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature                Command               Description
Code Review            /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary   /gemini summary       Provides a summary of the current pull request in its current state.
Comment                @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                   /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


List can change its size,
Tuple fixed, a constant view,
Choose the right container.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions bot commented Jun 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment

Code Review

This pull request is a solid improvement. The change to use tuple for KV cache group block IDs is well-justified for immutability and potential efficiency gains. Key highlights of this PR include:

  • Correctness: A significant bug in tpu_model_runner.py related to extending block IDs for multiple cache groups has been fixed. The corresponding logic in gpu_model_runner.py has also been updated for consistency and to correctly handle the new tuple type.
  • Robustness: An assertion regarding single KV cache group support for connectors has been moved to the Scheduler constructor for earlier failure detection. Additionally, a new assertion (assert self.connector is not None) has been added in _update_waiting_for_remote_kv to ensure self.connector is initialized, enhancing robustness.
  • Clarity & Consistency: The codebase has been updated consistently to reflect the new tuple[list[int]] type, including type hints and test assertions.
  • A minor typo was also corrected.

Overall, these changes improve code quality, correctness, and maintainability. The reasoning provided in the PR description is clear and the implementation appears sound.

Summary of Findings

  • Bug Fix in TPU Model Runner: A critical bug in tpu_model_runner.py that incorrectly handled the extension of block IDs for multiple KV cache groups was fixed. The new implementation correctly iterates and extends the inner lists.
  • Improved Assertion Placement: The assertion for KV connector compatibility with single KV cache groups was moved to the Scheduler constructor, allowing for earlier detection of invalid configurations.
  • Enhanced Robustness in Scheduler: A new assertion was added in _update_waiting_for_remote_kv to ensure self.connector is not None, improving code robustness.
  • Refactoring in GPU Model Runner: The logic for updating block IDs in gpu_model_runner.py was refactored to use zip with strict=True, making it cleaner and safer for the new tuple-based type.
  • Typo Correction (Low Severity - Not Commented): A typo 'prifix' was corrected to 'prefix' in vllm/v1/core/sched/scheduler.py.
  • Type Hint and Test Updates (Low Severity - Not Commented): Type hints and test assertions were consistently updated across multiple files to reflect the change from list[list[int]] to tuple[list[int]].
  • Tuple Unpacking Adaptation (Low Severity - Not Commented): Code in scheduler.py was updated to use tuple unpacking (block_ids, ) = ... to correctly handle the new return type of get_block_ids.

Merge Readiness

This pull request is in excellent shape. The core change to use tuples is well-implemented, and the associated bug fix in the TPU runner is a significant improvement. The refactoring and robustness enhancements are also valuable. I believe these changes are ready for merging after addressing any minor points from the review comments, if any. As an AI, I am not authorized to approve pull requests, so please ensure other reviewers take a look before merging.

Comment on lines 410 to 420
Severity: high

This is an excellent catch and a critical fix! The previous req_state.block_ids.extend(req_data.new_block_ids) would indeed have behaved incorrectly if req_state.block_ids was a list of lists (representing multiple cache groups), as it would append the inner lists of req_data.new_block_ids to the outer list of req_state.block_ids, rather than extending each corresponding inner list. With the change to tuple[list[int]], the old code would have raised an AttributeError anyway since tuples don't have an extend method.

The new loop using zip correctly iterates through the corresponding inner lists and extends them. Using strict=True is also a good practice here. This significantly improves the correctness for multi-group KV cache scenarios on TPUs.
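
A simplified before/after sketch of that pattern (the values and names below are hypothetical; the real _update_states does more):

    # Hypothetical state: one list of block IDs per KV cache group.
    block_ids = ([0, 1], [10, 11])        # req_state.block_ids
    new_block_ids = ([2], [12, 13])       # req_data.new_block_ids

    # Old pattern: with an outer list this appended the inner lists themselves,
    # yielding [[0, 1], [10, 11], [2], [12, 13]]; with a tuple it raises
    # AttributeError because tuples have no extend method.
    #   block_ids.extend(new_block_ids)

    # New pattern: extend each group's list with that group's new block IDs.
    for group_ids, new_ids in zip(block_ids, new_block_ids, strict=True):
        group_ids.extend(new_ids)

    assert block_ids == ([0, 1, 2], [10, 11, 12, 13])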

Comment on lines 79 to 81
Severity: medium

Moving the assertion len(self.kv_cache_config.kv_cache_groups) == 1 to the constructor is a good improvement. It ensures that an invalid configuration with KV connectors and multiple cache groups is caught early during scheduler initialization, rather than potentially failing later during runtime. This enhances the robustness of the system. Well done!
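
A minimal sketch of the early-validation idea (hypothetical, heavily simplified constructor; the real Scheduler takes many more arguments):

    class Scheduler:
        def __init__(self, kv_cache_config, connector=None):
            self.kv_cache_config = kv_cache_config
            self.connector = connector
            if self.connector is not None:
                # Fail at construction time rather than mid-run: KV connectors
                # currently support only a single KV cache group.
                assert len(self.kv_cache_config.kv_cache_groups) == 1, (
                    "KV connectors require exactly one KV cache group")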

Severity: medium

Adding assert self.connector is not None here is a good defensive measure. Could you confirm if there were any scenarios observed or anticipated where _update_waiting_for_remote_kv might be called with self.connector being None? Even if not, this assertion improves the code's robustness by making assumptions explicit.

Comment on lines 463 to 467
Severity: medium

This change to use zip for iterating over req_state.block_ids and req_data.new_block_ids is a nice improvement for handling the new tuple[list[int]] type. It's more Pythonic and the use of strict=True is a good safeguard to ensure both iterables have the same length (i.e., same number of KV cache groups). This makes the code cleaner and safer.
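
A small illustration of the safeguard strict=True provides (hypothetical values; without it, a shorter iterable would be silently truncated):

    block_ids = ([0, 1], [10, 11])   # two KV cache groups
    new_block_ids = ([2],)           # mismatched: new IDs for only one group

    try:
        for group_ids, new_ids in zip(block_ids, new_block_ids, strict=True):
            group_ids.extend(new_ids)
    except ValueError as exc:
        # strict=True turns the silent group mismatch into an immediate error.
        print(f"caught mismatch: {exc}")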

@heheda12345 (Collaborator) commented:

Agreed with you! But would you mind doing this refactor after #17996? It will be difficult for me to rebase otherwise.

@njhill (Member, Author) commented Jun 5, 2025

@heheda12345 sure

@njhill (Member, Author) commented Jun 5, 2025

@heheda12345 I've opened a separate PR with just the bugfix part, which can hopefully be merged sooner: #19228

@mergify bot commented Jun 6, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Jun 6, 2025
@njhill force-pushed the cache-group-tuple branch from 28355ed to 393c8f3 on June 6, 2025 23:37
@mergify bot removed the needs-rebase label Jun 6, 2025
@njhill (Member, Author) commented Jun 6, 2025

@heheda12345 I've now rebased this, and also included some other code simplifications. Please take a look at your convenience, thanks!

Comment on lines 331 to 341
    # Merge the hit blocks of full attention and other attention.
    hit_blocks = hit_blocks_other_attn
    for group_id, blocks in enumerate(hit_blocks_full_attn):
        # NOTE: there is only one full attention group in most cases. So
        # the time complexity of insert is fine.
        hit_blocks.insert(group_id, blocks)
@njhill (Member, Author) commented:

@heheda12345 I'm curious about why insert was used here to insert the full attn hit blocks at the beginning of the other attn list rather than just appending/extending the full attn blocks list with the other attn list?

@njhill (Member, Author) commented Jun 9, 2025

@heheda12345 clarified that this is actually a bug and it was intended to be:

        for group_id, blocks in zip(self.full_attention_group_ids, hit_blocks_full_attn):

which now makes sense to me!
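
Putting that one-liner back into context, a rough sketch of the intended merge as a standalone function (names and values are illustrative, not the actual vLLM code):

    def merge_hit_blocks(hit_blocks_full_attn, hit_blocks_other_attn,
                         full_attention_group_ids):
        # Start from the other-attention groups' hit blocks, then insert each
        # full-attention group's blocks at that group's real index, rather
        # than always at positions 0, 1, ... as enumerate() did.
        hit_blocks = list(hit_blocks_other_attn)
        for group_id, blocks in zip(full_attention_group_ids,
                                    hit_blocks_full_attn):
            hit_blocks.insert(group_id, blocks)
        return hit_blocks

    # Hypothetical layout: full-attention groups have ids 0 and 2.
    print(merge_hit_blocks([["f0"], ["f2"]], [["o1"]], [0, 2]))
    # -> [['f0'], ['o1'], ['f2']]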

@heheda12345 (Collaborator) left a comment

Thank you very much for the huge effort of simplifying the hybrid allocator code.

-            # the time complexity of insert is fine.
-            hit_blocks.insert(group_id, blocks)
-        return hit_blocks, hit_length
+        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
@heheda12345 (Collaborator) commented:

Great simplification. But I think it is a coincidence that full_attn_group_ids are smaller than other_attn_group_ids, and I'd prefer to make it a little more general. Does that make sense to you?

Suggested change:

-        return hit_blocks_full_attn + hit_blocks_other_attn, hit_length
+        if self.full_attn_first:
+            hit_blocks = hit_blocks_full_attn + hit_blocks_other_attn
+        else:
+            hit_blocks = hit_blocks_other_attn + hit_blocks_full_attn
+        return hit_blocks, hit_length

And also add a check in verify_and_split_kv_cache_groups:

        if max(self.full_attn_group_ids) < min(self.other_attn_group_ids):
            self.full_attn_first = True
        elif max(self.other_attn_group_ids) < min(self.full_attn_group_ids):
            self.full_attn_first = False
        else:
            raise ValueError(
                "HybridKVCacheCoordinator assumes the full "
                "attention group ids and other attention group ids "
                "do not interleave, either full attention group ids "
                "are before other attention group ids or vice versa."
                "This is for simplifying merging hit_blocks_full_attn and "
                "hit_blocks_other_attn to hit_blocks.")

@njhill (Member, Author) commented:

@heheda12345 sure, I can make these changes. Just to make sure I understand correctly - the existing logic also doesn't handle this case of the full attn group ids being larger?

Would it be simpler to just enforce this ordering, i.e. that full attention group ids always come before other attention groups? Rather than having to add the conditional logic you're describing above?

@heheda12345 (Collaborator) commented:

the existing logic also doesn't handle this case of the full attn group ids being larger?

By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.

Would it be simpler to just enforce this ordering

I prefer not to enforce the order. The order is generated here:

    for layers in same_type_layers.values():

which is quite far from the kv_cache_manager, so I feel it may be hacky to enforce an order there just to avoid this branch. I prefer to make each module more self-contained.

@njhill (Member, Author) commented:

By "existing logic" do you mean the logic you implemented? If so, then your understanding is correct.

Actually no I meant your logic that inserts the full attn blocks in front of the other attn blocks... it wasn't obvious to me that the two orders were supported, but I may just be missing something.

I prefer not to enforce the order.

Sure, I'll update.

@WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 9, 2025
njhill added 2 commits June 9, 2025 12:18
@WoosukKwon added this to the v0.9.1 milestone Jun 9, 2025
@heheda12345 (Collaborator) left a comment

LGTM. Thank you very much.

@heheda12345 merged commit 646d62f into vllm-project:main Jun 10, 2025
64 checks passed

            self.kv_cache_manager.req_to_block_hashes[request.request_id],
            num_computed_tokens,
        )
        self.kv_cache_manager.cache_blocks(request, num_computed_tokens)
@hidva (Contributor) commented Jun 10, 2025

The cache_blocks method should only be called when self.kv_cache_manager.enable_caching = True.
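
In diff form, a sketch of the guard being suggested (illustrative only; the eventual fix may differ in detail):

-        self.kv_cache_manager.cache_blocks(request, num_computed_tokens)
+        if self.kv_cache_manager.enable_caching:
+            self.kv_cache_manager.cache_blocks(request, num_computed_tokens)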

@njhill (Member, Author) commented:

Thanks @hidva. To clarify, this comment isn't about the specific changes in this PR, but rather an existing bug you noticed in the KV-connector-related changes here that should be fixed?

I'll open a PR for that soon, or feel free to do so too!

@njhill (Member, Author) commented:

@hidva #19435 I included you as co-author.

@hidva (Contributor) commented:

To clarify, this comment isn't about the specific changes in this PR, but rather an existing bug you noticed in the KV-connector-related changes here that should be fixed?

Yes, I chose the PR corresponding to the most recent change on this line for convenience. Sorry about that.
