
Conversation

lsy323 (Collaborator) commented Jun 9, 2025

No description provided.

sarckk and others added 8 commits June 6, 2025 11:27

Signed-off-by: Yong Hoon Shin <yhshin@meta.com> (repeated across six commits)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>

clean up

Signed-off-by: Siyuan Liu <lsiyuan@google.com>

clean up

Signed-off-by: Siyuan Liu <lsiyuan@google.com>
lsy323 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 9, 2025
github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment


Hello @lsy323, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, Gemini here to provide a summary of this pull request. This PR focuses on fixing and enabling several tests within the TPU model runner specifically related to KV cache sharing. The main goal is to ensure the tests accurately validate the logic for initializing and estimating KV cache size and maximum context length when KV cache sharing is configured, addressing issues that caused these tests to be skipped previously.

Highlights

  • Test Unskipping: Several tests (test_init_kv_cache_with_kv_sharing_invalid_target_layer_order, test_init_kv_cache_with_kv_sharing_target_layer_not_exist, test_init_kv_cache_with_kv_sharing_target_same_as_current, test_init_kv_cache_without_kv_sharing, test_init_kv_cache_with_kv_sharing_valid) that were previously skipped on TPU have been unskipped.
  • Test Setup Refactoring: The test setup in test_tpu_model_runner.py has been refactored. Module-level patching and cleanup fixtures for torch_xla and PallasAttentionBackend have been removed. Helper functions get_vllm_config and get_model_runner were introduced to standardize the creation of VLLM config and model runner instances within tests.
  • KV Cache Calculation Corrections: The calculations and assertions within the KV cache initialization tests have been updated. This includes correcting the assumed page size for KV cache tensors (from 32KB to 512KB) and adjusting the expected number of blocks and maximum context length estimations accordingly for both shared and non-shared KV cache scenarios.
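
As a quick sanity check, the corrected 512KB page size follows directly from the shape parameters quoted in the test comment. The Python sketch below re-derives that number; the memory budget and layer count at the end are made-up values purely to illustrate how the block count and max context length estimates fall out of the page size, not figures from this PR:

    # Per-layer, per-block KV cache page size, using the values quoted in the
    # test comment in tests/v1/tpu/worker/test_tpu_model_runner.py.
    kv_factor = 2         # K and V planes (non-MLA attention)
    num_kv_heads = 8
    head_dim = 128
    dtype_bytes = 2       # bfloat16
    block_size = 128      # tokens per block

    page_size_bytes = kv_factor * num_kv_heads * head_dim * dtype_bytes * block_size
    assert page_size_bytes == 512 * 1024  # 512KB, not the 32KB assumed before

    # Illustration only: with a hypothetical memory budget, the block count
    # and max context length estimates scale inversely with the page size.
    available_memory = 1024 * 1024 * 1024             # 1 GiB (made up)
    num_layers = 1                                    # made up
    num_blocks = available_memory // (page_size_bytes * num_layers)
    max_context_len = num_blocks * block_size         # 262,144 tokens here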

Changelog

  • tests/v1/tpu/worker/test_tpu_model_runner.py
    • Removed unused import unittest.mock.
    • Removed module-level patching of torch_xla and PallasAttentionBackend.
    • Removed cleanup_patches fixture.
    • Introduced get_vllm_config and get_model_runner helper functions for test setup.
    • Modified the model_runner pytest fixture to use the new helper functions.
    • Unskipped five tests related to KV cache sharing logic by removing @pytest.mark.skip.
    • Updated the unskipped tests to correctly use the model_runner fixture or explicitly create config/runner instances.
    • Added set_current_vllm_config context manager to relevant tests (see the sketch after this list).
    • Corrected expected page size, block counts, and max context length calculations in KV cache tests.
    • Updated access method for KV cache tensors in assertions from dictionary lookup to list indexing.
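
A minimal sketch of the test-setup pattern these changelog entries describe. The helpers get_vllm_config and get_model_runner and the use of set_current_vllm_config are named in this PR, but their exact signatures and the test body below are assumptions for illustration:

    # Sketch only: helper signatures are assumed, not confirmed by the diff.
    from vllm.config import set_current_vllm_config

    def test_init_kv_cache_without_kv_sharing():
        vllm_config = get_vllm_config()       # helper introduced by this PR
        with set_current_vllm_config(vllm_config):
            # Attention layers built inside this context pick up the active
            # vLLM config, which the KV cache sharing logic depends on.
            model_runner = get_model_runner(vllm_config)
        # ...then build the KV cache spec from model_runner and assert on
        # page size, block count, and max context length as the tests do.
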
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

mergify bot added the v1 and tpu (Related to Google TPUs) labels Jun 9, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request focuses on fixing KV cache sharing tests for TPUs. The main changes involve removing global mocks, refactoring test fixtures, and updating test assertions to reflect new KV cache sizing calculations. The use of set_current_vllm_config seems to be a key part of the fix, ensuring that Attention layers are instantiated with the correct configuration context within the tests.

Overall, the changes look good and seem to make the tests runnable and more accurate. The updated comments explaining cache calculations are particularly helpful. There are a couple of minor points for clarification and cleanup.

Summary of Findings

  • Mocking Strategy: Global mocks for torch_xla and PallasAttentionBackend were removed. This is good if the test environment guarantees their availability, but clarification on this assumption would be helpful.
  • Test Logic and Clarity: The unskipping of tests and the use of set_current_vllm_config are positive changes. The updates to KV cache size calculations and assertions, along with improved comments, enhance test accuracy and readability.
  • Minor Cleanups: An outdated comment in the model_runner fixture and a couple of unused variables (vllm_ctx) were noted.

Merge Readiness

The pull request significantly improves the TPU KV cache sharing tests by making them runnable and updating their assertions. The refactoring of fixtures and removal of potentially problematic global mocks are good steps.

With minor clarifications on the mocking strategy and a few small cleanups (outdated comment, unused variables), this PR should be in good shape for merging. I am unable to approve the pull request, so please have other reviewers take a look and approve before merging.

Comment on lines +480 to +482:

    # page size for each layer KV can be calculated as
    # 2 (non-MLA) * 8 (num_heads) * 128 (head_dim)
    # * 2 (bfloat16, kv_cache dtype) * 128 (block_size) = 512KB

gemini-code-assist bot (severity: medium):

The updated comment explaining the page size calculation is much clearer and more detailed. This is a great improvement for understanding the test's assumptions!

Comment on lines +501 to +503:

    for kv_cache_tensor in kv_cache_config.kv_cache_tensors:
        kv_cache_tensor.size = (
            kv_cache_spec[kv_cache_tensor.shared_by[0]].page_size_bytes)

gemini-code-assist bot (severity: medium):

The change from iterating kv_cache_config.tensors (which was a dict) to kv_cache_config.kv_cache_tensors (which is a list), and using kv_cache_tensor.shared_by[0] to get the layer name for kv_cache_spec, looks correct given the structure of KVCacheConfig and KVCacheTensor. This aligns well with how KV cache tensors can be defined and potentially shared.
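
For contrast, here is a sketch of what that access change looks like in test code. The old dict-style lookup is reconstructed from this review comment, and the layer name in it is a hypothetical example, not one taken from the PR:

    # Before: kv_cache_config.tensors was a dict keyed by layer name.
    # size = kv_cache_config.tensors["model.layers.0.self_attn.attn"].size

    # After: kv_cache_config.kv_cache_tensors is a list, and each entry
    # records the layers sharing it, so the spec is keyed via shared_by[0].
    for kv_cache_tensor in kv_cache_config.kv_cache_tensors:
        layer_name = kv_cache_tensor.shared_by[0]
        kv_cache_tensor.size = kv_cache_spec[layer_name].page_size_bytes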

sarckk (Collaborator) left a comment


LGTM! Thanks.

lsy323 requested review from heheda12345 and mgoin June 9, 2025 22:31
lsy323 (Collaborator, Author) commented Jun 9, 2025

Thanks so much to @sarckk for the work in #19155. This fix is based on #19155; I just happened to have access to TPUs to verify the fix and polish the rough edges of the original PR.

lsy323 (Collaborator, Author) commented Jun 9, 2025

The tpu_runner tests are passing in this CI run: https://buildkite.com/vllm/ci/builds/21712#019755c0-7bb1-43d1-b2f0-d836658ecc01

lsy323 (Collaborator, Author) commented Jun 9, 2025

Hi @mgoin, could you please take a look at this fix? This should fix the tpu model runner tests. Thanks!

mgoin (Member) left a comment


Great improvements here :)

mgoin merged commit 7d44c46 into vllm-project:main Jun 9, 2025
56 checks passed
lsy323 deleted the lsiyuan/fix-kv-cache-test branch June 9, 2025 23:13

Labels

  • ready: ONLY add when PR is ready to merge/full CI is needed
  • tpu: Related to Google TPUs
  • v1
