[TPU]Fix KV cache sharing tests #19371
Conversation
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hello @lsy323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, Gemini here to provide a summary of this pull request. This PR focuses on fixing and enabling several tests within the TPU model runner specifically related to KV cache sharing. The main goal is to ensure the tests accurately validate the logic for initializing and estimating KV cache size and maximum context length when KV cache sharing is configured, addressing issues that caused these tests to be skipped previously.
Highlights
- Test Unskipping: Five tests that were previously skipped on TPU have been unskipped: `test_init_kv_cache_with_kv_sharing_invalid_target_layer_order`, `test_init_kv_cache_with_kv_sharing_target_layer_not_exist`, `test_init_kv_cache_with_kv_sharing_target_same_as_current`, `test_init_kv_cache_without_kv_sharing`, and `test_init_kv_cache_with_kv_sharing_valid`.
- Test Setup Refactoring: The test setup in `test_tpu_model_runner.py` has been refactored. Module-level patching and cleanup fixtures for `torch_xla` and `PallasAttentionBackend` have been removed, and helper functions `get_vllm_config` and `get_model_runner` were introduced to standardize the creation of vLLM config and model runner instances within tests.
- KV Cache Calculation Corrections: The calculations and assertions within the KV cache initialization tests have been updated. This includes correcting the assumed page size for KV cache tensors (from 32KB to 512KB) and adjusting the expected number of blocks and maximum context length accordingly for both shared and non-shared KV cache scenarios (a back-of-the-envelope sketch follows this list).
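To make the corrected assertions concrete, here is a hedged back-of-the-envelope sketch of how block count and max context length follow from the page size (the memory budget and layer count below are illustrative assumptions, not values taken from the test):

```python
# Assumptions for illustration only; the real test derives these from
# the model and cache configs.
page_size_bytes = 512 * 1024               # corrected per-layer page size
available_kv_memory = 1024 * 1024 * 1024   # assumed 1 GiB KV cache budget
num_kv_layers = 2                          # assumed layers allocating KV
block_size = 128                           # tokens per block

# Each block costs one page per allocating layer.
num_blocks = available_kv_memory // (page_size_bytes * num_kv_layers)
max_context_len = num_blocks * block_size
print(num_blocks, max_context_len)         # 1024 blocks -> 131072 tokens
```

With KV sharing, fewer layers allocate their own cache, so the same budget yields proportionally more blocks and a longer estimated max context length.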
Changelog
- `tests/v1/tpu/worker/test_tpu_model_runner.py`
  - Removed unused import `unittest.mock`.
  - Removed module-level patching of `torch_xla` and `PallasAttentionBackend`.
  - Removed the `cleanup_patches` fixture.
  - Introduced `get_vllm_config` and `get_model_runner` helper functions for test setup.
  - Modified the `model_runner` pytest fixture to use the new helper functions.
  - Unskipped five tests related to KV cache sharing logic by removing `@pytest.mark.skip`.
  - Updated the unskipped tests to correctly use the `model_runner` fixture or explicitly create config/runner instances.
  - Added the `set_current_vllm_config` context manager to relevant tests (see the sketch after this list).
  - Corrected expected page size, block counts, and max context length calculations in KV cache tests.
  - Updated access to KV cache tensors in assertions from dictionary lookup to list indexing.
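The `set_current_vllm_config` pattern referenced above looks roughly like the following. This is a minimal sketch: `set_current_vllm_config` is vLLM's context manager for setting the active config, while `get_vllm_config` and `get_model_runner` are the PR's test helpers, whose exact signatures are assumed here.

```python
from vllm.config import set_current_vllm_config

vllm_config = get_vllm_config()          # PR helper (signature assumed)
with set_current_vllm_config(vllm_config):
    # Attention layers instantiated inside this context pick up the
    # active vLLM config, which the KV-sharing tests depend on.
    model_runner = get_model_runner(vllm_config)
```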
Code Review
This pull request focuses on fixing KV cache sharing tests for TPUs. The main changes involve removing global mocks, refactoring test fixtures, and updating test assertions to reflect new KV cache sizing calculations. The use of `set_current_vllm_config` appears to be a key part of the fix, ensuring that Attention layers are instantiated with the correct configuration context within the tests.
Overall, the changes look good and seem to make the tests runnable and more accurate. The updated comments explaining cache calculations are particularly helpful. There are a couple of minor points for clarification and cleanup.
Summary of Findings
- Mocking Strategy: Global mocks for `torch_xla` and `PallasAttentionBackend` were removed. This is good if the test environment guarantees their availability, but clarification of that assumption would be helpful.
- Test Logic and Clarity: Unskipping the tests and using `set_current_vllm_config` are positive changes. The updated KV cache size calculations and assertions, along with improved comments, enhance test accuracy and readability.
- Minor Cleanups: An outdated comment in the `model_runner` fixture and a couple of unused variables (`vllm_ctx`) were noted.
Merge Readiness
The pull request significantly improves the TPU KV cache sharing tests by making them runnable and updating their assertions. The refactoring of fixtures and removal of potentially problematic global mocks are good steps.
With minor clarifications on the mocking strategy and a few small cleanups (outdated comment, unused variables), this PR should be in good shape for merging. I am unable to approve the pull request, so please have other reviewers take a look and approve before merging.
```python
# page size for each layer KV can be calculated as
# 2 (non-MLA) * 8 (num_heads) * 128 (head_dim)
# * 2 (bfloat16, kv_cache dtype) * 128 (block_size) = 512KB
```
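The comment's arithmetic can be checked directly with a standalone sketch (variable names are illustrative, not vLLM identifiers):

```python
num_kv = 2         # key and value tensors (non-MLA attention)
num_heads = 8
head_dim = 128
dtype_bytes = 2    # bfloat16
block_size = 128   # tokens per block

page_size_bytes = num_kv * num_heads * head_dim * dtype_bytes * block_size
assert page_size_bytes == 512 * 1024  # 512KB per layer per block
```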
```python
for kv_cache_tensor in kv_cache_config.kv_cache_tensors:
    kv_cache_tensor.size = (
        kv_cache_spec[kv_cache_tensor.shared_by[0]].page_size_bytes)
```
The change from iterating `kv_cache_config.tensors` (which was a dict) to `kv_cache_config.kv_cache_tensors` (which is a list) and using `kv_cache_tensor.shared_by[0]` to get the layer name for `kv_cache_spec` looks correct given the structure of `KVCacheConfig` and `KVCacheTensor`. This aligns well with how KV cache tensors can be defined and potentially shared.
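For readers unfamiliar with the new structure, here is a simplified sketch of the data shapes involved. The field names follow vLLM v1's `KVCacheConfig`/`KVCacheTensor`, but the dataclasses below are stand-ins for illustration, not the real definitions:

```python
from dataclasses import dataclass, field

@dataclass
class KVCacheSpec:          # simplified stand-in
    page_size_bytes: int

@dataclass
class KVCacheTensor:        # simplified stand-in
    size: int                                        # tensor size in bytes
    shared_by: list[str] = field(default_factory=list)  # layer names sharing it

@dataclass
class KVCacheConfig:        # simplified stand-in
    kv_cache_tensors: list[KVCacheTensor] = field(default_factory=list)

# Two layers sharing one tensor (KV sharing), plus one unshared layer:
kv_cache_spec = {name: KVCacheSpec(page_size_bytes=512 * 1024)
                 for name in ("layer.0", "layer.1", "layer.2")}
cfg = KVCacheConfig(kv_cache_tensors=[
    KVCacheTensor(size=0, shared_by=["layer.0", "layer.1"]),
    KVCacheTensor(size=0, shared_by=["layer.2"]),
])

# All layers listed in `shared_by` have identical specs, so reading the
# first entry suffices when sizing each tensor:
for t in cfg.kv_cache_tensors:
    t.size = kv_cache_spec[t.shared_by[0]].page_size_bytes
```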
LGTM! thanks
The tpu_runner tests are passing in CI: https://buildkite.com/vllm/ci/builds/21712#019755c0-7bb1-43d1-b2f0-d836658ecc01
Hi @mgoin, could you please take a look at this fix? This should fix the TPU model runner tests. Thanks!
Great improvements here :)