[TPU] Optimize kv cache update kernel #20415
Conversation
Summary of Changes
Hello @tengyifei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly optimizes the KV cache update kernel for TPU, particularly on v6e, by implementing advanced memory management strategies. The core improvements involve pipelining data transfers (DMAs) and dynamically adjusting the processing block size to maximize hardware utilization and throughput, resulting in a notable performance boost.
Highlights
- **Pipelined DMA:** The KV cache update kernel (`pallas_kv_cache_update.py`) has been refactored to pipeline asynchronous memory copies (DMAs) into and out of VMEM. Overlapping the transfers, instead of waiting for all input copies to complete before starting any output copies, significantly improves throughput.
- **Dynamic optimal slice sizing:** A new helper, `_get_num_slices_per_kv_cache_update_block`, dynamically determines the optimal number of slices to process per kernel program instance. The calculation considers the VMEM usage limit (32 MiB) and the page size, rounds up to the next power of 2, and caps the value at 64 slices to avoid the performance regressions observed at higher values (e.g., 128 slices on v6e); a sketch of this heuristic follows the list.
- **Throughput improvement:** These optimizations raise the throughput of the KV cache update kernel on v6e from 28.15 GB/s to 100.57 GB/s, as demonstrated by the reference data.
- **Code refactoring and robustness:** The hardcoded `NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK` constant has been removed and replaced with the dynamically calculated value throughout `tpu_model_runner.py`. In addition, the handling of `kv_cache_dtype` has been made more robust to ensure it is always a `torch.dtype` object.
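The slice-sizing heuristic described above is easy to state in code. Below is a minimal sketch reconstructed from this summary (32 MiB VMEM budget, power-of-2 rounding, 64-slice cap); the actual helper in `vllm/v1/worker/tpu_model_runner.py` may differ in its exact budget, signature, and rounding details.

```python
def _get_num_slices_per_kv_cache_update_block(page_size_bytes: int) -> int:
    """Sketch of the slice-count heuristic; reconstructed, not the PR's code."""
    vmem_limit_bytes = 32 * 1024 * 1024  # VMEM budget named in the summary

    # How many page-sized slices fit in the VMEM staging buffer.
    num_slices = vmem_limit_bytes // page_size_bytes
    assert num_slices > 0, "page size exceeds the VMEM budget"

    # Round up to the next power of 2 (per the summary above), then cap at
    # 64 slices, since microbenchmarks on v6e showed regressions at 128.
    num_slices = 1 << (num_slices - 1).bit_length()
    return min(num_slices, 64)
```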
Code Review
This pull request introduces a significant performance optimization for the TPU KV cache update kernel by pipelining DMA transfers and dynamically selecting the number of slices to copy. The changes are well-implemented and the logic for the optimization is sound.
I have one suggestion regarding code style and maintainability for the newly added helper function in vllm/v1/worker/tpu_model_runner.py to improve its clarity and consistency with the existing codebase.
Overall, this is a great improvement.
cc @yaochengji
Thanks for your contribution, Yifei! Left a few comments.
LGTM, thanks!
Could you sign the DCO?
No need to sign the DCO as we can override, but please fix the pre-commit.
Done
Done
Purpose
=======

This PR improves the throughput of the kv cache update kernel from 28.15 GB/s to 100.57 GB/s on v6e. It adds two optimizations on top of vllm-project#19928:

- Pipeline the DMAs into and out of VMEM
- Pick the optimal number of slices to copy per kernel program instance

Reference data demonstrating the improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb

Test plan
=========

Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py

Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine

Test result
===========

PASSED

Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Purpose
This PR improves the throughput of the kv cache update kernel from 28.15 GB/s to 91.65 GB/s on v6e.
This PR adds an optimization on top of #19928: it picks the optimal number of slices to copy per kernel program instance, based on results from microbenchmarks.
Reference data demonstrating improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb
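For context, a throughput figure like the ones quoted here is typically obtained with a small timing harness. The following is a generic sketch of such a harness (my reconstruction, not the notebook's code; `measure_gbps` and its bytes-moved accounting are assumptions):

```python
import time

import jax

def measure_gbps(fn, *args, iters: int = 100) -> float:
    """Report the effective copy throughput of `fn` in GB/s."""
    jax.block_until_ready(fn(*args))  # warm up and force compilation
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    jax.block_until_ready(out)  # wait for the last dispatch to finish
    elapsed = time.perf_counter() - start
    # For a copy kernel, each input byte is read once and written once.
    bytes_per_call = 2 * sum(a.size * a.dtype.itemsize for a in args)
    return bytes_per_call * iters / elapsed / 1e9
```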
An earlier commit in this PR also pipelined the DMAs into and out of VMEM, resulting in more in-flight DMAs. That improved throughput to 100 GB/s in microbenchmarks but actually decreased performance in `python3 ./benchmarks/benchmark_serving.py`. I'm not sure why this happens. In any case, I omitted that optimization from the latest commit because what matters is the end-to-end benchmark.
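For readers curious what that pipelining looked like, here is a minimal, hypothetical Pallas sketch of the pattern (not the PR's kernel; the ref names and the `pl.pallas_call` wiring that would supply the VMEM scratch buffer and DMA semaphore arrays are assumed):

```python
from jax.experimental.pallas import tpu as pltpu

def _pipelined_copy_kernel(src_hbm, dst_hbm, vmem_buf, in_sems, out_sems):
    num_slices = vmem_buf.shape[0]
    # Issue every HBM -> VMEM input copy up front so they run concurrently.
    for i in range(num_slices):
        pltpu.make_async_copy(src_hbm.at[i], vmem_buf.at[i],
                              in_sems.at[i]).start()
    # As each input copy completes, immediately start the corresponding
    # VMEM -> HBM output copy instead of waiting for all inputs first.
    for i in range(num_slices):
        pltpu.make_async_copy(src_hbm.at[i], vmem_buf.at[i],
                              in_sems.at[i]).wait()
        pltpu.make_async_copy(vmem_buf.at[i], dst_hbm.at[i],
                              out_sems.at[i]).start()
    # Drain the remaining output copies before the program ends.
    for i in range(num_slices):
        pltpu.make_async_copy(vmem_buf.at[i], dst_hbm.at[i],
                              out_sems.at[i]).wait()
```

The key point is that each output DMA starts as soon as its slice lands in VMEM, so input and output transfers stay in flight simultaneously.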
Test plan
Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py
Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine
Test result
PASSED
Llama 3.1 70B perf