Conversation

tengyifei (Contributor) commented Jul 3, 2025

Purpose

This PR improves the throughput of the kv cache update kernel from 28.15 GB/s to 91.65 GB/s on v6e.

This PR adds optimizations on top of #19928: it picks the optimal number of slices to copy per kernel program instance, based on microbenchmark results.

Reference data demonstrating improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb

An earlier commit in this PR also pipelined the DMAs into and out of VMEM, allowing more DMAs to be in flight. That raised throughput to 100 GB/s in microbenchmarks but actually decreased performance in python3 ./benchmarks/benchmark_serving.py; I'm not sure why. In any case, I omitted that optimization from the latest commit, because the end-to-end benchmark is what matters.

Test plan

Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py
Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine

Test result

PASSED

Llama 3.1 70B perf

vllm serve meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --max-model-len 2048 --no-enable-prefix-caching --tensor_parallel_size=8

python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-70B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet_4x.txt --sonnet-input-len 1800 --sonnet-output-len 128 --ignore_eos
  • Before: 6.73 reqs/s
  • This PR: 6.77 reqs/s
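A quick back-of-the-envelope comparison of the numbers above illustrates the author's point that the end-to-end benchmark is what matters: the kernel itself gets roughly 3.3x faster, but because the kv cache update is only a small slice of total serving time, request throughput moves by well under 1% (Amdahl's law). A minimal sketch using only the figures reported in this PR:

```python
# Speedups computed from the figures reported above. How much of the
# kernel-level gain shows up end to end depends on what fraction of
# total serving time the kv cache update kernel accounts for.
kernel_before_gbps = 28.15   # kernel throughput before this PR
kernel_after_gbps = 91.65    # kernel throughput with this PR (v6e)
e2e_before_rps = 6.73        # benchmark_serving.py, before
e2e_after_rps = 6.77         # benchmark_serving.py, this PR

kernel_speedup = kernel_after_gbps / kernel_before_gbps
e2e_speedup = e2e_after_rps / e2e_before_rps

print(f"kernel speedup: {kernel_speedup:.2f}x")                # ~3.26x
print(f"end-to-end gain: {(e2e_speedup - 1) * 100:.2f}%")      # ~0.59%
```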


github-actions bot commented Jul 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @tengyifei, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the KV cache update kernel for TPU, particularly on v6e, by implementing advanced memory management strategies. The core improvements involve pipelining data transfers (DMAs) and dynamically adjusting the processing block size to maximize hardware utilization and throughput, resulting in a notable performance boost.

Highlights

  • Performance Optimization: Pipelined DMA: The KV cache update kernel (pallas_kv_cache_update.py) has been refactored to pipeline asynchronous memory copies (DMAs) into and out of VMEM. This allows for overlapping data transfers, significantly improving throughput by not waiting for all input copies to complete before starting any output copies.
  • Dynamic Optimal Slice Sizing: A new mechanism has been introduced to dynamically determine the optimal number of slices to process per kernel program instance. This calculation (_get_num_slices_per_kv_cache_update_block) considers VMEM usage limits (32 MiB) and page size, rounding up to the next power of 2, and capping the value at 64 slices to prevent performance regressions observed at higher values (e.g., 128 slices on v6e).
  • Throughput Improvement: These optimizations have led to a substantial improvement in the throughput of the KV cache update kernel on v6e, increasing from 28.15 GB/s to 100.57 GB/s, as demonstrated by reference data.
  • Code Refactoring and Robustness: The hardcoded NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK constant has been removed and replaced with the dynamically calculated value throughout the tpu_model_runner.py. Additionally, the handling of kv_cache_dtype has been made more robust to ensure it's always a torch.dtype object.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added v1 tpu Related to Google TPUs labels Jul 3, 2025
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a significant performance optimization for the TPU KV cache update kernel by pipelining DMA transfers and dynamically selecting the number of slices to copy. The changes are well-implemented and the logic for the optimization is sound.

I have one suggestion regarding code style and maintainability for the newly added helper function in vllm/v1/worker/tpu_model_runner.py to improve its clarity and consistency with the existing codebase.

Overall, this is a great improvement.

@tengyifei force-pushed the optimize-index-copy branch 2 times, most recently from f9b07ef to 1a6c627, July 3, 2025 06:48
tengyifei (Contributor, author) commented:

cc @yaochengji

yaochengji (Collaborator) left a comment


Thanks for your contribution, Yifei! Left a few comments.

yaochengji (Collaborator) left a comment


LGTM, thanks!

Could you sign the DCO?

mgoin (Member) commented Jul 10, 2025

No need to sign the DCO as we can override, but please fix the pre-commit

@tengyifei force-pushed the optimize-index-copy branch from b8d0089 to 1d61b0a, July 13, 2025 01:54
tengyifei (Contributor, author) commented:

> Could you sign the DCO?

Done

> please fix the pre-commit

Done

Purpose
=======

This PR improves the throughput of the kv cache update kernel
from 28.15 GB/s to 100.57 GB/s on v6e.

This PR adds optimizations on top of
vllm-project#19928:

- Pipeline the DMAs into and out of VMEM
- Pick the optimal number of slices to copy per kernel program instance

Reference data demonstrating improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb

Test plan
=========

Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py
Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine

Test result
===========

PASSED

Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
@tengyifei force-pushed the optimize-index-copy branch from 1d61b0a to 22cdc7f, July 13, 2025 01:59
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
@tengyifei force-pushed the optimize-index-copy branch from 22cdc7f to 41586e2, July 13, 2025 02:25
@yaochengji yaochengji added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 13, 2025
@yaochengji yaochengji enabled auto-merge (squash) July 14, 2025 18:33
@vllm-bot vllm-bot merged commit c586b55 into vllm-project:main Jul 15, 2025
79 of 81 checks passed
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 27, 2025
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed tpu Related to Google TPUs v1


4 participants