[TPU] Optimize kv cache update kernel #20415
Conversation
Summary of Changes
Hello @tengyifei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly optimizes the KV cache update kernel for TPU, particularly on v6e, by implementing advanced memory management strategies. The core improvements involve pipelining data transfers (DMAs) and dynamically adjusting the processing block size to maximize hardware utilization and throughput, resulting in a notable performance boost.
Highlights
- **Pipelined DMA:** The KV cache update kernel (`pallas_kv_cache_update.py`) has been refactored to pipeline asynchronous memory copies (DMAs) into and out of VMEM. Overlapping the transfers, instead of waiting for all input copies to complete before starting any output copies, significantly improves throughput.
- **Dynamic optimal slice sizing:** A new helper, `_get_num_slices_per_kv_cache_update_block`, dynamically determines the optimal number of slices to process per kernel program instance. The calculation considers the VMEM usage limit (32 MiB) and the page size, rounds up to the next power of 2, and caps the value at 64 slices to avoid the performance regressions observed at higher values (e.g., 128 slices on v6e); a sketch of this heuristic follows the list.
- **Throughput improvement:** These optimizations raise the throughput of the KV cache update kernel on v6e from 28.15 GB/s to 100.57 GB/s, as demonstrated by the reference data.
- **Code refactoring and robustness:** The hardcoded `NUM_SLICES_PER_KV_CACHE_UPDATE_BLOCK` constant has been removed and replaced with the dynamically calculated value throughout `tpu_model_runner.py`. In addition, the handling of `kv_cache_dtype` has been made more robust to ensure it is always a `torch.dtype` object.
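The slice-sizing heuristic described above is easy to state in code. Below is a minimal sketch reconstructed from this summary (32 MiB VMEM budget, power-of-2 rounding, 64-slice cap); the actual helper in `vllm/v1/worker/tpu_model_runner.py` may differ in its exact budget, signature, and rounding details.

```python
def _get_num_slices_per_kv_cache_update_block(page_size_bytes: int) -> int:
    """Sketch of the slice-count heuristic; reconstructed, not the PR's code."""
    vmem_limit_bytes = 32 * 1024 * 1024  # VMEM budget named in the summary

    # How many page-sized slices fit in the VMEM staging buffer.
    num_slices = vmem_limit_bytes // page_size_bytes
    assert num_slices > 0, "page size exceeds the VMEM budget"

    # Round up to the next power of 2 (per the summary above), then cap at
    # 64 slices, since microbenchmarks on v6e showed regressions at 128.
    num_slices = 1 << (num_slices - 1).bit_length()
    return min(num_slices, 64)
```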
Code Review
This pull request introduces a significant performance optimization for the TPU KV cache update kernel by pipelining DMA transfers and dynamically selecting the number of slices to copy. The changes are well-implemented and the logic for the optimization is sound.
I have one suggestion regarding code style and maintainability for the newly added helper function in vllm/v1/worker/tpu_model_runner.py to improve its clarity and consistency with the existing codebase.
Overall, this is a great improvement.
cc @yaochengji
Thanks for your contribution, Yifei! Left a few comments.
LGTM, thanks!
Could you sign the DCO?
No need to sign the DCO as we can override, but please fix the pre-commit.
Done
Done
Purpose
=======

This PR improves the throughput of the kv cache update kernel from 28.15 GB/s to 100.57 GB/s on v6e. It adds two optimizations on top of vllm-project#19928:

- Pipeline the DMAs into and out of VMEM
- Pick the optimal number of slices to copy per kernel program instance

Reference data demonstrating the improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb

Test plan
=========

Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py

Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine

Test result
===========

PASSED

Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
Purpose
This PR improves the throughput of the kv cache update kernel from 28.15 GB/s to 91.65 GB/s on v6e.
This PR adds an optimization on top of #19928: it picks the optimal number of slices to copy per kernel program instance, based on results from microbenchmarks.
Reference data demonstrating improvements: https://github.com/tengyifei/playground/blob/master/pallas/better-index-copy.ipynb
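For context, a throughput figure like the ones quoted here is typically obtained with a small timing harness. The following is a generic sketch of such a harness (my reconstruction, not the notebook's code; `measure_gbps` and its bytes-moved accounting are assumptions):

```python
import time

import jax

def measure_gbps(fn, *args, iters: int = 100) -> float:
    """Report the effective copy throughput of `fn` in GB/s."""
    jax.block_until_ready(fn(*args))  # warm up and force compilation
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    jax.block_until_ready(out)  # wait for the last dispatch to finish
    elapsed = time.perf_counter() - start
    # For a copy kernel, each input byte is read once and written once.
    bytes_per_call = 2 * sum(a.size * a.dtype.itemsize for a in args)
    return bytes_per_call * iters / elapsed / 1e9
```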
An earlier commit in this PR also pipelined the DMAs into and out of VMEM, resulting in more in-flight DMAs. That improved throughput to 100 GB/s in microbenchmarks but actually decreased performance in `python3 ./benchmarks/benchmark_serving.py`. I'm not sure why this happens. In any case, I omitted that optimization from the latest commit because what matters is the end-to-end benchmark.
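For readers curious what that pipelining looked like, here is a minimal, hypothetical Pallas sketch of the pattern (not the PR's kernel; the ref names and the `pl.pallas_call` wiring that would supply the VMEM scratch buffer and DMA semaphore arrays are assumed):

```python
from jax.experimental.pallas import tpu as pltpu

def _pipelined_copy_kernel(src_hbm, dst_hbm, vmem_buf, in_sems, out_sems):
    num_slices = vmem_buf.shape[0]
    # Issue every HBM -> VMEM input copy up front so they run concurrently.
    for i in range(num_slices):
        pltpu.make_async_copy(src_hbm.at[i], vmem_buf.at[i],
                              in_sems.at[i]).start()
    # As each input copy completes, immediately start the corresponding
    # VMEM -> HBM output copy instead of waiting for all inputs first.
    for i in range(num_slices):
        pltpu.make_async_copy(src_hbm.at[i], vmem_buf.at[i],
                              in_sems.at[i]).wait()
        pltpu.make_async_copy(vmem_buf.at[i], dst_hbm.at[i],
                              out_sems.at[i]).start()
    # Drain the remaining output copies before the program ends.
    for i in range(num_slices):
        pltpu.make_async_copy(vmem_buf.at[i], dst_hbm.at[i],
                              out_sems.at[i]).wait()
```

The key point is that each output DMA starts as soon as its slice lands in VMEM, so input and output transfers stay in flight simultaneously.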
Test plan
Kernel test: pytest -s -v tests/v1/tpu/test_kv_cache_update_kernel.py
Accuracy test: pytest -s -v tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine
Test result
PASSED
Llama 3.1 70B perf