
Conversation

@yaochengji
Collaborator

Description

Tests

  1. pytest -s -v tests/kernels/ragged_kv_cache_update_test.py
  2. server: TPU_BACKEND_TYPE=torchax VLLM_TORCHAX_ENABLED=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048 --max-num-seqs 128 --max-model-len 2048 --no-enable-prefix-caching --tensor_parallel_size=1
    client: python3 ./benchmarks/benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --dataset-name sonnet --dataset-path benchmarks/sonnet_4x.txt --sonnet-input-len 1800 --sonnet-output-len 128 --ignore_eos

We can observe that the throughput increases from 8.11 req/s to 8.14 req/s.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have added necessary comments to my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

Signed-off-by: Chengji Yao <chengjiyao@google.com>
@yaochengji requested a review from bythew3i July 31, 2025 21:53
@yaochengji
Collaborator Author

cc @lsy323 @xiangxu-google

Collaborator

@xiangxu-google left a comment

Thanks!

  num_kv_pages_per_block=None,
  num_queries_per_block=None,
- vmem_limit_bytes=None,
+ vmem_limit_bytes=100 * 1024 * 1024,
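
For context, a minimal, self-contained sketch (the helper name is hypothetical, not the actual vLLM call site) that just collects the three tuning knobs touched in the hunk above and shows what the new cap evaluates to:

def attention_kernel_tuning_kwargs() -> dict:
    """Illustration only: the attention-kernel tuning knobs changed by this PR."""
    return {
        # None lets the kernel/compiler pick its own block sizes.
        "num_kv_pages_per_block": None,
        "num_queries_per_block": None,
        # Explicit VMEM cap instead of the previous None (compiler default).
        "vmem_limit_bytes": 100 * 1024 * 1024,  # 104,857,600 bytes = 100 MiB
    }

if __name__ == "__main__":
    print(attention_kernel_tuning_kwargs())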
Collaborator

@bythew3i Aug 1, 2025

Nit: it would be better to test the throughput before and after this change. The kernel itself won't be affected, but the next op's prefetch will be.

Collaborator Author

It would crash due to vmem OOM before the change.

  page_size: int,
  num_slices_per_block: int,
  dynamic_validate_inputs: bool,
+ vmem_limit_bytes: int = 40 * 1024 * 1024,
Collaborator

How is this calculated?

Collaborator Author

Basically I'd like to have 32 MB of VMEM available for the scratch buffer, and it's rounded up to 40 MB.
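
To restate that sizing rationale as a tiny sketch (the constant names are hypothetical, not from the kernel code):

MIB = 1024 * 1024
SCRATCH_BUFFER_TARGET_BYTES = 32 * MIB  # desired VMEM scratch-buffer budget
DEFAULT_VMEM_LIMIT_BYTES = 40 * MIB     # default vmem_limit_bytes: scratch target plus headroom

assert DEFAULT_VMEM_LIMIT_BYTES > SCRATCH_BUFFER_TARGET_BYTES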

Collaborator

I would suggest doing autotuning and benchmarking in the Google workspace internally first, like we do for RPA and quantized_matmul.

Collaborator Author

Thanks for the suggestion! Since I'd like to get the KV cache update kernel on par with the vLLM torch/XLA path first, can we add the auto-tuning and benchmarking internally at Google later?

Collaborator

@bythew3i left a comment

LGTM - approving since the current change is blocking you.

@yaochengji merged commit a54ea5a into main Aug 1, 2025
1 of 3 checks passed