[TPU] Increase block size and reset block shapes #16458

bythew3i · 2025-04-11T06:21:38Z

Increase kv cache block size and reset kernel block shapes based on autotuned results from kernel.
But still need to retune the kernel block shapes in kernel.

Note: we should wait for pytorch/xla#9041 to be checkin and update new torch_xla version in requirements.txt

Benchmarked without cache:

v6e-1 (single chip): 7.87 -> 8.37 req / sec

Benchmarking script:

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct  --disable-log-requests --port 8003 --gpu-memory-utilization 0.95 --max-num-batched-tokens 512 --tensor-parallel-size 1 --max-model-len 2048 --max_num_seqs 512 &> /tmp/serve.log &

python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --port=8003 --random-input-len 1800 --random-output-len 128

Before:

============ Serving Benchmark Result ============
Successful requests:                     987       
Benchmark duration (s):                  125.47    
Total input tokens:                      1776600   
Total generated tokens:                  118669    
Request throughput (req/s):              7.87      
Output token throughput (tok/s):         945.80    
Total Token throughput (tok/s):          15105.47  
---------------Time to First Token----------------
Mean TTFT (ms):                          61168.93  
Median TTFT (ms):                        60913.25  
P99 TTFT (ms):                           121404.06 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.70     
Median TPOT (ms):                        31.98     
P99 TPOT (ms):                           32.56     
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.69     
Median ITL (ms):                         31.93     
P99 ITL (ms):                            33.08     
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     987       
Benchmark duration (s):                  117.96    
Total input tokens:                      1776600   
Total generated tokens:                  118669    
Request throughput (req/s):              8.37      
Output token throughput (tok/s):         1006.00   
Total Token throughput (tok/s):          16066.94  
---------------Time to First Token----------------
Mean TTFT (ms):                          57649.43  
Median TTFT (ms):                        57545.37  
P99 TTFT (ms):                           113943.35 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          29.78     
Median TPOT (ms):                        29.96     
P99 TPOT (ms):                           30.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           29.76     
Median ITL (ms):                         29.97     
P99 ITL (ms):                            30.93     
==================================================

v6e-8 (multi chip): 4.92 -> 5.42 req / sec

VLLM_USE_V1=1 vllm serve "meta-llama/Llama-3.1-70B" --download_dir "/root/.cache" --disable-log-requests --tensor_parallel_size=8 --max-model-len=2048 --gpu-memory-utilization 0.95 --max-num-batched-tokens 512 --max_num_seqs 512  &> /tmp/serve.log &


python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-70B --dataset-name random --port=8003 --random-input-len 1800 --random-output-len 128

Before:

============ Serving Benchmark Result ============
Successful requests:                     987       
Benchmark duration (s):                  200.42    
Total input tokens:                      1776600   
Total generated tokens:                  111817    
Request throughput (req/s):              4.92      
Output token throughput (tok/s):         557.91    
Total Token throughput (tok/s):          9422.22   
---------------Time to First Token----------------
Mean TTFT (ms):                          98990.46  
Median TTFT (ms):                        99069.74  
P99 TTFT (ms):                           195396.73 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.12     
Median TPOT (ms):                        51.44     
P99 TPOT (ms):                           52.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.27     
Median ITL (ms):                         51.38     
P99 ITL (ms):                            53.50     
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     987       
Benchmark duration (s):                  182.04    
Total input tokens:                      1776600   
Total generated tokens:                  111445    
Request throughput (req/s):              5.42      
Output token throughput (tok/s):         612.19    
Total Token throughput (tok/s):          10371.49  
---------------Time to First Token----------------
Mean TTFT (ms):                          89566.91  
Median TTFT (ms):                        89369.13  
P99 TTFT (ms):                           177012.56 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.24     
Median TPOT (ms):                        46.64     
P99 TPOT (ms):                           47.39     
---------------Inter-token Latency----------------
Mean ITL (ms):                           46.45     
Median ITL (ms):                         46.59     
P99 ITL (ms):                            48.17     
==================================================

github-actions · 2025-04-11T06:21:47Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

vllm/v1/attention/backends/pallas.py

alexm-redhat · 2025-04-11T17:11:52Z

@bythew3i which model did you test on single and multi-chip setups?

alexm-redhat · 2025-04-11T17:12:23Z

vllm/platforms/tpu.py

256 block size seems a bit aggressive. Maybe you can try the sharegpt (it has average short prompts) benchmark and see if you don't see a regression.

And usually the page_size should not be larger than max_model_len.

re: @alexm-redhat I modified the code to calculate the block size based on the max-model-len. PTAL.

BTW, can you please share the cmds used for sharegpt benchmarking?

re: @yaochengji now the way how we choose page size should handle this. PTAL get_page_size in pallas.py

bythew3i · 2025-04-25T08:17:30Z

@bythew3i which model did you test on single and multi-chip setups?

Hi @alexm-redhat, sorry for the late reply. I benchmarked meta-llama/Llama-3.1-8B-Instruct and meta-llama/Llama-3.1-70B. I also updated benchmarking cmds in the PR description.

bythew3i · 2025-04-25T08:23:25Z

CC: @yarongmu-google @bvrockwell

yaochengji

LGTM, when CI is green. Thanks for the contribution.

vllm/v1/attention/backends/pallas.py

bythew3i · 2025-04-26T00:30:30Z

QQ: what is the cmd to format the code?

yaochengji · 2025-04-26T00:51:43Z

QQ: what is the cmd to format the code?

You can use pre-commit run --all-files, from https://docs.vllm.ai/en/stable/contributing/overview.html#testing

Sometimes there's still some lines not formatted, you can install a ruff plungin in vscode and format selected

bythew3i · 2025-05-01T05:48:34Z

@WoosukKwon @alexm-redhat PTAL! Thanks!

lsy323 · 2025-05-01T05:50:01Z

cc @mgoin

Akshat-Tripathi · 2025-05-01T13:17:53Z

vllm/v1/attention/backends/pallas.py

Hi, not a review, but this is interesting information, is there anywhere I can find it online?

Hi Akshat, thanks for asking! I can not find any TPU's SREGs number documented anywhere publicly. So I think it is better to not mention this in the comments.

Oh ok thanks!

mgoin

In trying to run a quick test I accidentally ran V0. It seems this PR breaks V0 by not specifying the block_size in that flow

  File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/utils.py", line 2463, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/worker/worker_base.py", line 594, in init_worker
    self.worker = worker_class(**kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/worker/tpu_worker.py", line 51, in __init__
    self.model_runner: TPUModelRunner = TPUModelRunner(
                                        ^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/worker/tpu_model_runner.py", line 111, in __init__
    self.max_num_blocks_per_seq = (self.model_config.max_model_len //
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'

        self.max_num_blocks_per_seq = (self.model_config.max_model_len //
                                       self.block_size)

vllm/platforms/tpu.py

bythew3i · 2025-05-02T05:38:25Z

@mgoin PTAL! Thanks!

Signed-off-by: Jevin Jiang <jevin0change@gmail.com>

mgoin

LGTM thank you!

bythew3i · 2025-05-05T19:40:05Z

@mgoin Can you please help merge this PR?

mgoin · 2025-05-05T20:10:01Z

@bythew3i the TPU V1 sampler test is failing https://buildkite.com/vllm/ci/builds/19212#01969083-3aed-45d6-9738-f4d601113fd5/6-1707
Can you see if you can reproduce locally? It is concerning that the outputs are all "!"

bythew3i · 2025-05-06T17:45:46Z

@bythew3i the TPU V1 sampler test is failing https://buildkite.com/vllm/ci/builds/19212#01969083-3aed-45d6-9738-f4d601113fd5/6-1707 Can you see if you can reproduce locally? It is concerning that the outputs are all "!"

@mgoin Is this the right cmd to test?

VLLM_USE_V1=1 pytest tests/v1/tpu/test_sampler.py

I tested on local... it also failed at main branch... let me pull the latest change to see if the failure still exist

bythew3i · 2025-05-06T17:52:45Z

@bythew3i the TPU V1 sampler test is failing https://buildkite.com/vllm/ci/builds/19212#01969083-3aed-45d6-9738-f4d601113fd5/6-1707 Can you see if you can reproduce locally? It is concerning that the outputs are all "!"

The error seems not related to this PR... It fails at HEAD on main branch @mgoin

* [Model] Add GraniteMoeHybrid 4.0 model (vllm-project#17497) Signed-off-by: Thomas Ortner <boh@zurich.ibm.com> Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com> Co-authored-by: Thomas Ortner <boh@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> * [easy] Fix logspam on PiecewiseBackend errors (vllm-project#17138) Signed-off-by: rzou <zou3519@gmail.com> * [Bugfix] Fixed prompt length for random dataset (vllm-project#17408) Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com> * [Doc] Update notes for H2O-VL and Gemma3 (vllm-project#17219) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Misc] Fix ScalarType float4 naming (vllm-project#17690) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> * Fix `dockerfilegraph` pre-commit hook (vllm-project#17698) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Fix triton import with local TritonPlaceholder (vllm-project#17446) Signed-off-by: Mengqing Cao <cmq0113@163.com> * [V1] Enable TPU V1 backend by default (vllm-project#17673) Signed-off-by: mgoin <mgoin64@gmail.com> * [V1][PP] Support PP for MultiprocExecutor (vllm-project#14219) Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: jiang.li <jiang1.li@intel.com> * [v1] AttentionMetadata for each layer (vllm-project#17394) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [Feat] Add deprecated=True to CLI args (vllm-project#17426) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> * [Docs] Use gh-file to add links to tool_calling.md (vllm-project#17709) Signed-off-by: windsonsea <haifeng.yao@daocloud.io> * [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager (vllm-project#17479) Signed-off-by: Chen Zhang <zhangch99@outlook.com> * [doc] Add RAG Integration example (vllm-project#17692) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> * [Bugfix] Fix modality limits in vision language example (vllm-project#17721) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Make right sidebar more readable in "Supported Models" (vllm-project#17723) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [TPU] Increase block size and reset block shapes (vllm-project#16458) * [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` (vllm-project#16839) Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal> Signed-off-by: dtransposed <> Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal> * [Bugfix] Fix for the condition to accept empty encoder inputs for mllama (vllm-project#17732) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> * [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode (vllm-project#16828) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> --------- Signed-off-by: Thomas Ortner <boh@zurich.ibm.com> Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com> Signed-off-by: rzou <zou3519@gmail.com> Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: jiang.li <jiang1.li@intel.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: windsonsea <haifeng.yao@daocloud.io> Signed-off-by: reidliu41 <reid201711@gmail.com> Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal> Signed-off-by: dtransposed <> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com> Co-authored-by: Stan Wozniak <77159600+s3woz@users.noreply.github.com> Co-authored-by: Thomas Ortner <boh@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Richard Zou <zou3519@users.noreply.github.com> Co-authored-by: Mikhail Podvitskii <podvitskiymichael@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Michael Yao <haifeng.yao@daocloud.io> Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com> Co-authored-by: reidliu41 <reid201711@gmail.com> Co-authored-by: Jevin Jiang <jevin0change@gmail.com> Co-authored-by: d.transposed <damian.bogunowicz@gmail.com> Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

bythew3i requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners April 11, 2025 06:21

mergify bot added v1 tpu Related to Google TPUs labels Apr 11, 2025

yaochengji reviewed Apr 11, 2025

View reviewed changes

vllm/v1/attention/backends/pallas.py Outdated Show resolved Hide resolved

alexm-redhat reviewed Apr 11, 2025

View reviewed changes

yaochengji approved these changes Apr 25, 2025

View reviewed changes

vanbasten23 reviewed Apr 25, 2025

View reviewed changes

vllm/v1/attention/backends/pallas.py Outdated Show resolved Hide resolved

vanbasten23 reviewed Apr 25, 2025

View reviewed changes

vllm/v1/attention/backends/pallas.py Outdated Show resolved Hide resolved

bythew3i force-pushed the ragged-jevinjiang branch 2 times, most recently from 8d890ed to 677fc5f Compare April 30, 2025 22:51

mergify bot added the ci/build label Apr 30, 2025

Akshat-Tripathi reviewed May 1, 2025

View reviewed changes

mgoin reviewed May 1, 2025

View reviewed changes

vllm/platforms/tpu.py Outdated Show resolved Hide resolved

mergify bot added the documentation Improvements or additions to documentation label May 2, 2025

bythew3i force-pushed the ragged-jevinjiang branch from 66696a6 to 284c605 Compare May 2, 2025 05:36

Set pagesize based on max-model-len

b6e7d0b

Signed-off-by: Jevin Jiang <jevin0change@gmail.com>

bythew3i force-pushed the ragged-jevinjiang branch from 284c605 to b6e7d0b Compare May 2, 2025 05:46

bythew3i requested a review from mgoin May 2, 2025 05:50

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label May 2, 2025

mgoin approved these changes May 2, 2025

View reviewed changes

mgoin merged commit 621ca2c into vllm-project:main May 6, 2025
71 checks passed

bythew3i mentioned this pull request May 7, 2025

[TPU] Fix the test_sampler #17820

Merged

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

[TPU] Increase block size and reset block shapes (vllm-project#16458)

8bc1b68

Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025

[TPU] Increase block size and reset block shapes (vllm-project#16458)

ac21ec5

QiliangCui mentioned this pull request May 17, 2025

[TPU] Calculate block size only when not set. #18292

Closed

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

[TPU] Increase block size and reset block shapes (vllm-project#16458)

b085bbc

Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

Uh oh!

[TPU] Increase block size and reset block shapes #16458

[TPU] Increase block size and reset block shapes #16458

Uh oh!

Conversation

bythew3i commented Apr 11, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

v6e-1 (single chip): 7.87 -> 8.37 req / sec

v6e-8 (multi chip): 4.92 -> 5.42 req / sec

Uh oh!

github-actions bot commented Apr 11, 2025

Uh oh!

Uh oh!

alexm-redhat commented Apr 11, 2025

Uh oh!

alexm-redhat Apr 11, 2025

Choose a reason for hiding this comment

Uh oh!

yaochengji Apr 11, 2025

Choose a reason for hiding this comment

Uh oh!

bythew3i Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bythew3i Apr 25, 2025

Choose a reason for hiding this comment

Uh oh!

bythew3i commented Apr 25, 2025

Uh oh!

bythew3i commented Apr 25, 2025

Uh oh!

yaochengji left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bythew3i commented Apr 26, 2025

Uh oh!

yaochengji commented Apr 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bythew3i commented May 1, 2025

Uh oh!

lsy323 commented May 1, 2025

Uh oh!

Akshat-Tripathi May 1, 2025

Choose a reason for hiding this comment

Uh oh!

bythew3i May 2, 2025

Choose a reason for hiding this comment

Uh oh!

Akshat-Tripathi May 2, 2025

Choose a reason for hiding this comment

Uh oh!

mgoin left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bythew3i commented May 2, 2025

Uh oh!

mgoin left a comment

Choose a reason for hiding this comment

Uh oh!

bythew3i commented May 5, 2025

Uh oh!

mgoin commented May 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bythew3i commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bythew3i commented May 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

bythew3i commented Apr 11, 2025 •

edited by github-actions bot

Loading

bythew3i Apr 25, 2025 •

edited

Loading

yaochengji commented Apr 26, 2025 •

edited

Loading

mgoin left a comment •

edited

Loading

mgoin commented May 5, 2025 •

edited

Loading

bythew3i commented May 6, 2025 •

edited

Loading