@tpopp tpopp commented Oct 24, 2025

Purpose

Support for VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE was added to most models but not to Llama4. This commit adds it, consistent with the other models. The same fusion of three Triton kernels plus the following reshape_and_cache kernel shows speedups elsewhere because these kernels are usually launch-latency bound rather than compute bound, so replacing several small launches with one fused launch pays off.
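For context, a minimal sketch of how a boolean environment-variable gate like this one is typically wired up. This is illustrative only, not the actual vLLM code; the helper names and the `fused_kernel`/`unfused_kernels` parameters are hypothetical.

```python
import os

def use_fused_rope_kv_cache() -> bool:
    # Read the flag exactly as the test plan sets it: "1" enables the
    # fused path, anything else (or unset) keeps the default unfused path.
    return os.environ.get(
        "VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE", "0"
    ) == "1"

def apply_rope_and_cache(q, k, fused_kernel, unfused_kernels):
    # When launch latency dominates, one fused kernel launch beats a
    # sequence of small ones even if total compute is unchanged.
    if use_fused_rope_kv_cache():
        return fused_kernel(q, k)
    for kernel in unfused_kernels:
        q, k = kernel(q, k)
    return q, k
```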

Test Plan

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE={0,1} HIP_VISIBLE_DEVICES=0,2,4,5 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --host localhost --port 8000 --swap-space 64 --disable-log-requests --dtype auto --max-model-len 12024 --tensor-parallel-size 4 --max-num-seqs 1024 --distributed-executor-backend mp --kv-cache-dtype fp8 --gpu-memory-utilization 0.4 --max-seq-len-to-capture 16384 --max-num-batched-tokens 131072 --no-enable-prefix-caching --async-scheduling

lm_eval --model local-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048 --tasks gsm8k --num_fewshot 5 --batch_size 64 --limit 2000 --apply_chat_template

vllm bench serve --host localhost --port 8000 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --dataset-name random --random-input-len 1024 --random-output-len 8192 --max-concurrency 16 --num-prompts 96 --ignore-eos

Test Result

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=0

local-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 2000.0, num_fewshot: 5, batch_size: 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9598|±  |0.0054|
|     |       |strict-match    |     5|exact_match|↑  |0.8734|±  |0.0092|


============ Serving Benchmark Result ============
Successful requests:                     96
Maximum request concurrency:             16
Benchmark duration (s):                  870.26
Total input tokens:                      98024
Total generated tokens:                  786432
Request throughput (req/s):              0.11
Output token throughput (tok/s):         903.68
Total Token throughput (tok/s):          1016.31
---------------Time to First Token----------------
Mean TTFT (ms):                          14656.88
Median TTFT (ms):                        7410.19
P99 TTFT (ms):                           53137.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.71
Median TPOT (ms):                        15.40
P99 TPOT (ms):                           19.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.71
Median ITL (ms):                         8.71
P99 ITL (ms):                            9.43
==================================================

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=1

local-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 2000.0, num_fewshot: 5, batch_size: 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.8620|±  |0.0095|



============ Serving Benchmark Result ============
Successful requests:                     96
Maximum request concurrency:             16
Benchmark duration (s):                  723.48
Total input tokens:                      98024
Total generated tokens:                  786432
Request throughput (req/s):              0.13
Output token throughput (tok/s):         1087.01
Total Token throughput (tok/s):          1222.50
---------------Time to First Token----------------
Mean TTFT (ms):                          11019.44
Median TTFT (ms):                        2612.46
P99 TTFT (ms):                           45040.24
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.23
Median TPOT (ms):                        12.34
P99 TPOT (ms):                           16.37
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.23
Median ITL (ms):                         8.83
P99 ITL (ms):                            9.17
==================================================
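Summarizing the two runs above, the flag-on run improves the headline metrics by roughly 17–20%. A quick calculation over the reported numbers (values copied directly from the benchmark output):

```python
# Flag off vs. flag on, from the two Serving Benchmark Result blocks above.
duration_off, duration_on = 870.26, 723.48   # benchmark duration, seconds
tput_off, tput_on = 903.68, 1087.01          # output token throughput, tok/s
tpot_off, tpot_on = 14.71, 12.23             # mean TPOT, ms

throughput_gain = tput_on / tput_off - 1     # relative tok/s improvement
duration_cut = 1 - duration_on / duration_off  # relative runtime reduction
tpot_cut = 1 - tpot_on / tpot_off            # relative mean-TPOT reduction

print(f"throughput +{throughput_gain:.1%}, "
      f"duration -{duration_cut:.1%}, TPOT -{tpot_cut:.1%}")
# → throughput +20.3%, duration -16.9%, TPOT -16.9%
```

GSM8K accuracy is unchanged within the reported stderr (0.9598 vs. 0.9568 flexible-extract), so the speedup does not come at a quality cost.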

@tpopp force-pushed the llama4_fused_rope branch from b65124d to 44ef111 on October 24, 2025 at 14:20.