@tpopp tpopp commented Oct 24, 2025

Purpose

Support for VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE was added to most models but not to Llama4. This commit adds it, consistent with the other models. The same fusion of three Triton kernels plus the following reshape_and_cache kernel shows speedups elsewhere because these kernels are usually launch-latency bound rather than compute bound, so replacing several small launches with one fused launch pays off.
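For context, a minimal sketch of how a boolean environment-variable gate like this one is typically wired up. This is illustrative only, not the actual vLLM code; the helper names and the `fused_kernel`/`unfused_kernels` parameters are hypothetical.

```python
import os

def use_fused_rope_kv_cache() -> bool:
    # Read the flag exactly as the test plan sets it: "1" enables the
    # fused path, anything else (or unset) keeps the default unfused path.
    return os.environ.get(
        "VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE", "0"
    ) == "1"

def apply_rope_and_cache(q, k, fused_kernel, unfused_kernels):
    # When launch latency dominates, one fused kernel launch beats a
    # sequence of small ones even if total compute is unchanged.
    if use_fused_rope_kv_cache():
        return fused_kernel(q, k)
    for kernel in unfused_kernels:
        q, k = kernel(q, k)
    return q, k
```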

Test Plan

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE={0,1} HIP_VISIBLE_DEVICES=0,2,4,5 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --host localhost --port 8000 --swap-space 64 --disable-log-requests --dtype auto --max-model-len 12024 --tensor-parallel-size 4 --max-num-seqs 1024 --distributed-executor-backend mp --kv-cache-dtype fp8 --gpu-memory-utilization 0.4 --max-seq-len-to-capture 16384 --max-num-batched-tokens 131072 --no-enable-prefix-caching --async-scheduling

lm_eval --model local-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048 --tasks gsm8k --num_fewshot 5 --batch_size 64 --limit 2000 --apply_chat_template

vllm bench serve --host localhost --port 8000 --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --dataset-name random --random-input-len 1024 --random-output-len 8192 --max-concurrency 16 --num-prompts 96 --ignore-eos

Test Result

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=0

local-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 2000.0, num_fewshot: 5, batch_size: 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9598|±  |0.0054|
|     |       |strict-match    |     5|exact_match|↑  |0.8734|±  |0.0092|


============ Serving Benchmark Result ============
Successful requests:                     96
Maximum request concurrency:             16
Benchmark duration (s):                  870.26
Total input tokens:                      98024
Total generated tokens:                  786432
Request throughput (req/s):              0.11
Output token throughput (tok/s):         903.68
Total Token throughput (tok/s):          1016.31
---------------Time to First Token----------------
Mean TTFT (ms):                          14656.88
Median TTFT (ms):                        7410.19
P99 TTFT (ms):                           53137.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.71
Median TPOT (ms):                        15.40
P99 TPOT (ms):                           19.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.71
Median ITL (ms):                         8.71
P99 ITL (ms):                            9.43
==================================================

VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=1

local-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 2000.0, num_fewshot: 5, batch_size: 64
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9568|±  |0.0056|
|     |       |strict-match    |     5|exact_match|↑  |0.8620|±  |0.0095|



============ Serving Benchmark Result ============
Successful requests:                     96
Maximum request concurrency:             16
Benchmark duration (s):                  723.48
Total input tokens:                      98024
Total generated tokens:                  786432
Request throughput (req/s):              0.13
Output token throughput (tok/s):         1087.01
Total Token throughput (tok/s):          1222.50
---------------Time to First Token----------------
Mean TTFT (ms):                          11019.44
Median TTFT (ms):                        2612.46
P99 TTFT (ms):                           45040.24
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.23
Median TPOT (ms):                        12.34
P99 TPOT (ms):                           16.37
---------------Inter-token Latency----------------
Mean ITL (ms):                           12.23
Median ITL (ms):                         8.83
P99 ITL (ms):                            9.17
==================================================
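Summarizing the two runs above, the flag-on run improves the headline metrics by roughly 17–20%. A quick calculation over the reported numbers (values copied directly from the benchmark output):

```python
# Flag off vs. flag on, from the two Serving Benchmark Result blocks above.
duration_off, duration_on = 870.26, 723.48   # benchmark duration, seconds
tput_off, tput_on = 903.68, 1087.01          # output token throughput, tok/s
tpot_off, tpot_on = 14.71, 12.23             # mean TPOT, ms

throughput_gain = tput_on / tput_off - 1     # relative tok/s improvement
duration_cut = 1 - duration_on / duration_off  # relative runtime reduction
tpot_cut = 1 - tpot_on / tpot_off            # relative mean-TPOT reduction

print(f"throughput +{throughput_gain:.1%}, "
      f"duration -{duration_cut:.1%}, TPOT -{tpot_cut:.1%}")
# → throughput +20.3%, duration -16.9%, TPOT -16.9%
```

GSM8K accuracy is unchanged within the reported stderr (0.9598 vs. 0.9568 flexible-extract), so the speedup does not come at a quality cost.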

@tpopp force-pushed the llama4_fused_rope branch from b65124d to 44ef111 on October 24, 2025 at 14:20.