[Benchmark] add a benchmark for hf/vllm/sglang rmsnorm #2486

BBuf · 2024-12-15T05:40:57Z

Motivation

When deploying qwen2-7b, I compared the nsys profiles of vllm and sglang and found that rmsnorm in vllm was somewhat slower. To make it easy to check rmsnorm kernel performance across input tensor sizes, this PR adds a benchmark script. The script benchmarks the rmsnorm kernels with or without the residual connection; pass the --use_residual flag to include it. The results below were obtained on an NVIDIA GeForce RTX 4090 GPU. The flashinfer rmsnorm kernels are the fastest on almost all shapes specified in the script: well ahead of the HuggingFace implementation throughout, ahead of vllm without residual, and roughly on par with vllm with residual.
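
For readers unfamiliar with the two modes, here is a minimal pure-Python sketch of what an rmsnorm computes with and without a residual input. It is not the code from the benchmark script: the function name `rms_norm_reference`, the list-based inputs, and the default `eps` are assumptions made for illustration. The residual handling (add first, then normalize, and return the sum as the new residual) follows the convention I understand fused add-rmsnorm kernels to use.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6, residual=None):
    """Reference rmsnorm over a single hidden vector (hypothetical helper).

    x, weight, residual are equal-length lists of floats.
    Without residual: returns the normalized vector.
    With residual: returns (normalized vector, updated residual).
    """
    if residual is not None:
        # Assumed fused-add convention: add the residual before normalizing
        # and hand the sum back as the residual for the next layer.
        x = [a + r for a, r in zip(x, residual)]
        residual = x

    # Root mean square over the hidden dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)

    out = [v * inv_rms * w for v, w in zip(x, weight)]
    return out if residual is None else (out, residual)


if __name__ == "__main__":
    print(rms_norm_reference([3.0, 4.0], [1.0, 1.0]))
    print(rms_norm_reference([1.0, 2.0], [1.0, 1.0], residual=[2.0, 2.0]))
```

The real kernels operate on whole batches of half-precision tensors on the GPU; this sketch only pins down the arithmetic being timed.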

rmsnorm without residual

python3 benchmark/kernels/rmsnorm/benchmark_rmsnorm.py

benchmark result:

✅ All implementations match
rmsnorm-performance-without-residual:
    head_num  batch_size  seq_len   HuggingFace   FlashInfer         VLLM
0       32.0         1.0     64.0     30.719999     9.216000    10.240000
1       32.0         1.0    128.0     35.840001    12.288000    13.312000
2       32.0         1.0    256.0     41.983999    14.336000    16.384000
3       32.0         1.0    512.0     60.416002    20.479999    25.599999
4       32.0         1.0   1024.0     97.280003    34.816001    41.983999
5       32.0         4.0     64.0     43.008000    13.312000    16.384000
6       32.0         4.0    128.0     60.416002    20.479999    25.599999
7       32.0         4.0    256.0     97.280003    34.816001    41.983999
8       32.0         4.0    512.0    177.151993    60.416002    70.656002
9       32.0         4.0   1024.0    632.831991   113.664001   130.048007
10      32.0        16.0     64.0     97.280003    34.816001    41.983999
11      32.0        16.0    128.0    177.151993    59.392001    70.656002
12      32.0        16.0    256.0    632.831991   113.664001   130.048007
13      32.0        16.0    512.0   1468.415976   299.008012   345.088005
14      32.0        16.0   1024.0   2933.759928   589.824021   662.527978
15      32.0        64.0     64.0    632.831991   113.664001   130.048007
16      32.0        64.0    128.0   1467.391968   300.031990   345.088005
17      32.0        64.0    256.0   2933.759928   589.824021   662.527978
18      32.0        64.0    512.0   5834.239960  1174.528003  1305.600047
19      32.0        64.0   1024.0  11636.735916  2344.959974  2589.695930
20      48.0         1.0     64.0     33.792000    10.240000    12.288000
21      48.0         1.0    128.0     40.959999    13.312000    16.384000
22      48.0         1.0    256.0     53.247999    17.408000    21.504000
23      48.0         1.0    512.0     80.895998    28.672000    33.792000
24      48.0         1.0   1024.0    139.264002    47.104001    55.296000
25      48.0         4.0     64.0     53.247999    17.408000    21.504000
26      48.0         4.0    128.0     80.895998    28.672000    33.792000
27      48.0         4.0    256.0    139.264002    47.104001    55.296000
28      48.0         4.0    512.0    382.975996    87.040000    99.327996
29      48.0         4.0   1024.0   1071.104050   189.439997   252.927989
30      48.0        16.0     64.0    139.264002    47.104001    55.296000
31      48.0        16.0    128.0    382.975996    87.040000    99.327996
32      48.0        16.0    256.0   1072.128057   189.439997   252.927989
33      48.0        16.0    512.0   2214.399815   444.415987   492.543995
34      48.0        16.0   1024.0   4390.912056   881.663978   964.608014
35      48.0        64.0     64.0   1072.128057   190.464005   252.927989
36      48.0        64.0    128.0   2213.887930   444.415987   492.543995
37      48.0        64.0    256.0   4390.912056   881.663978   964.608014
38      48.0        64.0    512.0   8738.816261  1757.184029  1902.591944
39      48.0        64.0   1024.0  17432.575226  3509.248018  3782.655954
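
To read the table as relative speedups rather than raw timings, a small helper like the one below can parse the printed output and divide each column by the FlashInfer column. This is a sketch and not part of the PR: the names `parse_rows` and `speedups` are hypothetical, and it assumes the output keeps the seven-column layout shown above. The two sample rows are copied from the table.

```python
def parse_rows(table_text):
    """Parse the printed benchmark table into a list of dicts.

    Assumes each data line has 7 whitespace-separated fields:
    index, head_num, batch_size, seq_len, HuggingFace, FlashInfer, VLLM.
    Header lines and anything else are skipped.
    """
    rows = []
    for line in table_text.strip().splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        head_num, batch_size, seq_len, hf, flashinfer, vllm = map(float, fields[1:])
        rows.append({
            "head_num": int(head_num),
            "batch_size": int(batch_size),
            "seq_len": int(seq_len),
            "hf": hf,
            "flashinfer": flashinfer,
            "vllm": vllm,
        })
    return rows


def speedups(row):
    """Return (HuggingFace / FlashInfer, VLLM / FlashInfer) timing ratios."""
    return row["hf"] / row["flashinfer"], row["vllm"] / row["flashinfer"]


SAMPLE = """
    head_num  batch_size  seq_len   HuggingFace   FlashInfer         VLLM
0       32.0         1.0     64.0     30.719999     9.216000    10.240000
19      32.0        64.0   1024.0  11636.735916  2344.959974  2589.695930
"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        hf_ratio, vllm_ratio = speedups(row)
        print(row["head_num"], row["batch_size"], row["seq_len"],
              f"HF/FlashInfer={hf_ratio:.2f}", f"VLLM/FlashInfer={vllm_ratio:.2f}")
```

A ratio above 1 means FlashInfer was faster on that shape. The same helper works on the with-residual output, since the layout is identical.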

rmsnorm with residual

python3 benchmark/kernels/rmsnorm/benchmark_rmsnorm.py --use_residual

benchmark result:

✅ All implementations match
rmsnorm-performance-with-residual:
    head_num  batch_size  seq_len   HuggingFace   FlashInfer         VLLM
0       32.0         1.0     64.0     41.983999    12.288000    13.312000
1       32.0         1.0    128.0     49.152002    17.408000    17.408000
2       32.0         1.0    256.0     60.416002    20.479999    21.504000
3       32.0         1.0    512.0     95.232002    35.312001    33.792000
4       32.0         1.0   1024.0    154.624000    57.344001    58.368001
5       32.0         4.0     64.0     60.416002    21.504000    21.504000
6       32.0         4.0    128.0     93.184002    34.816001    33.792000
7       32.0         4.0    256.0    154.624000    57.344001    58.368001
8       32.0         4.0    512.0    374.783993   102.399997   102.399997
9       32.0         4.0   1024.0   1134.592056   237.568006   238.591999
10      32.0        16.0     64.0    154.624000    57.344001    58.368001
11      32.0        16.0    128.0    369.664013   102.399997   101.375997
12      32.0        16.0    256.0   1134.592056   237.568006   239.616007
13      32.0        16.0    512.0   2484.224081   588.800013   592.895985
14      32.0        16.0   1024.0   4967.423916  1172.479987  1179.648042
15      32.0        64.0     64.0   1135.615945   237.568006   239.616007
16      32.0        64.0    128.0   2484.224081   588.800013   591.871977
17      32.0        64.0    256.0   4966.400146  1172.479987  1179.648042
18      32.0        64.0    512.0   9902.079582  2340.863943  2351.104021
19      32.0        64.0   1024.0  19777.023315  4678.656101  4694.015980
20      48.0         1.0     64.0     46.080001    14.336000    14.336000
21      48.0         1.0    128.0     57.344001    19.455999    19.455999
22      48.0         1.0    256.0     77.823997    25.599999    27.648000
23      48.0         1.0    512.0    126.975998    46.080001    46.080001
24      48.0         1.0   1024.0    243.711993    79.871997    80.895998
25      48.0         4.0     64.0     77.823997    26.624000    27.648000
26      48.0         4.0    128.0    126.975998    46.080001    45.056000
27      48.0         4.0    256.0    243.711993    78.847997    80.895998
28      48.0         4.0    512.0    753.664017   150.527999   150.527999
29      48.0         4.0   1024.0   1841.151953   435.200006   437.247992
30      48.0        16.0     64.0    243.711993    79.871997    80.895998
31      48.0        16.0    128.0    754.688025   149.504006   151.552007
32      48.0        16.0    256.0   1841.151953   434.175998   436.224014
33      48.0        16.0    512.0   3737.600088   881.663978   890.879989
34      48.0        16.0   1024.0   7436.287880  1756.160021  1771.520019
35      48.0        64.0     64.0   1840.127945   435.200006   437.247992
36      48.0        64.0    128.0   3737.600088   881.663978   890.879989
37      48.0        64.0    256.0   7437.312126  1756.160021  1771.520019
38      48.0        64.0    512.0  14835.712433  3506.175995  3525.631905
39      48.0        64.0   1024.0  29655.040741  7011.328220  7044.095993

benchmark/kernels/rmsnorm/benchmark_rmsnorm.py

zhyncs · 2024-12-15T05:52:27Z

@BBuf Thanks!

ywang96 · 2024-12-16T00:18:02Z

@BBuf This is great! Do you mind if I open a PR with you as co-author to add this script to vLLM repo as well? (if you don't have time to add it yourself)

BBuf · 2024-12-17T01:32:45Z

> @BBuf This is great! Do you mind if I open a PR with you as co-author to add this script to vLLM repo as well? (if you don't have time to add it yourself)

I don't mind, thank you for your attention.

BBuf added 2 commits December 15, 2024 13:32

add a benchmark for hf/vllm/sglang rmsnorm

2cf42b5

format

650fd24

zhyncs reviewed Dec 15, 2024

benchmark/kernels/rmsnorm/benchmark_rmsnorm.py (outdated, resolved)

upd

172ff51

zhyncs approved these changes Dec 15, 2024

zhyncs merged commit a0592c0 into sgl-project:main Dec 15, 2024
1 check passed

zhyncs mentioned this pull request Dec 15, 2024

[Feature] add kernel level benchmark #2402

Open

2 tasks

ywang96 mentioned this pull request Dec 16, 2024

[Misc] Kernel Benchmark for RMSNorm vllm-project/vllm#11241

Merged
