[Benchmark] add a benchmark for hf/vllm/sglang rmsnorm #2486
Motivation
When deploying qwen2-7b, I compared the nsys profiles of vllm and sglang and found that `rmsnorm` was somewhat slower in vllm. To make it easy to verify `rmsnorm` kernel performance across different input tensor sizes, I am submitting this benchmark script. It can test the `rmsnorm` kernels with or without residual connections via the `--use_residual` flag. Below are the results obtained on an NVIDIA GeForce RTX 4090 GPU. As shown, the flashinfer `rmsnorm` kernels achieved much better performance across almost all shapes specified in the script.

`rmsnorm` without residual benchmark result:

`rmsnorm` with residual benchmark result:
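
For readers unfamiliar with the two modes, here is a minimal, dependency-free sketch of the math being benchmarked. It is not the code from this PR or from any of the compared libraries. The function name `rms_norm`, the default `eps=1e-6`, and the use of plain Python lists are assumptions made for illustration only. The residual branch follows the usual fused add-then-normalize pattern: the residual is added to the input first, and the sum is returned as the updated residual.

```python
import math


def rms_norm(x, weight, eps=1e-6, residual=None):
    """Reference RMSNorm over a single hidden vector (illustrative only).

    x, weight, residual: lists of floats with the same length.
    eps: assumed stabilizer value, not taken from the PR.
    """
    if residual is not None:
        # "With residual" mode: add the residual before normalizing.
        x = [xi + ri for xi, ri in zip(x, residual)]
    # Root mean square of the (possibly residual-added) input.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [w * v / rms for v, w in zip(x, weight)]
    # In residual mode, also return the summed input as the new residual.
    return out if residual is None else (out, x)


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
    print(rms_norm([1.0, 2.0], [1.0, 1.0], residual=[2.0, 2.0]))
```

The fused kernels compared in the benchmark do this work on GPU tensors, typically in half precision, so their outputs can differ from this sketch by rounding.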