Closed
Description
Hi DeepGemm team,
I have integrated the new version of DeepGEMM on SM100.
The kernel-level performance tests look good, but the end-to-end throughput is considerably lower than with Triton.
VLLM_USE_DEEP_GEMM=1 vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enforce-eager --enable-expert-parallel --quantization fp8
Throughput: 26.34 requests/s, 28916.28 total tokens/s, 2634.07 output tokens/s
vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enforce-eager --enable-expert-parallel --quantization fp8
Throughput: 36.65 requests/s, 40270.79 total tokens/s, 3665.06 output tokens/s
VLLM_USE_DEEP_GEMM=1 vllm bench throughput --model deepseek-ai/DeepSeek-R1 --load-format dummy --input-len 32 --output-len 128 --trust_remote_code --enforce-eager -tp 8 --enable-expert-parallel --no-enable-prefix-caching
Throughput: 23.89 requests/s, 3821.89 total tokens/s, 3058.29 output tokens/s
# NO deepgemm
Throughput: 42.59 requests/s, 6811.08 total tokens/s, 5451.01 output tokens/s
Do you know why this happens and how we could solve it?
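To put a number on the gap, here is a minimal Python sketch that computes the relative drop in requests/s from the figures reported above. The names `results` and `slowdown_percent` are illustrative only and not part of vLLM or DeepGEMM; "baseline" refers to the runs without `VLLM_USE_DEEP_GEMM=1`.

```python
# Requests/s copied from the benchmark outputs quoted in this issue.
# "baseline" is the run without VLLM_USE_DEEP_GEMM=1.
results = {
    "Qwen/Qwen3-30B-A3B-FP8": {"deep_gemm": 26.34, "baseline": 36.65},
    "deepseek-ai/DeepSeek-R1": {"deep_gemm": 23.89, "baseline": 42.59},
}


def slowdown_percent(deep_gemm: float, baseline: float) -> float:
    """Percentage by which the DeepGEMM run trails the baseline run."""
    return (1.0 - deep_gemm / baseline) * 100.0


for model, r in results.items():
    pct = slowdown_percent(r["deep_gemm"], r["baseline"])
    print(f"{model}: {pct:.1f}% fewer requests/s with VLLM_USE_DEEP_GEMM=1")
```

With these numbers, the drop is roughly 28% for Qwen3-30B-A3B-FP8 and roughly 44% for DeepSeek-R1.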