Low E2E throughput performance on SM100 #118

Hi DeepGemm team,

https://github.com/vllm-project/vllm/pull/19820

In the PR above I integrate the new version of DeepGemm on SM100.

The kernel-level performance test looks good, but the end-to-end (E2E) throughput is quite a bit lower than with Triton.

```bash
# Qwen3-30B-A3B-FP8, with DeepGemm
VLLM_USE_DEEP_GEMM=1 vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enforce-eager --enable-expert-parallel --quantization fp8
Throughput: 26.34 requests/s, 28916.28 total tokens/s, 2634.07 output tokens/s
# Qwen3-30B-A3B-FP8, no DeepGemm
vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enforce-eager --enable-expert-parallel --quantization fp8
Throughput: 36.65 requests/s, 40270.79 total tokens/s, 3665.06 output tokens/s

# DeepSeek-R1, with DeepGemm
VLLM_USE_DEEP_GEMM=1 vllm bench throughput --model deepseek-ai/DeepSeek-R1 --load-format dummy --input-len 32 --output-len 128 --trust_remote_code --enforce-eager -tp 8 --enable-expert-parallel --no-enable-prefix-caching
Throughput: 23.89 requests/s, 3821.89 total tokens/s, 3058.29 output tokens/s
# DeepSeek-R1, no DeepGemm
Throughput: 42.59 requests/s, 6811.08 total tokens/s, 5451.01 output tokens/s
```
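To put a number on the gap, here is a rough sketch that expresses each DeepGemm run as a percentage of the corresponding run without it, using the requests/s figures from the output above. `rel_pct` is a hypothetical helper written only for this comparison; it is not part of vLLM or DeepGemm.

```bash
# Hypothetical helper (not part of vLLM or DeepGemm):
# prints the first argument as a percentage of the second, to one decimal place.
rel_pct() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

# requests/s figures copied from the benchmark output above
echo "Qwen3-30B-A3B-FP8: $(rel_pct 26.34 36.65)% of the no-DeepGemm throughput"
echo "DeepSeek-R1: $(rel_pct 23.89 42.59)% of the no-DeepGemm throughput"
```

By this measure the DeepGemm runs reach about 72% (Qwen3-30B-A3B-FP8) and 56% (DeepSeek-R1) of the throughput measured without DeepGemm.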

Do you know why this happens and how we could fix it?
