
Conversation

@zklapow
Contributor

@zklapow zklapow commented Oct 14, 2025

Purpose

See title: add tuned MoE kernel configurations for GLM-4.6-FP8 and GLM-4.5-Air on NVIDIA B200.

I ran the following:

benchmark_moe.py --model zai-org/GLM-4.5-Air --dtype auto --tp 4 --enable-expert-parallel --tune
benchmark_moe.py --model zai-org/GLM-4.5-Air --dtype auto --tp 2 --enable-expert-parallel --tune
benchmark_moe.py --model zai-org/GLM-4.6-FP8 --dtype fp8_w8a8 --tp 4 --enable-expert-parallel --tune
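
For reference, the tuning run writes its results to JSON files whose names encode the shard-local shapes (the E= and N= values discussed in the review below). A minimal sketch of that naming scheme, assuming the pattern used by the files in this PR (the real helper inside vLLM may add fields such as a block shape for block-quantized FP8):

```python
# Illustrative sketch of the tuned-config filename scheme, not the actual
# vLLM helper. E is the number of experts local to one rank, N the per-rank
# intermediate size.
def moe_config_file_name(E: int, N: int, device_name: str, dtype: str | None = None) -> str:
    dtype_part = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device_name}{dtype_part}.json"

# Values taken from the GLM-4.5-Air tp=4 file discussed below.
print(moe_config_file_name(64, 1408, "NVIDIA_B200"))
# E=64,N=1408,device_name=NVIDIA_B200.json
```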

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@zklapow zklapow requested a review from mgoin as a code owner October 14, 2025 16:00
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds new MoE kernel tuning configurations for GLM 4.6-FP8 and GLM 4.5 Air on NVIDIA B200 GPUs. The configurations for GLM 4.6-FP8 (tp=4) and GLM 4.5 Air (tp=2) appear consistent with the benchmark commands provided. However, I've identified a significant inconsistency in the configuration file for GLM 4.5 Air with tp=4, which may be misnamed or generated incorrectly. This could impact performance and needs to be addressed.

@@ -0,0 +1,147 @@
{

high

There appears to be an inconsistency in this configuration file's name and content when compared with the benchmark command from the pull request description for zai-org/GLM-4.5-Air with tp=4.

Here's a breakdown of the inconsistencies, assuming GLM-4.5-Air has 64 experts:

  • Number of Experts (E): With tp=4 and expert parallelism enabled, each rank should handle 64 / 4 = 16 experts. The E parameter in the filename, which seems to represent the number of local experts based on other files in this PR, should be 16, but it is 64. For comparison, the tp=2 configuration for the same model correctly uses E=32 (64 / 2 = 32).
  • Intermediate Size (N): The intermediate size N is 1408 in this file, which is the same as in the tp=2 configuration. Since the intermediate size is typically sharded across tensor parallel ranks, the value of N should be different for tp=2 and tp=4. If N=1408 is correct for tp=2 (implying a total intermediate size of 1408 * 2 = 2816), then for tp=4 it should be 2816 / 4 = 704.

This suggests the file might be misnamed or generated from an incorrect configuration, which could lead to suboptimal performance or errors. Please double-check how this file was generated.
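
The expected per-rank values from the reviewer's arithmetic can be sketched as follows (assuming, as the comment above does, 64 total experts and a total intermediate size of 1408 * 2 = 2816 inferred from the tp=2 file; these figures are not taken from the model config):

```python
# Reproduces the reviewer's reasoning above: experts are split across ranks
# with expert parallelism, and the intermediate size is sharded across TP ranks.
TOTAL_EXPERTS = 64              # assumed, per the comment above
TOTAL_INTERMEDIATE = 1408 * 2   # implied by N=1408 in the tp=2 config

for tp in (2, 4):
    local_experts = TOTAL_EXPERTS // tp
    per_rank_n = TOTAL_INTERMEDIATE // tp
    print(f"tp={tp}: expected E={local_experts}, N={per_rank_n}")

# tp=2: expected E=32, N=1408  (matches the tp=2 file in this PR)
# tp=4: expected E=16, N=704   (the file under review has E=64, N=1408)
```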

Member

@mgoin mgoin left a comment

Do you have any benchmarks showing that these configs improve performance?

@zklapow
Contributor Author

zklapow commented Oct 14, 2025

Yes, though the improvement is only very slight:

Without tuning:

============ Serving Benchmark Result ============
Successful requests:                     36
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  62.88
Total input tokens:                      288000
Total generated tokens:                  36000
Request throughput (req/s):              0.57
Output token throughput (tok/s):         572.55
Peak output token throughput (tok/s):    792.00
Peak concurrent requests:                36.00
Total Token throughput (tok/s):          5152.95
---------------Time to First Token----------------
Mean TTFT (ms):                          8425.51
Median TTFT (ms):                        8411.25
P99 TTFT (ms):                           15872.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.17
Median TPOT (ms):                        54.23
P99 TPOT (ms):                           61.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.26
Median ITL (ms):                         47.21
P99 ITL (ms):                            91.23
==================================================

with tuning:

============ Serving Benchmark Result ============
Successful requests:                     36
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  62.77
Total input tokens:                      288000
Total generated tokens:                  36000
Request throughput (req/s):              0.57
Output token throughput (tok/s):         573.54
Peak output token throughput (tok/s):    810.00
Peak concurrent requests:                36.00
Total Token throughput (tok/s):          5161.84
---------------Time to First Token----------------
Mean TTFT (ms):                          8435.97
Median TTFT (ms):                        8413.40
P99 TTFT (ms):                           15903.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.06
Median TPOT (ms):                        54.12
P99 TPOT (ms):                           61.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.16
Median ITL (ms):                         47.14
P99 ITL (ms):                            99.60
==================================================
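
To put "very slight" in numbers, a quick comparison of the two single runs above (no new measurements, just the figures already posted):

```python
# Relative change from the untuned to the tuned run, using the headline
# numbers from the two benchmark reports above.
baseline = {"output_tok_s": 572.55, "total_tok_s": 5152.95, "mean_tpot_ms": 54.17}
tuned    = {"output_tok_s": 573.54, "total_tok_s": 5161.84, "mean_tpot_ms": 54.06}

for key in baseline:
    delta = (tuned[key] - baseline[key]) / baseline[key] * 100
    print(f"{key}: {delta:+.2f}%")

# output_tok_s: +0.17%
# total_tok_s:  +0.17%
# mean_tpot_ms: -0.20%
```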

This is for GLM-4.6-FP8; I seem to have lost my results for -Air (and the machine I have is currently in use),

using the following config:

model: zai-org/GLM-4.6-FP8
host: "0.0.0.0"
port: 8002
served-model-name: glm-4-6
tensor-parallel-size: 4
gpu-memory-utilization: 0.95
distributed-executor-backend: mp
tool-call-parser: glm45
reasoning-parser: glm45
enable-auto-tool-choice: true
enable-expert-parallel: true
async-scheduling: true

enable-prefix-caching: false

max-num-seqs: 256
enable-chunked-prefill: true
max-num-batched-tokens: 16384

@vllm-bot vllm-bot merged commit aba48f7 into vllm-project:main Oct 14, 2025
7 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025