
Conversation

@zklapow
Contributor

@zklapow zklapow commented Oct 14, 2025

Purpose

See title: add tuned MoE kernel configurations for GLM-4.6-FP8 and GLM-4.5-Air on NVIDIA B200.

I ran the following:

benchmark_moe.py --model zai-org/GLM-4.5-Air --dtype auto --tp 4 --enable-expert-parallel --tune
benchmark_moe.py --model zai-org/GLM-4.5-Air --dtype auto --tp 2 --enable-expert-parallel --tune
benchmark_moe.py --model zai-org/GLM-4.6-FP8 --dtype fp8_w8a8 --tp 4 --enable-expert-parallel --tune
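
For reference, the tuning run writes its results to JSON files whose names encode the shard-local shapes (the E= and N= values discussed in the review below). A minimal sketch of that naming scheme, assuming the pattern used by the files in this PR (the real helper inside vLLM may add fields such as a block shape for block-quantized FP8):

```python
# Illustrative sketch of the tuned-config filename scheme, not the actual
# vLLM helper. E is the number of experts local to one rank, N the per-rank
# intermediate size.
def moe_config_file_name(E: int, N: int, device_name: str, dtype: str | None = None) -> str:
    dtype_part = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device_name}{dtype_part}.json"

# Values taken from the GLM-4.5-Air tp=4 file discussed below.
print(moe_config_file_name(64, 1408, "NVIDIA_B200"))
# E=64,N=1408,device_name=NVIDIA_B200.json
```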

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@zklapow zklapow requested a review from mgoin as a code owner October 14, 2025 16:00
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds new MoE kernel tuning configurations for GLM 4.6-FP8 and GLM 4.5 Air on NVIDIA B200 GPUs. The configurations for GLM 4.6-FP8 (tp=4) and GLM 4.5 Air (tp=2) appear consistent with the benchmark commands provided. However, I've identified a significant inconsistency in the configuration file for GLM 4.5 Air with tp=4, which may be misnamed or generated incorrectly. This could impact performance and needs to be addressed.

@@ -0,0 +1,147 @@
{

high

There appears to be an inconsistency in this configuration file's name and content when compared with the benchmark command from the pull request description for zai-org/GLM-4.5-Air with tp=4.

Here's a breakdown of the inconsistencies, assuming GLM-4.5-Air has 64 experts:

  • Number of Experts (E): With tp=4 and expert parallelism enabled, each rank should handle 64 / 4 = 16 experts. The E parameter in the filename, which seems to represent the number of local experts based on other files in this PR, should be 16, but it is 64. For comparison, the tp=2 configuration for the same model correctly uses E=32 (64 / 2 = 32).
  • Intermediate Size (N): The intermediate size N is 1408 in this file, which is the same as in the tp=2 configuration. Since the intermediate size is typically sharded across tensor parallel ranks, the value of N should be different for tp=2 and tp=4. If N=1408 is correct for tp=2 (implying a total intermediate size of 1408 * 2 = 2816), then for tp=4 it should be 2816 / 4 = 704.

This suggests the file might be misnamed or generated from an incorrect configuration, which could lead to suboptimal performance or errors. Please double-check how this file was generated.
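
The expected per-rank values from the reviewer's arithmetic can be sketched as follows (assuming, as the comment above does, 64 total experts and a total intermediate size of 1408 * 2 = 2816 inferred from the tp=2 file; these figures are not taken from the model config):

```python
# Reproduces the reviewer's reasoning above: experts are split across ranks
# with expert parallelism, and the intermediate size is sharded across TP ranks.
TOTAL_EXPERTS = 64              # assumed, per the comment above
TOTAL_INTERMEDIATE = 1408 * 2   # implied by N=1408 in the tp=2 config

for tp in (2, 4):
    local_experts = TOTAL_EXPERTS // tp
    per_rank_n = TOTAL_INTERMEDIATE // tp
    print(f"tp={tp}: expected E={local_experts}, N={per_rank_n}")

# tp=2: expected E=32, N=1408  (matches the tp=2 file in this PR)
# tp=4: expected E=16, N=704   (the file under review has E=64, N=1408)
```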

Member

@mgoin mgoin left a comment

Do you have any benchmarks showing that these configs improve performance?

@zklapow
Contributor Author

zklapow commented Oct 14, 2025

Yes, though the improvement is only very slight:

Without tuning:

============ Serving Benchmark Result ============
Successful requests:                     36
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  62.88
Total input tokens:                      288000
Total generated tokens:                  36000
Request throughput (req/s):              0.57
Output token throughput (tok/s):         572.55
Peak output token throughput (tok/s):    792.00
Peak concurrent requests:                36.00
Total Token throughput (tok/s):          5152.95
---------------Time to First Token----------------
Mean TTFT (ms):                          8425.51
Median TTFT (ms):                        8411.25
P99 TTFT (ms):                           15872.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.17
Median TPOT (ms):                        54.23
P99 TPOT (ms):                           61.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.26
Median ITL (ms):                         47.21
P99 ITL (ms):                            91.23
==================================================

with tuning:

============ Serving Benchmark Result ============
Successful requests:                     36
Request rate configured (RPS):           10000.00
Benchmark duration (s):                  62.77
Total input tokens:                      288000
Total generated tokens:                  36000
Request throughput (req/s):              0.57
Output token throughput (tok/s):         573.54
Peak output token throughput (tok/s):    810.00
Peak concurrent requests:                36.00
Total Token throughput (tok/s):          5161.84
---------------Time to First Token----------------
Mean TTFT (ms):                          8435.97
Median TTFT (ms):                        8413.40
P99 TTFT (ms):                           15903.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.06
Median TPOT (ms):                        54.12
P99 TPOT (ms):                           61.15
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.16
Median ITL (ms):                         47.14
P99 ITL (ms):                            99.60
==================================================
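
To put "very slight" in numbers, a quick comparison of the two single runs above (no new measurements, just the figures already posted):

```python
# Relative change from the untuned to the tuned run, using the headline
# numbers from the two benchmark reports above.
baseline = {"output_tok_s": 572.55, "total_tok_s": 5152.95, "mean_tpot_ms": 54.17}
tuned    = {"output_tok_s": 573.54, "total_tok_s": 5161.84, "mean_tpot_ms": 54.06}

for key in baseline:
    delta = (tuned[key] - baseline[key]) / baseline[key] * 100
    print(f"{key}: {delta:+.2f}%")

# output_tok_s: +0.17%
# total_tok_s:  +0.17%
# mean_tpot_ms: -0.20%
```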

This is for GLM-4.6-FP8; I seem to have lost my results for -Air (and the machine I have is currently in use),

using the following config:

model: zai-org/GLM-4.6-FP8
host: "0.0.0.0"
port: 8002
served-model-name: glm-4-6
tensor-parallel-size: 4
gpu-memory-utilization: 0.95
distributed-executor-backend: mp
tool-call-parser: glm45
reasoning-parser: glm45
enable-auto-tool-choice: true
enable-expert-parallel: true
async-scheduling: true

enable-prefix-caching: false

max-num-seqs: 256
enable-chunked-prefill: true
max-num-batched-tokens: 16384

@vllm-bot vllm-bot merged commit aba48f7 into vllm-project:main Oct 14, 2025
7 checks passed
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Nov 12, 2025