[Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) #26268
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces tuned Triton configurations for Fused-MoE kernels to optimize performance for Qwen3-30B models on H100 GPUs. While the intent is to provide specific configs for FP8 and BF16 data types with expert counts of 64 and 128, a critical issue has been identified: all four newly added JSON configuration files are identical. This suggests a copy-paste error, which would lead to using incorrect and non-optimal kernel parameters for at least three of the four scenarios, potentially degrading performance instead of improving it. It is crucial to replace the duplicated content with the correctly tuned configurations for each specific file.
...utor/layers/fused_moe/configs/E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
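For context, each of these config files maps a token batch size M to a set of Triton launch parameters (block sizes, group size, num_warps, num_stages). A quick way to confirm the four new tables really differ is to compare their contents and hashes; a minimal sketch, with the file list taken from the PR description and the check itself being illustrative only, not part of the PR:

```python
import hashlib
import json
from pathlib import Path

# Config directory touched by this PR (path as listed in the PR description).
CONFIG_DIR = Path("vllm/model_executor/layers/fused_moe/configs")

FILES = [
    "E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json",
    "E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json",
    "E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json",
    "E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json",
]

digests = {}
for name in FILES:
    path = CONFIG_DIR / name
    table = json.loads(path.read_text())  # keys: batch size M, values: kernel launch params
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
    digests.setdefault(digest, []).append(name)
    print(f"{name}: {len(table)} batch sizes, sha256={digest}")

dupes = {d: names for d, names in digests.items() if len(names) > 1}
if dupes:
    print("WARNING: identical config tables:", dupes)
else:
    print("All four config tables are distinct.")
```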
Force-pushed from b0146ac to 086ce44
- Add JSONs for E=64 and E=128, N=8960 with device_name=NVIDIA_H100_80GB_HBM3
- Verified loader pickup on H100 80GB (logs show “Using configuration from …” for fp8_w8a8 and bf16)

Refs: vllm-project#22294
Signed-off-by: Shivam <shivampr.dev@gmail.com>
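For what it's worth, the loader locates these tables purely by filename, so pickup can also be sanity-checked offline by reconstructing the expected name from the runtime parameters. A minimal sketch, with the naming convention inferred from the filenames added in this PR (the helper below is illustrative, not a vLLM API):

```python
from pathlib import Path

import torch


def expected_config_name(E: int, N: int, dtype: str | None) -> str:
    """Build the expected fused-MoE config filename, mirroring the
    E=...,N=...,device_name=...,dtype=....json pattern of the files in this PR."""
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    suffix = f",dtype={dtype}" if dtype else ""
    return f"E={E},N={N},device_name={device_name}{suffix}.json"


config_dir = Path("vllm/model_executor/layers/fused_moe/configs")
for E in (64, 128):
    for dtype in ("bf16", "fp8_w8a8"):
        name = expected_config_name(E=E, N=8960, dtype=dtype)
        status = "found" if (config_dir / name).exists() else "MISSING"
        print(f"{name} -> {status}")
```

On an H100 80GB the device name resolves to NVIDIA_H100_80GB_HBM3, which is why the four filenames above match the loader's lookup.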
Force-pushed from 086ce44 to 14699ba
…BF16 & FP8); per-(E,dtype) distinct tables

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Force-pushed from 7c2e05c to fde213e
Hi @mgoin ,
… H100 (FP8/BF16) (vllm-project#26268) Signed-off-by: Shivam <shivampr.dev@gmail.com>
… H100 (FP8/BF16) (vllm-project#26268) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
… H100 (FP8/BF16) (vllm-project#26268) Signed-off-by: Shivam <shivampr.dev@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
This PR adds tuned Fused-MoE Triton configs for Qwen3-30B A3 (E=64) and A3B (E=128) on NVIDIA H100 80GB, covering both BF16 and FP8 (fp8_w8a8). It's part of #22294.

New files (under vllm/model_executor/layers/fused_moe/configs/):
- E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
- E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json
- E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
- E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json

Each table is tuned per (E, dtype) and keeps only the “turning-point” batch sizes to avoid bloat. For FP8, I used zero-initialized weights with unit scales (PyTorch can't randn float8 yet); this doesn't affect which kernel gets picked, since selection depends only on shapes and launch parameters.

Test Plan
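As noted above, the FP8 tuning inputs were zero-initialized weights with unit scales. A minimal sketch of how such tensors can be constructed, assuming a recent PyTorch build with float8_e4m3fn support (the shapes below are placeholders, not the exact tuning shapes):

```python
import torch

# Placeholder dimensions: E experts, an intermediate size n, a hidden size k.
E, n, k = 128, 8960, 2048  # illustrative only; real sizes come from the model config

# torch.randn doesn't support float8 dtypes, so materialize in bf16 and cast.
w = torch.zeros(E, n, k, dtype=torch.bfloat16, device="cuda").to(torch.float8_e4m3fn)

# Unit per-tensor scales: dequantized values equal the raw FP8 values.
w_scale = torch.ones(E, dtype=torch.float32, device="cuda")
```

Since the config lookup keys on shapes and launch parameters only, the weight values themselves don't influence which table entry is selected.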
Env: H100 80GB (SXM) · CUDA 12.8 · PyTorch 2.8.0 · vLLM wheel (with Triton)
What I verified
Repro
Loader logs (proof of pickup)