Conversation

@shivampr (Contributor) commented Oct 6, 2025

Purpose

This PR adds tuned Fused-MoE Triton configs for Qwen3-30B A3 (E=64) and A3B (E=128) on NVIDIA H100 80GB, covering both BF16 and FP8 (fp8_w8a8). It’s part of #22294.

New files (under vllm/model_executor/layers/fused_moe/configs/):

  • E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
  • E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json
  • E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json
  • E=128,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json

Each table is tuned per (E, dtype) and keeps only the batch sizes where the best config changes (the “turning points”), to avoid bloat. For FP8 I used zero-initialized weights with unit scales (torch.randn does not support float8 dtypes yet); this does not affect which kernel config gets picked, since selection depends only on shapes and launch params.
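
For reference, each JSON maps a batch size (the M dimension) to one set of Triton launch parameters. A minimal sketch of the table layout, written as a Python dict with placeholder values (not the actual tuned numbers from this PR):

example_table = {
    # batch size (M) -> Triton launch params for the fused-MoE kernels
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}

At lookup time the loader uses the stored batch size closest to the actual M, which is why keeping only the turning-point batches is enough.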

Test Plan

Env: H100 80GB (SXM) · CUDA 12.8 · PyTorch 2.8.0 · vLLM wheel (with Triton)

What I verified

  • Loader picks up the tuned JSONs for each (E, dtype).
  • If a tuned file is hidden, vLLM falls back to its default config.
  • The four JSONs contain distinct per-(E, dtype) tables, not copies of one another.

Repro

python - <<'PY'
import logging, torch
from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk, fused_experts

# Surface the "Using configuration from ..." loader messages.
logging.getLogger("vllm.model_executor.layers.fused_moe.fused_moe").setLevel(logging.INFO)

H, N, top_k, B = 7168, 8960, 8, 32  # hidden size, w1 output dim, experts per token, batch

def run(E, dtype):
    xdtype = torch.bfloat16 if dtype == "bf16" else torch.float16
    wdtype = xdtype if dtype == "bf16" else torch.float8_e4m3fn
    x = torch.randn(B, H, dtype=xdtype, device="cuda")
    # Zero-init weights: torch.randn does not support float8 dtypes, and config
    # selection depends only on shapes and launch params, not on the values.
    w1 = torch.zeros(E, N, H, dtype=wdtype, device="cuda")       # first GEMM (gate/up)
    w2 = torch.zeros(E, H, N // 2, dtype=wdtype, device="cuda")  # second GEMM (down)
    gate = torch.randn(B, E, dtype=torch.float32, device="cuda")
    topk_weights, topk_ids, _ = fused_topk(x, gate, top_k, renormalize=True)
    # NOTE: per the description above, the FP8 runs also used unit weight scales;
    # a full fp8_w8a8 invocation would pass them via quant_config (omitted here).
    fused_experts(x, w1, w2, topk_weights, topk_ids, inplace=True, quant_config=None)

for E in (64, 128):
    for dt in ("bf16", "fp8_w8a8"):
        print(f"\n=== E={E} {dt} ===")
        run(E, dt)
PY
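
To double-check which file names the loader will look for on a given GPU, a small helper call can be used. This is a hedged sketch: it assumes the installed vLLM exposes get_config_file_name(E, N, dtype) in fused_moe.py with this signature (it has shifted slightly across versions).

python - <<'PY'
from vllm.model_executor.layers.fused_moe.fused_moe import get_config_file_name

# Print the config file name vLLM will search for per (E, dtype) on this device.
# Assumes get_config_file_name(E, N, dtype) exists in the installed version.
for E in (64, 128):
    for dt in ("bf16", "fp8_w8a8"):
        print(get_config_file_name(E, 8960, dt))
PY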

Loader logs (proof of pickup)

INFO ... Using configuration from .../E=64,N=8960,...,dtype=bf16.json ...
INFO ... Using configuration from .../E=64,N=8960,...,dtype=fp8_w8a8.json ...
INFO ... Using configuration from .../E=128,N=8960,...,dtype=bf16.json ...
INFO ... Using configuration from .../E=128,N=8960,...,dtype=fp8_w8a8.json ...
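
The fallback check (bullet two under “What I verified”) was done by hiding a tuned file and re-running the repro. A minimal sketch of that step, assuming the configs directory of the installed wheel is writable; the file name is one of the four added by this PR:

python - <<'PY'
import os, shutil
from vllm.model_executor.layers.fused_moe import fused_moe

# The tuned tables ship next to fused_moe.py, under configs/ (see file list above).
cfg_dir = os.path.join(os.path.dirname(fused_moe.__file__), "configs")
name = "E=64,N=8960,device_name=NVIDIA_H100_80GB_HBM3,dtype=bf16.json"
path = os.path.join(cfg_dir, name)

shutil.move(path, path + ".hidden")   # hide the tuned table
# ... re-run the repro above: the "Using configuration from ..." line for this
# (E, dtype) should disappear and vLLM should fall back to its default config ...
shutil.move(path + ".hidden", path)   # restore it
PY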

@shivampr shivampr requested a review from mgoin as a code owner October 6, 2025 01:18
github-actions bot commented Oct 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify bot added the qwen (Related to Qwen models) label on Oct 6, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces tuned Triton configurations for Fused-MoE kernels to optimize performance for Qwen3-30B models on H100 GPUs. While the intent is to provide specific configs for FP8 and BF16 data types with expert counts of 64 and 128, a critical issue has been identified: all four newly added JSON configuration files are identical. This suggests a copy-paste error, which would lead to using incorrect and non-optimal kernel parameters for at least three of the four scenarios, potentially degrading performance instead of improving it. It is crucial to replace the duplicated content with the correctly tuned configurations for each specific file.

@shivampr shivampr force-pushed the feat/qwen3-h100-moe-configs branch from b0146ac to 086ce44 Compare October 6, 2025 01:22
- Add JSONs for E=64 and E=128, N=8960 with device_name=NVIDIA_H100_80GB_HBM3
- Verified loader pickup on H100 80GB (logs show “Using configuration from …” for fp8_w8a8 and bf16)

Refs: vllm-project#22294
Signed-off-by: Shivam <shivampr.dev@gmail.com>
@shivampr shivampr force-pushed the feat/qwen3-h100-moe-configs branch from 086ce44 to 14699ba Compare October 6, 2025 05:21
…BF16 & FP8); per-(E,dtype) distinct tables

Signed-off-by: Shivam <shivampr.dev@gmail.com>
@shivampr shivampr force-pushed the feat/qwen3-h100-moe-configs branch from 7c2e05c to fde213e Compare October 6, 2025 06:30
@shivampr (Contributor, Author) commented Oct 9, 2025

Hi @mgoin,
I am a first-time contributor to this repo. Since you are assigned to this review, I would love to get your feedback.

@vllm-bot vllm-bot merged commit 4d0f266 into vllm-project:main Oct 20, 2025
6 checks passed
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request Oct 20, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
faaany pushed a commit to faaany/vllm that referenced this pull request Oct 21, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 21, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Ther-LF pushed a commit to Ther-LF/vllm that referenced this pull request Oct 22, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
… H100 (FP8/BF16) (vllm-project#26268)

Signed-off-by: Shivam <shivampr.dev@gmail.com>