Load tuned fused_moe_lora shrink and expand kernel configs separately #27272
Conversation
Code Review
This pull request introduces support for loading tuned kernel configurations for fused_moe_lora operations, separating shrink and expand configs to allow for more fine-grained performance tuning. The changes are extensive, touching LoRA layer logic, Triton ops, and the Punica wrapper. While the overall direction is good and enables significant performance improvements, I've identified a few critical issues that will cause runtime errors or incorrect behavior. These include incorrect attribute access that will lead to AttributeErrors, usage of incorrect tensor shapes for configuration, and an inconsistent default value that could lead to suboptimal performance. Addressing these issues is crucial for the feature to work as intended.
vllm/lora/layers/fused_moe.py (Outdated)

```python
    self.w13_weight.size(),
    self.w2_weight.size(),
    top_k,
    config_dtype,
    block_shape=self.quant_method.moe_quant_config.block_shape,
```
The code attempts to access w13_weight, w2_weight, and quant_method directly from self (the FusedMoEWithLoRA instance). However, these are attributes of the base_layer. This will lead to an AttributeError at runtime. You should access these attributes via self.base_layer.
Suggested change:

```diff
-    self.w13_weight.size(),
-    self.w2_weight.size(),
+    self.base_layer.w13_weight.size(),
+    self.base_layer.w2_weight.size(),
     top_k,
     config_dtype,
-    block_shape=self.quant_method.moe_quant_config.block_shape,
+    block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,
```
vllm/lora/layers/fused_moe.py (Outdated)

```python
    expand_config = get_lora_op_configs(
        op_type="fused_moe_lora_down_expand",
        max_loras=self.w1_lora_a_stacked.shape[0],
        batch=M,
        hidden_size=self.w1_lora_a_stacked.shape[-1],
        rank=self.w1_lora_a_stacked.shape[-2],
        num_slices=1,
        hidden_size_2=self.w1_lora_b_stacked.shape[-2]
    )
```
The configuration for fused_moe_lora_down_expand seems to be using shapes from w1_lora_..._stacked tensors, which are intended for the 'up' projection. For the 'down' projection handled in moe_sum_decorator, it should be using shapes from w2_lora_..._stacked tensors to ensure the correct dimensions are passed for kernel tuning.
Suggested change:

```diff
 expand_config = get_lora_op_configs(
     op_type="fused_moe_lora_down_expand",
-    max_loras=self.w1_lora_a_stacked.shape[0],
+    max_loras=self.w2_lora_a_stacked.shape[0],
     batch=M,
-    hidden_size=self.w1_lora_a_stacked.shape[-1],
-    rank=self.w1_lora_a_stacked.shape[-2],
+    hidden_size=self.w2_lora_a_stacked.shape[-1],
+    rank=self.w2_lora_a_stacked.shape[-2],
     num_slices=1,
-    hidden_size_2=self.w1_lora_b_stacked.shape[-2]
+    hidden_size_2=self.w2_lora_b_stacked.shape[-2]
 )
```
vllm/lora/layers/fused_moe.py (Outdated)

```python
    self.w13_weight.size(),
    self.w2_weight.size(),
    top_k,
    config_dtype,
    block_shape=self.quant_method.moe_quant_config.block_shape,
```
Similar to the issue in act_decorator, the code here attempts to access w13_weight, w2_weight, and quant_method from self instead of self.base_layer. This will cause an AttributeError. These attributes should be accessed through self.base_layer.
Suggested change:

```diff
-    self.w13_weight.size(),
-    self.w2_weight.size(),
+    self.base_layer.w13_weight.size(),
+    self.base_layer.w2_weight.size(),
     top_k,
     config_dtype,
-    block_shape=self.quant_method.moe_quant_config.block_shape,
+    block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,
```
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | ||
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | ||
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | ||
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | ||
| shrink_config.get("num_warps", 4), | ||
| shrink_config.get("num_stages", 1), | ||
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | ||
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | ||
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | ||
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | ||
| expand_config.get("num_warps", 4), | ||
| expand_config.get("num_stages", 1), | ||
| mul_routed_weight, |
There's an inconsistency in the default value for num_stages. Here, it defaults to 1 if not found in the config. However, the default configuration for fused_moe_lora ops in vllm/lora/ops/triton_ops/utils.py specifies num_stages as 3. This can lead to using a suboptimal number of stages if the tuned config file doesn't specify num_stages, potentially degrading performance. To ensure consistent behavior, the default value here should match the one in utils.py.
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | |
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | |
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | |
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | |
| shrink_config.get("num_warps", 4), | |
| shrink_config.get("num_stages", 1), | |
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | |
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | |
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | |
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | |
| expand_config.get("num_warps", 4), | |
| expand_config.get("num_stages", 1), | |
| mul_routed_weight, | |
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | |
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | |
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | |
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | |
| shrink_config.get("num_warps", 4), | |
| shrink_config.get("num_stages", 3), | |
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | |
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | |
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | |
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | |
| expand_config.get("num_warps", 4), | |
| expand_config.get("num_stages", 3), | |
| mul_routed_weight, |
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/lora/layers/fused_moe.py (Outdated)

```python
    else:  # fall back to the default config
        get_config_func = functools.partial(
            try_get_optimal_moe_config,
            self.w13_weight.size(),
            self.w2_weight.size(),
            top_k,
            config_dtype,
            block_shape=self.quant_method.moe_quant_config.block_shape,
```
Reference base-layer attributes instead of self in fallback path
Inside the decorators that compute default MoE configs, the new code now reads self.w13_weight, self.w2_weight, and self.quant_method. Those attributes live on the wrapped base layer (layer) rather than on FusedMoEWithLoRA itself, so when VLLM_TUNED_CONFIG_FOLDER is not set the fallback branch will raise AttributeError before any config is produced. This breaks the non‑tuned execution path entirely. The same pattern is repeated in the later branch that prepares the down kernel configs.
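A minimal, self-contained sketch of the delegation pattern this comment asks for (class and attribute names here are illustrative, not vLLM's actual API): the LoRA wrapper holds no weights of its own, so every weight or quant-config lookup has to go through `base_layer`.

```python
# Illustrative sketch only, assuming a wrapper/base-layer split like the one
# described above; names do not match vLLM's real classes.
class BaseMoELayer:
    def __init__(self):
        # stand-ins for w13_weight.size() / w2_weight.size()
        self.w13_weight_size = (8, 256, 128)
        self.w2_weight_size = (8, 128, 256)


class MoEWithLoRA:
    def __init__(self, base_layer: BaseMoELayer):
        self.base_layer = base_layer  # the wrapper itself has no weight attributes

    def default_config_args(self, top_k: int, config_dtype: str):
        # Correct: reach the weights through self.base_layer.
        # self.w13_weight_size would raise AttributeError here.
        return (
            self.base_layer.w13_weight_size,
            self.base_layer.w2_weight_size,
            top_k,
            config_dtype,
        )


print(MoEWithLoRA(BaseMoELayer()).default_config_args(top_k=2, config_dtype="bf16"))
```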
Tuned configs pass None for GROUP_SIZE_M
When launching the fused MoE LoRA kernels, add_lora_fused_moe derives the group size via shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")) (and similarly for expand). Configs returned by get_lora_op_configs use the key group_size_m, so both lookups miss and the call forwards None as GROUP_SIZE_M to the Triton kernel whenever tuned configs are used. The kernel’s block alignment requires an integer and will fail at runtime. The fallback should check "group_size_m" instead.
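A small runnable illustration of the key mismatch described above (the dictionary contents are made up; only the key names matter): a tuned config stored under `group_size_m` is missed when only `GROUP_SIZE_M` and `group_m` are tried, so `None` reaches the kernel.

```python
# Hypothetical tuned config returned by get_lora_op_configs-style lookup.
tuned_config = {"block_m": 64, "block_n": 128, "block_k": 32, "group_size_m": 8}

# Lookup pattern currently used when launching the kernels: both keys miss.
group_m = tuned_config.get("GROUP_SIZE_M", tuned_config.get("group_m"))
assert group_m is None  # the Triton kernel would receive None for GROUP_SIZE_M

# Fallback suggested in the comment above: also try "group_size_m".
group_m = tuned_config.get("GROUP_SIZE_M", tuned_config.get("group_size_m"))
assert group_m == 8
```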
For `expand`, the config file is named `{gpu_name}_EXPAND_{add_input}.json`, e.g. `NVIDIA_H200_EXPAND_TRUE.json`.
For `fused_moe_lora_gate_up_shrink`, the config file is named `{gpu_name}_FUSED_MOE_LORA_GATE_UP_SHRINK.json`, e.g. `NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json`.
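A hedged sketch (not vLLM's actual loader) of how a config file name could be derived from the naming patterns quoted above; the function name and the pre-normalized `gpu_name` argument are assumptions for illustration.

```python
# Sketch only: derives the expected JSON file name from the documented patterns.
def tuned_config_filename(gpu_name: str, op_type: str, add_input: bool | None = None) -> str:
    name = f"{gpu_name}_{op_type.upper()}"
    # per the docs above, expand configs additionally encode add_input
    if op_type.lower() == "expand" and add_input is not None:
        name += f"_{str(add_input).upper()}"
    return name + ".json"


print(tuned_config_filename("NVIDIA_H200", "expand", add_input=True))
# NVIDIA_H200_EXPAND_TRUE.json
print(tuned_config_filename("NVIDIA_H200", "fused_moe_lora_gate_up_shrink"))
# NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json
```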
QQ: Can benchmark_lora.py benchmark fused_moe_lora_gate_up_shrink and fused_moe_lora_gate_up_expand?
Currently it looks like benchmark_lora.py only supports the lora_shrink and lora_expand kernels
vllm/benchmarks/kernels/benchmark_lora.py, line 196 (commit e05a675):

```python
    def from_str(s: str) -> "OpType":
```
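For context, a purely hypothetical sketch of what extending the benchmark's op-type enum might look like; the member names below are assumptions for illustration, not benchmark_lora.py's real definitions.

```python
# Hypothetical sketch; not the actual OpType from benchmark_lora.py.
from enum import Enum, auto


class OpType(Enum):
    LORA_SHRINK = auto()
    LORA_EXPAND = auto()
    # hypothetical additions needed to benchmark the fused MoE LoRA kernels
    FUSED_MOE_LORA_GATE_UP_SHRINK = auto()
    FUSED_MOE_LORA_GATE_UP_EXPAND = auto()

    @staticmethod
    def from_str(s: str) -> "OpType":
        return OpType[s.strip().upper()]


print(OpType.from_str("fused_moe_lora_gate_up_shrink"))
```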
This pull request has merge conflicts that must be resolved before it can be merged.
| "BLOCK_SIZE_K": block_size_k, | ||
| "GROUP_SIZE_M": group_size_m, | ||
| shrink_config = { | ||
| "BLOCK_SIZE_M": shrink_block_size_m, |
It's better to keep the key name consistent with #26319
Let me make it consistent
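As a sketch of what "consistent" could look like (the numeric values are placeholders, not tuned results): keep the shrink and expand dictionaries on the same key schema already used for the fused MoE configs, differing only in their values.

```python
# Placeholder values; only the shared key schema is the point here.
shrink_config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 3,
}

# Same keys as shrink_config, so downstream lookups stay uniform;
# only the tuned values differ between the two kernels.
expand_config = {**shrink_config, "BLOCK_SIZE_N": 64, "num_stages": 4}
```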
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Purpose

- Load tuned `fused_moe_lora` kernel `shrink` and `expand` kernel configs in the `fused_moe_lora` function
- Add `num_stages` and `num_warps` parameters in the configs

Note: Based on PR #21229 and #26319
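A minimal usage sketch, assuming the `VLLM_TUNED_CONFIG_FOLDER` environment variable mentioned in the review comments above controls where tuned configs are loaded from; the path is a placeholder.

```python
import os

# Point vLLM at a folder of tuned JSON configs (path is a placeholder).
os.environ["VLLM_TUNED_CONFIG_FOLDER"] = "/path/to/tuned_configs"
# When the variable is unset, the code falls back to the default MoE config
# path discussed in the review comments above.
```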
Test Plan
Test Result
Together with #26319, we can improve OTPS by 80%–90% on GPT-OSS-120B when concurrency is 1 or 2.
(Optional) Documentation Update