
Conversation

@yugong333
Contributor

@yugong333 yugong333 commented Oct 21, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

  1. Load the kernel configs for the fused_moe_lora kernel.
  2. Separate the shrink and expand kernel configs in the fused_moe_lora function.
  3. Add num_stages and num_warps parameters to the configs (an illustrative sketch follows the note below).

Note: Based on PR #21229 and #26319
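For illustration only, the separated per-op tuned configs might carry entries like the following; the key names mirror ones that appear later in this review thread, and every value here is made up rather than taken from an actual tuned config file:

```python
# Illustrative example only -- not the actual vLLM config schema or tuned values.
# Shrink and expand get independent entries, each now carrying num_warps and
# num_stages in addition to the Triton block/group sizes.
shrink_config = {
    "BLOCK_SIZE_M": 32,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 3,
}
expand_config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 3,
}
```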

Test Plan

Test Result

Together with #26319, this improves OTPS by 80%–90% on GPT-OSS-120B when concurrency is 1 or 2.

[image: test results]

(Optional) Documentation Update

@yugong333 yugong333 requested a review from jeejeelee as a code owner October 21, 2025 16:26
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for loading tuned kernel configurations for fused_moe_lora operations, separating shrink and expand configs to allow for more fine-grained performance tuning. The changes are extensive, touching LoRA layer logic, Triton ops, and the Punica wrapper. While the overall direction is good and enables significant performance improvements, I've identified a few critical issues that will cause runtime errors or incorrect behavior. These include incorrect attribute access that will lead to AttributeErrors, usage of incorrect tensor shapes for configuration, and an inconsistent default value that could lead to suboptimal performance. Addressing these issues is crucial for the feature to work as intended.

Comment on lines 131 to 135
self.w13_weight.size(),
self.w2_weight.size(),
top_k,
config_dtype,
block_shape=self.quant_method.moe_quant_config.block_shape,
Contributor


critical

The code attempts to access w13_weight, w2_weight, and quant_method directly from self (the FusedMoEWithLoRA instance). However, these are attributes of the base_layer. This will lead to an AttributeError at runtime. You should access these attributes via self.base_layer.

Suggested change
- self.w13_weight.size(),
- self.w2_weight.size(),
- top_k,
- config_dtype,
- block_shape=self.quant_method.moe_quant_config.block_shape,
+ self.base_layer.w13_weight.size(),
+ self.base_layer.w2_weight.size(),
+ top_k,
+ config_dtype,
+ block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,
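A minimal sketch of the wrapper relationship this comment describes, using simplified stand-in classes (not the real vLLM implementations), to show why the attribute lookup fails:

```python
# Simplified stand-ins: the base layer owns the weights, the LoRA wrapper does not.
class FusedMoE:
    def __init__(self):
        self.w13_weight = ...   # gate/up projection weights
        self.w2_weight = ...    # down projection weights
        self.quant_method = ... # carries moe_quant_config

class FusedMoEWithLoRA:
    def __init__(self, base_layer: FusedMoE):
        self.base_layer = base_layer  # weights are not mirrored onto the wrapper

# Inside FusedMoEWithLoRA methods:
#   self.w13_weight             -> AttributeError
#   self.base_layer.w13_weight  -> correct access path
```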

Comment on lines 217 to 225
expand_config = get_lora_op_configs(
    op_type="fused_moe_lora_down_expand",
    max_loras=self.w1_lora_a_stacked.shape[0],
    batch=M,
    hidden_size=self.w1_lora_a_stacked.shape[-1],
    rank=self.w1_lora_a_stacked.shape[-2],
    num_slices=1,
    hidden_size_2=self.w1_lora_b_stacked.shape[-2]
)
Contributor


critical

The configuration for fused_moe_lora_down_expand seems to be using shapes from w1_lora_..._stacked tensors, which are intended for the 'up' projection. For the 'down' projection handled in moe_sum_decorator, it should be using shapes from w2_lora_..._stacked tensors to ensure the correct dimensions are passed for kernel tuning.

Suggested change
- expand_config = get_lora_op_configs(
-     op_type="fused_moe_lora_down_expand",
-     max_loras=self.w1_lora_a_stacked.shape[0],
-     batch=M,
-     hidden_size=self.w1_lora_a_stacked.shape[-1],
-     rank=self.w1_lora_a_stacked.shape[-2],
-     num_slices=1,
-     hidden_size_2=self.w1_lora_b_stacked.shape[-2]
- )
+ expand_config = get_lora_op_configs(
+     op_type="fused_moe_lora_down_expand",
+     max_loras=self.w2_lora_a_stacked.shape[0],
+     batch=M,
+     hidden_size=self.w2_lora_a_stacked.shape[-1],
+     rank=self.w2_lora_a_stacked.shape[-2],
+     num_slices=1,
+     hidden_size_2=self.w2_lora_b_stacked.shape[-2]
+ )

Comment on lines 229 to 234
self.w13_weight.size(),
self.w2_weight.size(),
top_k,
config_dtype,
block_shape=self.quant_method.moe_quant_config.
block_shape,
Contributor


critical

Similar to the issue in act_decorator, the code here attempts to access w13_weight, w2_weight, and quant_method from self instead of self.base_layer. This will cause an AttributeError. These attributes should be accessed through self.base_layer.

Suggested change
- self.w13_weight.size(),
- self.w2_weight.size(),
- top_k,
- config_dtype,
- block_shape=self.quant_method.moe_quant_config.
- block_shape,
+ self.base_layer.w13_weight.size(),
+ self.base_layer.w2_weight.size(),
+ top_k,
+ config_dtype,
+ block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,

Comment on lines 383 to 397
shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")),
shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")),
shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")),
shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")),
shrink_config.get("num_warps", 4),
shrink_config.get("num_stages", 1),
expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")),
expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")),
expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")),
expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")),
expand_config.get("num_warps", 4),
expand_config.get("num_stages", 1),
mul_routed_weight,
Contributor


high

There's an inconsistency in the default value for num_stages. Here, it defaults to 1 if not found in the config. However, the default configuration for fused_moe_lora ops in vllm/lora/ops/triton_ops/utils.py specifies num_stages as 3. This can lead to using a suboptimal number of stages if the tuned config file doesn't specify num_stages, potentially degrading performance. To ensure consistent behavior, the default value here should match the one in utils.py.

Suggested change
- shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")),
- shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")),
- shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")),
- shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")),
- shrink_config.get("num_warps", 4),
- shrink_config.get("num_stages", 1),
- expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")),
- expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")),
- expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")),
- expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")),
- expand_config.get("num_warps", 4),
- expand_config.get("num_stages", 1),
- mul_routed_weight,
+ shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")),
+ shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")),
+ shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")),
+ shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")),
+ shrink_config.get("num_warps", 4),
+ shrink_config.get("num_stages", 3),
+ expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")),
+ expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")),
+ expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")),
+ expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")),
+ expand_config.get("num_warps", 4),
+ expand_config.get("num_stages", 3),
+ mul_routed_weight,
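One way to keep the two defaults from drifting, sketched here with an assumed constant name (the suggestion above simply hard-codes 3, which is equally valid):

```python
# Hypothetical shared default, mirroring the value documented for fused_moe_lora
# ops in vllm/lora/ops/triton_ops/utils.py, so both code paths stay consistent.
DEFAULT_FUSED_MOE_LORA_NUM_STAGES = 3

# A tuned config that omits num_stages (illustrative values only).
shrink_config = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64,
                 "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 8, "num_warps": 4}

num_stages = shrink_config.get("num_stages", DEFAULT_FUSED_MOE_LORA_NUM_STAGES)
assert num_stages == 3  # matches utils.py instead of silently falling back to 1
```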


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 128 to 135
else:  # fall back to the default config
    get_config_func = functools.partial(
        try_get_optimal_moe_config,
        self.w13_weight.size(),
        self.w2_weight.size(),
        top_k,
        config_dtype,
        block_shape=self.quant_method.moe_quant_config.block_shape,


P0: Reference base-layer attributes instead of self in fallback path

Inside the decorators that compute default MoE configs, the new code now reads self.w13_weight, self.w2_weight, and self.quant_method. Those attributes live on the wrapped base layer (layer) rather than on FusedMoEWithLoRA itself, so when VLLM_TUNED_CONFIG_FOLDER is not set the fallback branch will raise AttributeError before any config is produced. This breaks the non‑tuned execution path entirely. The same pattern is repeated in the later branch that prepares the down kernel configs.



P1: Tuned configs pass None for GROUP_SIZE_M

When launching the fused MoE LoRA kernels, add_lora_fused_moe derives the group size via shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")) (and similarly for expand). Configs returned by get_lora_op_configs use the key group_size_m, so both lookups miss and the call forwards None as GROUP_SIZE_M to the Triton kernel whenever tuned configs are used. The kernel’s block alignment requires an integer and will fail at runtime. The fallback should check "group_size_m" instead.
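A minimal sketch of the lookup fix this comment suggests, assuming the tuned config dictionary uses the lowercase group_size_m key described above:

```python
# Example tuned config shaped like the output of get_lora_op_configs
# (key names assumed from the comment above; values are illustrative).
shrink_config = {"block_size_m": 32, "block_size_n": 64,
                 "block_size_k": 128, "group_size_m": 8,
                 "num_warps": 4, "num_stages": 3}

# Fall back to "group_size_m" instead of "group_m", so tuned configs
# no longer forward None as GROUP_SIZE_M to the Triton kernel.
group_size_m = shrink_config.get("GROUP_SIZE_M",
                                 shrink_config.get("group_size_m"))
assert group_size_m == 8
```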


@yugong333 yugong333 requested a review from hmellor as a code owner October 21, 2025 18:52

For `expand`, the config file is named `{gpu_name}_EXPAND_{add_input}.json`, e.g. `NVIDIA_H200_EXPAND_TRUE.json`.

For `fused_moe_lora_gate_up_shrink`, the config file is named `{gpu_name}_FUSED_MOE_LORA_GATE_UP_SHRINK.json`, e.g. `NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json`.
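For illustration only, the naming pattern quoted above could be reproduced with helpers like these (the function names are made up, not part of the PR):

```python
# Hypothetical helpers that reproduce the documented config-file naming.
def expand_config_filename(gpu_name: str, add_input: bool) -> str:
    return f"{gpu_name}_EXPAND_{'TRUE' if add_input else 'FALSE'}.json"

def fused_moe_lora_config_filename(gpu_name: str, op_type: str) -> str:
    return f"{gpu_name}_{op_type.upper()}.json"

print(expand_config_filename("NVIDIA_H200", add_input=True))
# -> NVIDIA_H200_EXPAND_TRUE.json
print(fused_moe_lora_config_filename("NVIDIA_H200", "fused_moe_lora_gate_up_shrink"))
# -> NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json
```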
Collaborator


QQ: Can benchmark_lora.py benchmark fused_moe_lora_gate_up_shrink and fused_moe_lora_gate_up_expand?

Contributor


Currently it looks like benchmark_lora.py only supports the lora_shrink and lora_expand kernels (see `def from_str(s: str) -> "OpType":` in benchmark_lora.py).
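A rough sketch (not the actual benchmark_lora.py code) of the kind of string-to-op dispatch from_str performs; adding the fused_moe_lora op names to such a mapping would be the natural entry point for benchmarking them:

```python
from enum import Enum, auto

# Illustrative only: a minimal OpType covering the two currently supported kernels.
class OpType(Enum):
    LORA_SHRINK = auto()
    LORA_EXPAND = auto()

    @staticmethod
    def from_str(s: str) -> "OpType":
        name = s.strip().lower()
        if name == "lora_shrink":
            return OpType.LORA_SHRINK
        if name == "lora_expand":
            return OpType.LORA_EXPAND
        raise ValueError(f"Unsupported op type: {s}")
```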

@mergify

mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @yugong333.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

"BLOCK_SIZE_K": block_size_k,
"GROUP_SIZE_M": group_size_m,
shrink_config = {
"BLOCK_SIZE_M": shrink_block_size_m,
Contributor


It's better to keep the key name consistent with #26319

Contributor Author


Let me make it consistent
