Load tuned fused_moe_lora shrink and expand kernel configs separately #27272
Conversation
Code Review
This pull request introduces support for loading tuned kernel configurations for fused_moe_lora operations, separating shrink and expand configs to allow for more fine-grained performance tuning. The changes are extensive, touching LoRA layer logic, Triton ops, and the Punica wrapper. While the overall direction is good and enables significant performance improvements, I've identified a few critical issues that will cause runtime errors or incorrect behavior. These include incorrect attribute access that will lead to AttributeErrors, usage of incorrect tensor shapes for configuration, and an inconsistent default value that could lead to suboptimal performance. Addressing these issues is crucial for the feature to work as intended.
vllm/lora/layers/fused_moe.py (Outdated)

```python
    self.w13_weight.size(),
    self.w2_weight.size(),
    top_k,
    config_dtype,
    block_shape=self.quant_method.moe_quant_config.block_shape,
```
The code attempts to access w13_weight, w2_weight, and quant_method directly from self (the FusedMoEWithLoRA instance). However, these are attributes of the base_layer. This will lead to an AttributeError at runtime. You should access these attributes via self.base_layer.
Suggested change:

```diff
-    self.w13_weight.size(),
-    self.w2_weight.size(),
+    self.base_layer.w13_weight.size(),
+    self.base_layer.w2_weight.size(),
     top_k,
     config_dtype,
-    block_shape=self.quant_method.moe_quant_config.block_shape,
+    block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,
```
vllm/lora/layers/fused_moe.py (Outdated)

```python
    expand_config = get_lora_op_configs(
        op_type="fused_moe_lora_down_expand",
        max_loras=self.w1_lora_a_stacked.shape[0],
        batch=M,
        hidden_size=self.w1_lora_a_stacked.shape[-1],
        rank=self.w1_lora_a_stacked.shape[-2],
        num_slices=1,
        hidden_size_2=self.w1_lora_b_stacked.shape[-2]
    )
```
The configuration for fused_moe_lora_down_expand seems to be using shapes from w1_lora_..._stacked tensors, which are intended for the 'up' projection. For the 'down' projection handled in moe_sum_decorator, it should be using shapes from w2_lora_..._stacked tensors to ensure the correct dimensions are passed for kernel tuning.
Suggested change:

```diff
 expand_config = get_lora_op_configs(
     op_type="fused_moe_lora_down_expand",
-    max_loras=self.w1_lora_a_stacked.shape[0],
+    max_loras=self.w2_lora_a_stacked.shape[0],
     batch=M,
-    hidden_size=self.w1_lora_a_stacked.shape[-1],
-    rank=self.w1_lora_a_stacked.shape[-2],
+    hidden_size=self.w2_lora_a_stacked.shape[-1],
+    rank=self.w2_lora_a_stacked.shape[-2],
     num_slices=1,
-    hidden_size_2=self.w1_lora_b_stacked.shape[-2]
+    hidden_size_2=self.w2_lora_b_stacked.shape[-2]
 )
```
vllm/lora/layers/fused_moe.py (Outdated)

```python
    self.w13_weight.size(),
    self.w2_weight.size(),
    top_k,
    config_dtype,
    block_shape=self.quant_method.moe_quant_config.block_shape,
```
Similar to the issue in act_decorator, the code here attempts to access w13_weight, w2_weight, and quant_method from self instead of self.base_layer. This will cause an AttributeError. These attributes should be accessed through self.base_layer.
Suggested change:

```diff
-    self.w13_weight.size(),
-    self.w2_weight.size(),
+    self.base_layer.w13_weight.size(),
+    self.base_layer.w2_weight.size(),
     top_k,
     config_dtype,
-    block_shape=self.quant_method.moe_quant_config.block_shape,
+    block_shape=self.base_layer.quant_method.moe_quant_config.block_shape,
```
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | ||
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | ||
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | ||
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | ||
| shrink_config.get("num_warps", 4), | ||
| shrink_config.get("num_stages", 1), | ||
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | ||
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | ||
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | ||
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | ||
| expand_config.get("num_warps", 4), | ||
| expand_config.get("num_stages", 1), | ||
| mul_routed_weight, |
There's an inconsistency in the default value for num_stages. Here, it defaults to 1 if not found in the config. However, the default configuration for fused_moe_lora ops in vllm/lora/ops/triton_ops/utils.py specifies num_stages as 3. This can lead to using a suboptimal number of stages if the tuned config file doesn't specify num_stages, potentially degrading performance. To ensure consistent behavior, the default value here should match the one in utils.py.
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | |
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | |
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | |
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | |
| shrink_config.get("num_warps", 4), | |
| shrink_config.get("num_stages", 1), | |
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | |
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | |
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | |
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | |
| expand_config.get("num_warps", 4), | |
| expand_config.get("num_stages", 1), | |
| mul_routed_weight, | |
| shrink_config.get("BLOCK_SIZE_M", shrink_config.get("block_m")), | |
| shrink_config.get("BLOCK_SIZE_N", shrink_config.get("block_n")), | |
| shrink_config.get("BLOCK_SIZE_K", shrink_config.get("block_k")), | |
| shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")), | |
| shrink_config.get("num_warps", 4), | |
| shrink_config.get("num_stages", 3), | |
| expand_config.get("BLOCK_SIZE_M", expand_config.get("block_m")), | |
| expand_config.get("BLOCK_SIZE_N", expand_config.get("block_n")), | |
| expand_config.get("BLOCK_SIZE_K", expand_config.get("block_k")), | |
| expand_config.get("GROUP_SIZE_M", expand_config.get("group_m")), | |
| expand_config.get("num_warps", 4), | |
| expand_config.get("num_stages", 3), | |
| mul_routed_weight, |
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/lora/layers/fused_moe.py (Outdated)

```python
    else:  # fall back to the default config
        get_config_func = functools.partial(
            try_get_optimal_moe_config,
            self.w13_weight.size(),
            self.w2_weight.size(),
            top_k,
            config_dtype,
            block_shape=self.quant_method.moe_quant_config.block_shape,
```
Reference base-layer attributes instead of self in fallback path
Inside the decorators that compute default MoE configs, the new code now reads self.w13_weight, self.w2_weight, and self.quant_method. Those attributes live on the wrapped base layer (layer) rather than on FusedMoEWithLoRA itself, so when VLLM_TUNED_CONFIG_FOLDER is not set the fallback branch will raise AttributeError before any config is produced. This breaks the non‑tuned execution path entirely. The same pattern is repeated in the later branch that prepares the down kernel configs.
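A minimal, self-contained sketch of the delegation pattern this comment asks for (class and attribute names here are illustrative, not vLLM's actual API): the LoRA wrapper holds no weights of its own, so every weight or quant-config lookup has to go through `base_layer`.

```python
# Illustrative sketch only, assuming a wrapper/base-layer split like the one
# described above; names do not match vLLM's real classes.
class BaseMoELayer:
    def __init__(self):
        # stand-ins for w13_weight.size() / w2_weight.size()
        self.w13_weight_size = (8, 256, 128)
        self.w2_weight_size = (8, 128, 256)


class MoEWithLoRA:
    def __init__(self, base_layer: BaseMoELayer):
        self.base_layer = base_layer  # the wrapper itself has no weight attributes

    def default_config_args(self, top_k: int, config_dtype: str):
        # Correct: reach the weights through self.base_layer.
        # self.w13_weight_size would raise AttributeError here.
        return (
            self.base_layer.w13_weight_size,
            self.base_layer.w2_weight_size,
            top_k,
            config_dtype,
        )


print(MoEWithLoRA(BaseMoELayer()).default_config_args(top_k=2, config_dtype="bf16"))
```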
Tuned configs pass None for GROUP_SIZE_M
When launching the fused MoE LoRA kernels, add_lora_fused_moe derives the group size via shrink_config.get("GROUP_SIZE_M", shrink_config.get("group_m")) (and similarly for expand). Configs returned by get_lora_op_configs use the key group_size_m, so both lookups miss and the call forwards None as GROUP_SIZE_M to the Triton kernel whenever tuned configs are used. The kernel’s block alignment requires an integer and will fail at runtime. The fallback should check "group_size_m" instead.
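A small runnable illustration of the key mismatch described above (the dictionary contents are made up; only the key names matter): a tuned config stored under `group_size_m` is missed when only `GROUP_SIZE_M` and `group_m` are tried, so `None` reaches the kernel.

```python
# Hypothetical tuned config returned by get_lora_op_configs-style lookup.
tuned_config = {"block_m": 64, "block_n": 128, "block_k": 32, "group_size_m": 8}

# Lookup pattern currently used when launching the kernels: both keys miss.
group_m = tuned_config.get("GROUP_SIZE_M", tuned_config.get("group_m"))
assert group_m is None  # the Triton kernel would receive None for GROUP_SIZE_M

# Fallback suggested in the comment above: also try "group_size_m".
group_m = tuned_config.get("GROUP_SIZE_M", tuned_config.get("group_size_m"))
assert group_m == 8
```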
For `expand`, the config file is named `{gpu_name}_EXPAND_{add_input}.json`, e.g. `NVIDIA_H200_EXPAND_TRUE.json`.
For `fused_moe_lora_gate_up_shrink`, the config file is named `{gpu_name}_FUSED_MOE_LORA_GATE_UP_SHRINK.json`, e.g. `NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json`.
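A hedged sketch (not vLLM's actual loader) of how a config file name could be derived from the naming patterns quoted above; the function name and the pre-normalized `gpu_name` argument are assumptions for illustration.

```python
# Sketch only: derives the expected JSON file name from the documented patterns.
def tuned_config_filename(gpu_name: str, op_type: str, add_input: bool | None = None) -> str:
    name = f"{gpu_name}_{op_type.upper()}"
    # per the docs above, expand configs additionally encode add_input
    if op_type.lower() == "expand" and add_input is not None:
        name += f"_{str(add_input).upper()}"
    return name + ".json"


print(tuned_config_filename("NVIDIA_H200", "expand", add_input=True))
# NVIDIA_H200_EXPAND_TRUE.json
print(tuned_config_filename("NVIDIA_H200", "fused_moe_lora_gate_up_shrink"))
# NVIDIA_H200_FUSED_MOE_LORA_GATE_UP_SHRINK.json
```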
QQ: Can benchmark_lora.py benchmark fused_moe_lora_gate_up_shrink and fused_moe_lora_gate_up_expand?
Currently it looks like benchmark_lora.py only supports the lora_shrink and lora_expand kernels
vllm/benchmarks/kernels/benchmark_lora.py, line 196 (commit e05a675):

```python
    def from_str(s: str) -> "OpType":
```
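For context, a purely hypothetical sketch of what extending the benchmark's op-type enum might look like; the member names below are assumptions for illustration, not benchmark_lora.py's real definitions.

```python
# Hypothetical sketch; not the actual OpType from benchmark_lora.py.
from enum import Enum, auto


class OpType(Enum):
    LORA_SHRINK = auto()
    LORA_EXPAND = auto()
    # hypothetical additions needed to benchmark the fused MoE LoRA kernels
    FUSED_MOE_LORA_GATE_UP_SHRINK = auto()
    FUSED_MOE_LORA_GATE_UP_EXPAND = auto()

    @staticmethod
    def from_str(s: str) -> "OpType":
        return OpType[s.strip().upper()]


print(OpType.from_str("fused_moe_lora_gate_up_shrink"))
```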
This pull request has merge conflicts that must be resolved before it can be merged.
| "BLOCK_SIZE_K": block_size_k, | ||
| "GROUP_SIZE_M": group_size_m, | ||
| shrink_config = { | ||
| "BLOCK_SIZE_M": shrink_block_size_m, |
It's better to keep the key name consistent with #26319
Let me make it consistent
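As a sketch of what "consistent" could look like (the numeric values are placeholders, not tuned results): keep the shrink and expand dictionaries on the same key schema already used for the fused MoE configs, differing only in their values.

```python
# Placeholder values; only the shared key schema is the point here.
shrink_config = {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 32,
    "GROUP_SIZE_M": 8,
    "num_warps": 4,
    "num_stages": 3,
}

# Same keys as shrink_config, so downstream lookups stay uniform;
# only the tuned values differ between the two kernels.
expand_config = {**shrink_config, "BLOCK_SIZE_N": 64, "num_stages": 4}
```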
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Purpose

- Load tuned `fused_moe_lora` kernel `shrink` and `expand` kernel configs in the `fused_moe_lora` function
- Add `num_stages` and `num_warps` parameters in the configs

Note: Based on PR #21229 and #26319
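A minimal usage sketch, assuming the `VLLM_TUNED_CONFIG_FOLDER` environment variable mentioned in the review comments above controls where tuned configs are loaded from; the path is a placeholder.

```python
import os

# Point vLLM at a folder of tuned JSON configs (path is a placeholder).
os.environ["VLLM_TUNED_CONFIG_FOLDER"] = "/path/to/tuned_configs"
# When the variable is unset, the code falls back to the default MoE config
# path discussed in the review comments above.
```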
Test Plan
Test Result
Together with #26319, we can improve OTPS by 80%–90% on GPT-OSS-120B when concurrency is 1 or 2.
(Optional) Documentation Update