Description
🚀 The feature, motivation and pitch
PR #27492 introduced RoutingMethodType to support different routing methods for FP8 flashinfer TRTLLM MOE (DeepSeekV3, Llama4, Renormalize, etc.).
While this was implemented to support Qwen3 and Qwen3-next models, the review discussion revealed opportunities to use this more broadly across the
codebase to simplify MoE routing configuration.
Motivation:
Currently, MoE routing behavior is controlled through multiple fragmented parameters (scoring_func, renormalize, use_grouped_topk, custom routing
functions, etc.). This creates several issues:
- Lack of clarity: The routing method isn't explicitly defined in one place
- Code duplication: Each model must explicitly specify routing parameters
- Maintenance burden: Adding new routing methods requires updates across multiple locations
- Tight coupling: Current implementation is tied to flashinfer's specific enum values
As noted by @mgoin:
"I like the idea of having a routing method type so we can reduce the need for hacks like checking the llama 4 custom routing function within the
quant method... I think if we do this right, we can actually remove other arguments we have in FusedMoE such as renormalize."
Proposed improvements:
- Auto-derive routing type: Instead of requiring each model to explicitly set routing_method_type, automatically derive it from existing parameters (scoring_func, renormalize, use_grouped_topk, top_k, etc.) within FusedMoE.__init__
- Decouple from flashinfer: Make RoutingMethodType a vLLM-native abstraction that works across all fused MoE backends (not just flashinfer TRTLLM), with backend-specific mapping happening at the kernel level
- Simplify FusedMoE API: Remove redundant parameters like renormalize and potentially apply_router_weight_on_input by folding them into the routing type
- Support explicit override: Allow models to explicitly specify routing type when auto-derivation isn't sufficient
- Router abstraction: Consider implementing router objects/functions that can be passed directly (as suggested by @bnellnm)
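To make the first, second, and fourth points concrete, here is a rough sketch of what auto-derivation with an explicit override and a per-backend mapping might look like. Everything in it is an assumption for discussion: the enum members beyond the names mentioned above, the derivation rules, the helper names derive_routing_method_type and to_backend_code, and the integer codes are all hypothetical and do not reflect the current vLLM or flashinfer code.

```python
from enum import Enum, auto
from typing import Dict, Optional


class RoutingMethodType(Enum):
    """Illustrative backend-agnostic routing enum (member set is an assumption)."""
    RENORMALIZE = auto()
    DEEPSEEK_V3 = auto()
    LLAMA4 = auto()
    UNSPECIFIED = auto()


def derive_routing_method_type(
    scoring_func: str,
    renormalize: bool,
    use_grouped_topk: bool,
    top_k: int,
    explicit: Optional[RoutingMethodType] = None,
) -> RoutingMethodType:
    """Hypothetical helper that FusedMoE.__init__ could call.

    The rules below are placeholders; the real mapping would need to be
    agreed on per model family.
    """
    # An explicit override always wins over auto-derivation.
    if explicit is not None:
        return explicit
    # Assumed rule: grouped top-k with sigmoid scoring -> DeepSeekV3-style.
    if use_grouped_topk and scoring_func == "sigmoid":
        return RoutingMethodType.DEEPSEEK_V3
    # Assumed rule: ungrouped top-1 with sigmoid scoring -> Llama4-style.
    if scoring_func == "sigmoid" and top_k == 1:
        return RoutingMethodType.LLAMA4
    # Assumed rule: softmax scoring with renormalization -> Renormalize.
    if scoring_func == "softmax" and renormalize:
        return RoutingMethodType.RENORMALIZE
    return RoutingMethodType.UNSPECIFIED


# Hypothetical table owned by one backend. The integer codes are made-up
# placeholders, not flashinfer's actual enum values.
EXAMPLE_BACKEND_CODES: Dict[RoutingMethodType, int] = {
    RoutingMethodType.RENORMALIZE: 10,
    RoutingMethodType.DEEPSEEK_V3: 20,
}


def to_backend_code(
    routing: RoutingMethodType, table: Dict[RoutingMethodType, int]
) -> int:
    """Translate the vLLM-native enum at the kernel boundary."""
    try:
        return table[routing]
    except KeyError:
        raise NotImplementedError(
            f"{routing.name} is not supported by this backend"
        ) from None


if __name__ == "__main__":
    derived = derive_routing_method_type("softmax", True, False, top_k=8)
    print(derived.name, to_backend_code(derived, EXAMPLE_BACKEND_CODES))
```

With something along these lines, model code would only pass explicit=... when the derived value is wrong, and each backend would keep its own table instead of sharing flashinfer's enum values.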
Alternatives
Keep the current approach of using multiple discrete parameters (scoring_func, renormalize, etc.), but this requires ongoing maintenance of mapping
logic scattered across quant methods and model code.
Additional context
Related PR: #27492 - Initial implementation of RoutingMethodType
Code locations that would benefit:
- vllm/model_executor/layers/fused_moe/config.py:RoutingMethodType - Make backend-agnostic
- vllm/model_executor/layers/fused_moe/layer.py:FusedMoE.__init__ - Add auto-derivation logic
- vllm/model_executor/layers/quantization/fp8.py - Simplify routing type usage
- vllm/model_executor/models/qwen3_moe.py - Should not need explicit routing_method_type
- vllm/model_executor/models/qwen3_next.py - Should not need explicit routing_method_type
cc @bnellnm @jiahanc @pavanimajety