[Feature]: Generalize RoutingMethodType for broader MoE routing control #28408

@mgoin

🚀 The feature, motivation and pitch

PR #27492 introduced RoutingMethodType to support different routing methods (DeepSeekV3, Llama4, Renormalize, etc.) for FP8 FlashInfer TRTLLM MoE.
Although it was implemented to support the Qwen3 and Qwen3-next models, the review discussion revealed opportunities to use it more broadly across the
codebase to simplify MoE routing configuration.

Motivation:

Currently, MoE routing behavior is controlled through multiple fragmented parameters (scoring_func, renormalize, use_grouped_topk, custom routing
functions, etc.). This creates several issues:

  1. Lack of clarity: The routing method isn't explicitly defined in one place
  2. Code duplication: Each model must explicitly specify routing parameters
  3. Maintenance burden: Adding new routing methods requires updates across multiple locations
  4. Tight coupling: Current implementation is tied to flashinfer's specific enum values

As noted by @mgoin:
"I like the idea of having a routing method type so we can reduce the need for hacks like checking the llama 4 custom routing function within the
quant method... I think if we do this right, we can actually remove other arguments we have in FusedMoE such as renormalize."

Proposed improvements:

  1. Auto-derive routing type: Instead of requiring each model to explicitly set routing_method_type, automatically derive it from existing parameters
    (scoring_func, renormalize, use_grouped_topk, top_k, etc.) within FusedMoE.__init__
  2. Decouple from flashinfer: Make RoutingMethodType a vLLM-native abstraction that works across all fused MoE backends (not just flashinfer TRTLLM),
    with backend-specific mapping happening at the kernel level
  3. Simplify FusedMoE API: Remove redundant parameters like renormalize and potentially apply_router_weight_on_input by folding them into the routing
    type
  4. Support explicit override: Allow models to explicitly specify routing type when auto-derivation isn't sufficient
  5. Router abstraction: Consider implementing router objects/functions that can be passed directly (as suggested by @bnellnm)
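
To make items 1 and 4 concrete, here is a minimal sketch of auto-derivation with an explicit override. Everything in it is an assumption for illustration: `RoutingKind`, its member names, and `derive_routing_kind` are hypothetical and do not mirror the existing RoutingMethodType values, and the decision rules are placeholders rather than the logic vLLM would actually need.

```python
from enum import Enum, auto
from typing import Optional


class RoutingKind(Enum):
    """Hypothetical backend-agnostic routing enum (names are illustrative)."""

    SOFTMAX_TOPK = auto()   # softmax over all experts, then top-k
    RENORMALIZE = auto()    # softmax + top-k with renormalized weights
    GROUPED_TOPK = auto()   # grouped top-k selection
    SIGMOID_TOP1 = auto()   # sigmoid scoring with a single expert
    UNSPECIFIED = auto()    # no rule matched; caller should override


def derive_routing_kind(
    scoring_func: str,
    renormalize: bool,
    use_grouped_topk: bool,
    top_k: int,
    override: Optional[RoutingKind] = None,
) -> RoutingKind:
    """Derive a routing kind from the existing discrete parameters.

    The rules below are placeholders, not vLLM's real mapping.
    """
    # An explicit override always wins (proposal item 4).
    if override is not None:
        return override
    if use_grouped_topk:
        return RoutingKind.GROUPED_TOPK
    if scoring_func == "sigmoid":
        return RoutingKind.SIGMOID_TOP1 if top_k == 1 else RoutingKind.UNSPECIFIED
    if scoring_func == "softmax":
        return RoutingKind.RENORMALIZE if renormalize else RoutingKind.SOFTMAX_TOPK
    return RoutingKind.UNSPECIFIED
```

With something like this called from FusedMoE.__init__, a model would only pass a routing type when the derived value is wrong or `UNSPECIFIED`.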

Alternatives

Keep the current approach of using multiple discrete parameters (scoring_func, renormalize, etc.). The drawback is ongoing maintenance of mapping
logic scattered across quant methods and model code.

Additional context

Related PR: #27492 - Initial implementation of RoutingMethodType

Code locations that would benefit:

  • vllm/model_executor/layers/fused_moe/config.py:RoutingMethodType - Make backend-agnostic
  • vllm/model_executor/layers/fused_moe/layer.py:FusedMoE.__init__ - Add auto-derivation logic
  • vllm/model_executor/layers/quantization/fp8.py - Simplify routing type usage
  • vllm/model_executor/models/qwen3_moe.py - Should not need explicit routing_method_type
  • vllm/model_executor/models/qwen3_next.py - Should not need explicit routing_method_type
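
For the "make backend-agnostic" item, one possible shape is a vLLM-native enum plus a per-backend lookup applied at the kernel boundary. This is only a sketch: `RoutingKind`, `BACKEND_ROUTING_IDS`, `to_backend_id`, the backend names, and the integer IDs are all hypothetical placeholders, not values taken from flashinfer or any real kernel.

```python
from enum import Enum, auto


class RoutingKind(Enum):
    """Hypothetical backend-agnostic routing enum (names are illustrative)."""

    SOFTMAX_TOPK = auto()
    RENORMALIZE = auto()
    GROUPED_TOPK = auto()


# Hypothetical per-backend tables. The integer IDs are placeholders and do
# not correspond to any real kernel's enum values.
BACKEND_ROUTING_IDS = {
    "backend_a": {
        RoutingKind.SOFTMAX_TOPK: 0,
        RoutingKind.RENORMALIZE: 1,
        RoutingKind.GROUPED_TOPK: 2,
    },
    "backend_b": {
        RoutingKind.SOFTMAX_TOPK: 10,
        RoutingKind.RENORMALIZE: 11,
    },
}


def to_backend_id(backend: str, kind: RoutingKind) -> int:
    """Translate the native routing kind into a backend-specific ID."""
    if backend not in BACKEND_ROUTING_IDS:
        raise ValueError(f"unknown backend: {backend!r}")
    table = BACKEND_ROUTING_IDS[backend]
    if kind not in table:
        # Fail loudly so an unsupported combination is caught at setup time.
        raise NotImplementedError(f"{backend} does not support {kind.name}")
    return table[kind]
```

Keeping the translation in one table per backend means model code and quant methods would only ever see the native enum.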

cc @bnellnm @jiahanc @pavanimajety
