[Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE #26440
Conversation
Documentation preview: https://vllm--26440.org.readthedocs.build/en/26440/
💡 Codex Review
Here are some automated review suggestions for this pull request.
use_explicit_se = (
    not isinstance(self.quant_method.fused_experts, FusedMoEModularKernel)
    and self.shared_experts is not None
)
if use_explicit_se:
    current_stream = torch.cuda.current_stream()
    self.shared_experts_stream.wait_stream(current_stream)

router_logits, _ = self.gate(hidden_states)
Avoid recomputing router logits via a nonexistent gate
The updated forward_impl now calls self.gate(hidden_states) and discards the router_logits argument. FusedMoE does not define a gate module by default (the base property returns None), and most existing callers supply precomputed logits with no gate attribute wired into the layer. This will raise a TypeError (NoneType is not callable) the first time a standard MoE block without shared experts executes, effectively breaking every model that previously passed logits into FusedMoE.
def __init__(
    self,
    shared_experts: torch.nn.Module,
    gate: torch.nn.Module,
    use_overlapped: bool = True,
    **kwargs,
):
    super().__init__(**kwargs)
    self._shared_experts = shared_experts
    self._gate = gate
New SharedFusedMoE constructor breaks existing call sites
The constructor now requires a gate module, but none of the existing usages (e.g. in glm4_moe.py and llama4.py) pass this parameter. Instantiating those models will now raise a TypeError because the additional positional argument has no default. Unless every caller is updated simultaneously, this change makes all current shared-fused MoE models unusable.
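One backward-compatible option (a minimal sketch of the idea, not necessarily how the PR resolves it) is to give the new parameter a default so existing call sites keep working:

import torch
# Import path may differ across vLLM versions; this mirrors the class being edited.
from vllm.model_executor.layers.fused_moe.layer import FusedMoE

class SharedFusedMoE(FusedMoE):
    def __init__(
        self,
        shared_experts: torch.nn.Module,
        gate: torch.nn.Module | None = None,  # default keeps old call sites valid
        use_overlapped: bool = True,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self._shared_experts = shared_experts
        # Callers that do not pass a gate keep the old behavior of supplying
        # precomputed router_logits to forward().
        self._gate = gate
        self.use_overlapped = use_overlapped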
Code Review
This pull request introduces a performance optimization by executing the shared_experts computation on a separate CUDA stream, allowing it to run in parallel with the selected_experts part. This is a great idea for improving throughput.
I've found a critical issue in the implementation that could cause crashes in certain MoE configurations. Please see my comment below for details and a suggested fix.
if use_explicit_se:
    current_stream = torch.cuda.current_stream()
    self.shared_experts_stream.wait_stream(current_stream)

router_logits, _ = self.gate(hidden_states)
The call to self.gate(hidden_states) is unconditional, but self.gate can be None for MoE layers that do not use shared experts (e.g., the base FusedMoE class). This will cause a TypeError when forward_impl is called for such layers.
For instance, when DeepseekV2MoE is configured without shared experts, it creates a plain FusedMoE instance. Its forward method computes router_logits and passes them to self.experts.forward(). However, the modified forward_impl ignores these logits and attempts to call self.gate(), which is None for FusedMoE, leading to a crash.
To fix this, the gate computation should only happen when use_explicit_se is true, which is the case for shared experts where the gate is guaranteed to exist.
Suggested change:

if use_explicit_se:
    current_stream = torch.cuda.current_stream()
    self.shared_experts_stream.wait_stream(current_stream)
    router_logits, _ = self.gate(hidden_states)
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review (vllm/model_executor/models/deepseek_v2.py, lines 276 to 283 in 70bfd91)
In the SharedFusedMoE path the gate is no longer called before invoking the fused MoE kernel.
Do you mind just quickly checking that this doesn't break DBO (i.e. …)?

@LucasWilkinson will check now
Nice optimization! Overall looks pretty good to me assuming DBO works; left a few comments
# TODO: Allow disabling of the separate shared experts stream for
# debug purposes. Remove this after more extensive testings with
# TP/DP and other execution modes
disable_shared_experts_stream = os.environ.get(
    "DISABLE_MOE_SHARED_EXPERTS_CUDA_STREAM", None
)
should we move this to envs.py?
ditto, I think VLLM_DISABLE_SHARED_EXPERTS_STREAM is fine
Good idea, moved to envs.py
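For reference, a rough sketch of what such a flag could look like (hypothetical accessor; vllm/envs.py has its own registry of env-var lambdas, so the exact wiring in the PR may differ):

import os

def vllm_disable_shared_experts_stream() -> bool:
    # Boolean flag, default off; any value other than "0" or empty disables
    # the separate shared-experts stream.
    return os.getenv("VLLM_DISABLE_SHARED_EXPERTS_STREAM", "0") not in ("0", "")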
# parallel execution of shared experts with the FusedMoE via
# separate cuda stream)
if self.gate is not None:
    router_logits, _ = self.gate(hidden_states)
do we know if moving this out of the torch.compile region affects perf if we are not using multi-stream?
This won't run when multi-stream is disabled. As I understand it, the gate is always inside a torch.compile region, no?
Ah ok; I think it's a bit confusing that we always pass gate into SharedFusedMoE, and it's hard to tell the control flow in the modeling code. Maybe instead of:
if isinstance(self.experts, SharedFusedMoE) and self.experts.use_overlapped:
    fused_moe_out = self.experts(
        hidden_states=hidden_states, router_logits=hidden_states
    )
else:
    # router_logits: (num_tokens, n_experts)
    router_logits, _ = self.gate(hidden_states)
    fused_moe_out = self.experts(
        hidden_states=hidden_states, router_logits=router_logits
    )
we can do
class SharedFusedMoE(FusedMoE):
    def forward(
        self,
        hidden_states: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        if not self.use_overlapped:
            ...
            router_logits, _ = self.gate(hidden_states)
            fused_out = super().forward(
                hidden_states=hidden_states,
                router_logits=router_logits,
            )
        else:
            shared_out, fused_out = super().forward(
                hidden_states=hidden_states,
                router_logits=hidden_states,
            )
        return shared_out, fused_out
this way in the modeling code we can assume that if we are using SharedFusedMoE it will always handle the gate?
That's a good idea, let me try it. I may have some issues with the interface when removing the router_logits input, but let's see how I can remove it.
Actually, there is a problem when the FusedMoE class is not SharedFusedMoE, since then the gate() needs to be outside anyway, i.e. the if/else cannot be removed. However, I can replace the non-trivial "overlap" check by providing a function like is_router_internal() on the FusedMoE base class. Will try to do it.
Added is_internal_router property so it is cleaner now.
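A minimal sketch of what that property could look like (illustrative only; names follow the discussion above and the actual implementation may differ):

import torch

class FusedMoE(torch.nn.Module):
    @property
    def is_internal_router(self) -> bool:
        # Base layer: the caller computes router_logits and passes them in.
        return False

class SharedFusedMoE(FusedMoE):
    @property
    def is_internal_router(self) -> bool:
        # Overlapped shared-experts path: the layer owns the gate and computes
        # the logits itself, so the caller passes hidden_states instead.
        return self._gate is not None and self.use_overlapped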
):
    # Start the separate shared experts stream here since we want
    # to run in parallel with the router/gate (next op below)
    current_stream = torch.cuda.current_stream()
nit: use from vllm.utils.torch_utils import current_stream
self.shared_experts_stream.wait_stream(current_stream())
Nice, didn't know it is possible simply to import. I suspect it is a constant handle.
# For chunked, we start the shared experts stream here
# (Note that no concurrency with the router/gate)
current_stream = torch.cuda.current_stream()
self.shared_experts_stream.wait_stream(current_stream)
nit: use from vllm.utils.torch_utils import current_stream (vllm/utils/torch_utils.py, lines 349 to 358 in f9e7ad5):

"""
replace `torch.cuda.current_stream()` with `vllm.utils.current_stream()`.
it turns out that `torch.cuda.current_stream()` is quite expensive,
as it will construct a new stream object at each call.
here we patch `torch.cuda.set_stream` to keep track of the current stream
directly, so that we can avoid calling `torch.cuda.current_stream()`.
the underlying hypothesis is that we do not call `torch._C._cuda_setStream`
from C/C++ code.
"""
self.shared_experts_stream.wait_stream(current_stream())
Changed
| "DISABLE_MOE_SHARED_EXPERTS_CUDA_STREAM", None | ||
| ) | ||
|
|
||
| if disable_shared_experts_stream is not None: |
Change the var from None by default to False and just do a regular bool check
Fixed to bool / False
fused_moe_out = self.experts(
    hidden_states=hidden_states, router_logits=router_logits
)
if isinstance(self.experts, SharedFusedMoE) and self.experts.use_overlapped:
Does it mean that any model that wants to utilize the multi-stream feature must update its own model definition code? For example, will Qwen-Next also benefit from this change?
It is a bit complicated, since there are two improvements: (1) the use of a cuda stream for shared_experts and (2) moving the gate/router op to after the shared_experts launch (so it is parallelized as well). For all models, (1) comes automatically from using SharedFusedMoE; for (2) you need to change the model code to move the gate inside (as done here for DeepSeekV2).
In terms of perf, around 70% comes from (1) and 30% from (2).
I see...
I was wondering whether this can be done through …
    logger.info_once("Disabling MoE shared_experts cuda stream")
    self.shared_experts_stream = None
else:
    self.shared_experts_stream = torch.cuda.Stream()
@alexm-redhat We may need to have two global streams rather than two streams per FusedMoE layer. With this feature we see an explosion of streams, which may not be ideal.
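One way to share a single side stream across layers (a hypothetical helper sketch, not code from the PR):

import torch

_SHARED_EXPERTS_STREAM: torch.cuda.Stream | None = None

def get_shared_experts_stream() -> torch.cuda.Stream:
    # Lazily create one process-wide side stream so every FusedMoE layer
    # reuses it instead of allocating its own.
    global _SHARED_EXPERTS_STREAM
    if _SHARED_EXPERTS_STREAM is None:
        _SHARED_EXPERTS_STREAM = torch.cuda.Stream()
    return _SHARED_EXPERTS_STREAM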
This PR executes the shared_experts part of the FusedMoE on a separate GPU stream, so that it runs in parallel with the "selected_experts" part. This is possible since the outputs of the two are independent and are only combined afterwards. Thanks @wenscarl for pointing this out.
For DeepSeekR1 FP8 with FlashInfer latency kernels (trtllm-gen) on 8xB200, batch size 32, TPOT improves from 23.35ms to 22.09ms (with the latest FlashInfer codebase), about a ~5.7% end-to-end improvement.
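For context, the overlap pattern looks roughly like this (a simplified sketch of the dual-stream idea, not the PR's exact code; gate, shared_experts, and fused_experts are placeholder callables):

import torch

def overlapped_moe(hidden_states, gate, shared_experts, fused_experts):
    main_stream = torch.cuda.current_stream()
    side_stream = torch.cuda.Stream()

    # Make the side stream see all prior work that produced hidden_states.
    side_stream.wait_stream(main_stream)
    with torch.cuda.stream(side_stream):
        shared_out = shared_experts(hidden_states)

    # Router/gate and the routed (selected) experts run concurrently on the
    # main stream while the shared experts run on the side stream.
    router_logits, _ = gate(hidden_states)
    routed_out = fused_experts(hidden_states, router_logits)

    # Re-join before combining the two independent outputs.
    main_stream.wait_stream(side_stream)
    return shared_out + routed_out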