
Conversation

@yewentao256 (Member) commented Oct 28, 2025

Purpose

VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --no-enable-prefix-caching --enable-expert-parallel --enable-dbo --enforce-eager

will trigger:

(EngineCore_DP0 pid=2344883)   File "/home/wentao/ep_kernels_workspace/DeepEP/deep_ep/buffer.py", line 393, in dispatch
(EngineCore_DP0 pid=2344883)     return forward_call(*args, **kwargs)
    self.runtime.intranode_dispatch(x, x_scales, topk_idx, topk_weights,
(EngineCore_DP0 pid=2344883)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883) RuntimeError: DeepEP error: CPU recv timeout
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 1184, in forward
(EngineCore_DP0 pid=2344883)     return self._finalize(
(EngineCore_DP0 pid=2344883)            ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 1070, in _finalize
(EngineCore_DP0 pid=2344883)     finalize_ret = self.prepare_finalize.finalize_async(
(EngineCore_DP0 pid=2344883)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py", line 385, in finalize_async
(EngineCore_DP0 pid=2344883)     receiver = self._finalize(
(EngineCore_DP0 pid=2344883)                ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/model_executor/layers/fused_moe/deepep_ht_prepare_finalize.py", line 338, in _finalize
(EngineCore_DP0 pid=2344883)     dbo_yield_and_switch_from_compute_to_comm()
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/ubatching.py", line 160, in wrapper
(EngineCore_DP0 pid=2344883)     func(ctx, *args, **kwargs)
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/ubatching.py", line 134, in yield_and_switch_from_compute_to_comm
(EngineCore_DP0 pid=2344883)     self._wait_compute_done()
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/ubatching.py", line 84, in _wait_compute_done
(EngineCore_DP0 pid=2344883)     self.comm_stream.wait_event(self.gpu_compute_done_event)
(EngineCore_DP0 pid=2344883)   File "/home/wentao/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 57, in wait_event
(EngineCore_DP0 pid=2344883)     event.wait(self)
(EngineCore_DP0 pid=2344883)   File "/home/wentao/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 203, in wait
(EngineCore_DP0 pid=2344883)     super().wait(stream)
(EngineCore_DP0 pid=2344883) torch.AcceleratorError: CUDA error: an illegal memory access was encountered

...

(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/gpu_model_runner.py", line 3464, in _dummy_run
(EngineCore_DP0 pid=2344883)     outputs = self.model(
(EngineCore_DP0 pid=2344883)               ^^^^^^^^^^^
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/gpu_ubatch_wrapper.py", line 466, in __call__
(EngineCore_DP0 pid=2344883)     return self._run_ubatches(ubatch_metadata, self.model)
(EngineCore_DP0 pid=2344883)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883)   File "/home/wentao/vllm-source/vllm/v1/worker/gpu_ubatch_wrapper.py", line 283, in _run_ubatches
(EngineCore_DP0 pid=2344883)     result = torch.cat(sorted_results, dim=0)
(EngineCore_DP0 pid=2344883)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2344883) RuntimeError: torch.cat(): expected a non-empty list of Tensors

This happens because we did not synchronize with DeepEP's internal communication stream; this PR fixes that.
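For context, the DBO switch in the traceback waits on `gpu_compute_done_event` from the comm stream; if work launched on a stream that is not covered by such an event runs ahead, a later kernel can read buffers that are still being written, which surfaces as the illegal memory access above. A minimal sketch of the event-based cross-stream ordering involved (the stream and variable names here are illustrative, not the exact identifiers used by vLLM or DeepEP):

```python
import torch

# Illustrative streams: names are assumptions, not vLLM/DeepEP identifiers.
compute_stream = torch.cuda.current_stream()
comm_stream = torch.cuda.Stream()

x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(compute_stream):
    y = x @ x  # "compute" work producing a buffer

# Record an event after the producing kernel on the compute stream ...
done = torch.cuda.Event()
done.record(compute_stream)

# ... and make the other stream wait on it before consuming the buffer.
# Without this wait, the kernel below could race with the matmul above.
comm_stream.wait_event(done)
with torch.cuda.stream(comm_stream):
    z = y.sum()  # consumer-side work that reads the buffer
```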

Test

VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --no-enable-prefix-caching --enable-expert-parallel --enable-dbo --enforce-eager

(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /start_profile, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /stop_profile, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2481637) INFO:     Started server process [2481637]
(APIServer pid=2481637) INFO:     Waiting for application startup.
(APIServer pid=2481637) INFO:     Application startup complete.

Signed-off-by: yewentao256 <zhyanwentao@126.com>
mergify bot added the v1 label Oct 28, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request fixes a critical illegal memory access error when using DBO with DeepEP High Throughput kernels. The fix involves adding proper synchronization between vLLM's compute stream and DeepEP's internal streams by capturing and passing a CUDA event. The changes look correct and address the reported issue. However, I've identified a potential issue in the newly introduced utility function dbo_get_previous_event which could be a source of bugs in the future due to its implicit behavior. I've suggested a change to make it more explicit and robust.
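As a rough illustration of the reviewer's concern, a helper that implicitly returns "the previous event" might look like the sketch below. This is an assumption about the general shape of `dbo_get_previous_event`, not its actual implementation in the PR:

```python
import torch
from typing import Optional

# Hypothetical module-level state; the real helper may be structured differently.
_previous_event: Optional[torch.cuda.Event] = None

def record_previous_event(stream: torch.cuda.Stream) -> None:
    """Record an event marking the work submitted to `stream` so far."""
    global _previous_event
    _previous_event = torch.cuda.Event()
    _previous_event.record(stream)

def get_previous_event() -> Optional[torch.cuda.Event]:
    """Return the last recorded event, or None if nothing was recorded.

    The implicit None case is what the review flags: a caller that forgets to
    check it silently skips the cross-stream wait instead of failing loudly.
    """
    return _previous_event
```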

yewentao256 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Oct 28, 2025
tlrmchlsmth merged commit b5d90f7 into main Oct 29, 2025
58 of 59 checks passed
tlrmchlsmth deleted the wentao-fix-dbo-IMA-issue branch October 29, 2025 20:28
MatthewBonanni pushed a commit to MatthewBonanni/vllm that referenced this pull request Oct 30, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Nov 12, 2025
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
