[Bug] Fix DBO IMA issue for DeepEPHT #27666
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request fixes a critical illegal memory access error when using DBO with DeepEP High Throughput kernels. The fix involves adding proper synchronization between vLLM's compute stream and DeepEP's internal streams by capturing and passing a CUDA event. The changes look correct and address the reported issue. However, I've identified a potential issue in the newly introduced utility function dbo_get_previous_event which could be a source of bugs in the future due to its implicit behavior. I've suggested a change to make it more explicit and robust.
Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Purpose
Running the following command triggers an illegal memory access (IMA):
VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --no-enable-prefix-caching --enable-expert-parallel --enable-dbo --enforce-eager
This happens because we did not account for DeepEP's internal communication stream. This PR fixes that.
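For readers unfamiliar with this class of bug, the following is a minimal sketch of the general pattern in plain PyTorch, not the code in this PR. It assumes a hypothetical helper `consume_after_event` and uses a locally created `comm_stream` as a stand-in for a library-internal stream; it does not call vLLM's `dbo_get_previous_event` or any DeepEP API. The idea is to record a CUDA event on the compute stream and make the second stream wait on that event before it touches the data.

```python
# Guarded import so the sketch degrades gracefully where PyTorch is absent.
try:
    import torch
except ImportError:
    torch = None


def consume_after_event(n: int = 1024):
    """Produce a tensor on the compute stream, then read it on a second stream.

    Hypothetical illustration only. Returns None when no CUDA device is available.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    compute_stream = torch.cuda.current_stream()
    comm_stream = torch.cuda.Stream()  # stand-in for a library-internal stream

    # Work queued on the compute stream.
    x = torch.ones(n, device="cuda") * 2.0

    # Mark the point on the compute stream after which x is fully written.
    ready = torch.cuda.Event()
    ready.record(compute_stream)

    with torch.cuda.stream(comm_stream):
        # Without this wait, comm_stream could read x before it is written.
        comm_stream.wait_event(ready)
        total = x.sum()

    # Tell the caching allocator that x is also in use on comm_stream.
    x.record_stream(comm_stream)

    # The compute stream must wait for comm_stream before reading total.
    compute_stream.wait_stream(comm_stream)
    return total.item()


if __name__ == "__main__":
    print(consume_after_event())
```

Passing the event explicitly keeps the dependency visible at the call site, which is the same concern the review raises about implicit behavior in helper functions.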
Test
VLLM_ALL2ALL_BACKEND=deepep_high_throughput vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 4 --no-enable-prefix-caching --enable-expert-parallel --enable-dbo --enforce-eager
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /start_profile, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /stop_profile, Methods: POST
(APIServer pid=2481637) INFO 10-28 09:06:51 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=2481637) INFO: Started server process [2481637]
(APIServer pid=2481637) INFO: Waiting for application startup.
(APIServer pid=2481637) INFO: Application startup complete.