[BugFix] [DP/EP] Fix slow execution when BS <= DP #24963
Conversation
Code Review
This pull request addresses a performance issue with data parallelism (DP) when a batch does not fill all DP ranks. The change modifies execute_dummy_batch to pass appropriate parameters (uniform_decode and cudagraph_runtime_mode) to the underlying _dummy_run method. This ensures that the dummy batch execution path aligns with whether CUDA graphs are enabled or not, preventing a significant slowdown on idle ranks. The logic appears sound and the benchmarks provided demonstrate a clear performance improvement. However, I found a potential crash scenario in the implementation.
```python
eager = self.model_config.enforce_eager
cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
    else CUDAGraphMode.FULL
uniform_decode = not eager
self.model_runner._dummy_run(1,
                             uniform_decode=uniform_decode,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```
The current logic for determining whether to use CUDA graphs for the dummy batch is incomplete. It only checks self.model_config.enforce_eager but misses self.vllm_config.compilation_config.cudagraph_mode.
If enforce_eager is False but cudagraph_mode is NONE, the current code will incorrectly attempt to run the dummy batch with cudagraph_runtime_mode=CUDAGraphMode.FULL. This will lead to an AssertionError in _dummy_run because the CUDA graph dispatcher will not have been initialized, causing the worker to crash.
The logic should be updated to consider both flags to correctly determine if a CUDA graph path should be used.
Suggested change:

```python
# Before
eager = self.model_config.enforce_eager
cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
    else CUDAGraphMode.FULL
uniform_decode = not eager
self.model_runner._dummy_run(1,
                             uniform_decode=uniform_decode,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)

# After
use_graph = (not self.model_config.enforce_eager and
             self.vllm_config.compilation_config.cudagraph_mode !=
             CUDAGraphMode.NONE)
cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
    else CUDAGraphMode.NONE
uniform_decode = use_graph
self.model_runner._dummy_run(1,
                             uniform_decode=uniform_decode,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```
I think we can just use the cudagraph_mode flag here and not use enforce_eager at all
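A minimal sketch of what that variant could look like, assuming (as the comment implies) that enforce_eager is already reflected in compilation_config.cudagraph_mode upstream, so checking the one flag suffices; names are taken from the diff above:

```python
# Sketch of the reviewer's suggestion: derive everything from cudagraph_mode,
# assuming enforce_eager already forces cudagraph_mode to NONE upstream.
use_graph = (self.vllm_config.compilation_config.cudagraph_mode !=
             CUDAGraphMode.NONE)
cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
    else CUDAGraphMode.NONE
self.model_runner._dummy_run(1,
                             uniform_decode=use_graph,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```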
Thank you for this PR!
Thank you for the contribution; this is a great catch! But we shouldn't assume FULL just because we're not eager, since cudagraph_mode controls this. I think the easiest fix would be to make cudagraph_runtime_mode optional and let the dispatcher resolve the needed CUDA graph mode:
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 2812 to 2815 in 4e5affe
```python
_cg_mode, batch_descriptor = \
    self.cudagraph_dispatcher.dispatch(
        BatchDescriptor(num_tokens=num_tokens,
                        uniform_decode=uniform_decode))
```
So in the case of execute_dummy, we set … I'm not sure what all the code paths are that need to be taken care of here.
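A rough sketch of that idea, using a hypothetical optional cudagraph_runtime_mode parameter on _dummy_run (not the actual vLLM signature) that defers to the dispatcher when unset; the dispatch call itself is from the snippet above:

```python
# Hypothetical sketch: when the caller passes no explicit mode, fall back to
# whatever the dispatcher resolves for this batch shape; an explicit argument
# still overrides it.
def _dummy_run(self, num_tokens, uniform_decode=False,
               cudagraph_runtime_mode=None):
    _cg_mode, batch_descriptor = \
        self.cudagraph_dispatcher.dispatch(
            BatchDescriptor(num_tokens=num_tokens,
                            uniform_decode=uniform_decode))
    if cudagraph_runtime_mode is None:
        cudagraph_runtime_mode = _cg_mode
    ...
```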
I would think the simplest change would be … (vllm/vllm/v1/worker/gpu_model_runner.py, lines 2837 to 2843 in 7ae9887).

So now it looks like we would end up with … I'm following up on this change. Overall, large areas of this code need to be refactored :/
Update: @sighingnow will create a PR to roll this back, as it is no longer needed. With that rolled back, I think this PR should be as simple as …
I've created the PR to roll this back: #25407

In favor of #25407.

Thanks!
When using DP and CUDA graphs, but running with a batch that does not fill all DP ranks, execute_dummy_batch is used to fill the idle ranks with work for the MoE layers. Without CUDA graphs enabled, execute_dummy_batch is significantly slower than the normal batch execution, which slows down all the ranks due to the synchronization required for EP.
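A toy illustration of the synchronization constraint (hypothetical code, not vLLM internals): the barrier stands in for the expert-parallel all-to-all that every DP rank must reach each step, which is why idle ranks still have to execute a dummy batch, and why one slow rank delays all the others.

```python
import threading

DP_SIZE = 4
# Stand-in for the EP all-to-all collective inside the MoE layer:
# every DP rank must arrive before any rank can proceed.
ep_all_to_all = threading.Barrier(DP_SIZE)

def forward(batch_size: int) -> None:
    # ... per-rank model work would happen here ...
    ep_all_to_all.wait()  # blocks until all DP ranks join the collective

def rank_step(has_work: bool) -> None:
    if has_work:
        forward(batch_size=32)  # real batch
    else:
        forward(batch_size=1)   # dummy batch, run only to join the collective

threads = [threading.Thread(target=rank_step, args=(i == 0,))
           for i in range(DP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```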
Before
- BS1 DP8: (benchmark screenshot)
- BS32 DP8: (benchmark screenshot)

After
- BS1 DP8: (benchmark screenshot)
- BS32 DP8: (benchmark screenshot)