
Conversation

@Bam4d (Contributor) commented Sep 16, 2025

When using DP with CUDA graphs, if a batch does not fill all DP ranks, `execute_dummy_batch` is used to give the idle ranks work so that the MoE layers stay synchronized.

Without CUDA graphs enabled on that dummy path, `execute_dummy_batch` is significantly slower than normal batch execution, and because of the EP synchronization it slows down all of the ranks.
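
For context, a rough sketch of the situation being described (illustrative only; the `dp_step` helper and `has_work` flag are hypothetical, not vLLM code):

```python
# Illustrative only: expert-parallel MoE layers run collectives (e.g.
# all-to-all) across every DP rank, so a rank with no real requests must
# still run a forward pass, or every rank blocks at the collective.
def dp_step(worker, scheduler_output, has_work: bool):
    if has_work:
        return worker.execute_model(scheduler_output)
    # Idle rank: run a dummy batch purely to participate in the EP
    # collectives. If this path runs eagerly while the busy ranks use CUDA
    # graphs, the eager rank is the slowest and sets the pace for everyone.
    worker.execute_dummy_batch()
```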

Results (ITL, all values in ms):

| Config   | Metric     | Before | After |
|----------|------------|--------|-------|
| BS1 DP8  | Mean ITL   | 137.62 | 31.70 |
| BS1 DP8  | Median ITL | 131.40 | 30.82 |
| BS1 DP8  | P99 ITL    | 194.54 | 49.60 |
| BS32 DP8 | Mean ITL   | 30.49  | 30.10 |
| BS32 DP8 | Median ITL | 30.09  | 29.82 |
| BS32 DP8 | P99 ITL    | 46.84  | 35.32 |

Mean ITL at BS1 is significantly faster after the change.

@Bam4d changed the title from "Bam4d/fix bs lte dp" to "Fix slow execution when BS <= DP" on Sep 16, 2025
@mergify bot added the v1 label on Sep 16, 2025
@gemini-code-assist bot left a comment:

Code Review

This pull request addresses a performance issue with data parallelism (DP) when a batch does not fill all DP ranks. The change modifies execute_dummy_batch to pass appropriate parameters (uniform_decode and cudagraph_runtime_mode) to the underlying _dummy_run method. This ensures that the dummy batch execution path aligns with whether CUDA graphs are enabled or not, preventing a significant slowdown on idle ranks. The logic appears sound and the benchmarks provided demonstrate a clear performance improvement. However, I found a potential crash scenario in the implementation.

Comment on lines +491 to +497:

```python
eager = self.model_config.enforce_eager
cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
    else CUDAGraphMode.FULL
uniform_decode = not eager
self.model_runner._dummy_run(1,
                             uniform_decode=uniform_decode,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```
gemini-code-assist bot (severity: high):

The current logic for determining whether to use CUDA graphs for the dummy batch is incomplete. It only checks `self.model_config.enforce_eager` but misses `self.vllm_config.compilation_config.cudagraph_mode`.

If `enforce_eager` is `False` but `cudagraph_mode` is `NONE`, the current code will incorrectly attempt to run the dummy batch with `cudagraph_runtime_mode=CUDAGraphMode.FULL`. This will lead to an `AssertionError` in `_dummy_run` because the CUDA graph dispatcher will not have been initialized, causing the worker to crash.

The logic should be updated to consider both flags to correctly determine if a CUDA graph path should be used.

Suggested change:

```diff
-eager = self.model_config.enforce_eager
-cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
-    else CUDAGraphMode.FULL
-uniform_decode = not eager
-self.model_runner._dummy_run(1,
-                             uniform_decode=uniform_decode,
-                             cudagraph_runtime_mode=cudagraph_runtime_mode)
+use_graph = (not self.model_config.enforce_eager and
+             self.vllm_config.compilation_config.cudagraph_mode !=
+             CUDAGraphMode.NONE)
+cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
+    else CUDAGraphMode.NONE
+uniform_decode = use_graph
+self.model_runner._dummy_run(1,
+                             uniform_decode=uniform_decode,
+                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```

Collaborator:

I think we can just use the `cudagraph_mode` flag here and not use `enforce_eager` at all.
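
A minimal sketch of that simplification, reusing the config attributes from the suggestion above (illustrative, not the merged change):

```python
# Sketch: derive the dummy-run mode from compilation_config.cudagraph_mode
# alone, without consulting enforce_eager.
use_graph = (self.vllm_config.compilation_config.cudagraph_mode !=
             CUDAGraphMode.NONE)
cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
    else CUDAGraphMode.NONE
self.model_runner._dummy_run(1,
                             uniform_decode=use_graph,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```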

@robertgshaw2-redhat (Collaborator) commented:

Thank you for this PR!

@robertgshaw2-redhat changed the title from "Fix slow execution when BS <= DP" to "[DP/EP] Fix slow execution when BS <= DP" on Sep 16, 2025
@LucasWilkinson (Collaborator) left a comment:

Thank you for the contribution; this is a great catch! But we shouldn't assume `FULL` if not eager; `cudagraph_mode` controls this. I think the easiest would be to make `cudagraph_runtime_mode` optional and then let the dispatcher resolve the needed cudagraph mode:

```python
_cg_mode, batch_descriptor = \
    self.cudagraph_dispatcher.dispatch(
        BatchDescriptor(num_tokens=num_tokens,
                        uniform_decode=uniform_decode))
```
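
A rough sketch of how `_dummy_run` could resolve the mode itself when the caller omits it (making the parameter optional is the assumed change here, not code from this PR):

```python
# Inside _dummy_run (sketch): if no cudagraph_runtime_mode is passed, let
# the dispatcher pick the mode that matches this batch shape.
if cudagraph_runtime_mode is None:
    cudagraph_runtime_mode, batch_descriptor = \
        self.cudagraph_dispatcher.dispatch(
            BatchDescriptor(num_tokens=num_tokens,
                            uniform_decode=uniform_decode))
```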

@Bam4d (Contributor, Author) commented Sep 16, 2025:

> Thank you for the contribution; this is a great catch! But we shouldn't assume `FULL` if not eager; `cudagraph_mode` controls this. I think the easiest would be to make `cudagraph_runtime_mode` optional and then let the dispatcher resolve the needed cudagraph mode: [...]

So in the case of execute dummy, we set `force_attention=True` + `uniform_decode=True` and then let the dispatcher set `cudagraph_runtime_mode`?

I'm not sure what all the codepaths are that need to be taken care of here.

@LucasWilkinson (Collaborator) commented Sep 17, 2025:

> So in the case of execute dummy, we set `force_attention=True` + `uniform_decode=True` and then let the dispatcher set `cudagraph_runtime_mode`?
>
> I'm not sure what all the codepaths are that need to be taken care of here.

I would think the simplest change would be `self.model_runner._dummy_run(1, uniform_decode=True)`, to basically set it up to use a decode cudagraph; `uniform_decode` would result in the dummy batch being constructed via

```python
elif uniform_decode:
    num_reqs = num_tokens // max_query_len
    assert num_reqs <= max_num_reqs, \
        "Do not capture num_reqs > max_num_reqs for uniform batch"
    num_scheduled_tokens_list = [max_query_len] * num_reqs
    if num_tokens % max_query_len != 0:
        num_scheduled_tokens_list[-1] += num_tokens % max_query_len
```
and would map to a decode cudagraph naturally, but it seems #24526 modified that code:

*[screenshot of the modified code from #24526]*

so now it looks like we would end up with `num_reqs = 0` for spec decode (i.e. `max_query_len > 1`).
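
To make the arithmetic concrete (the values below are hypothetical, purely to illustrate the concern):

```python
# With spec decode, max_query_len exceeds 1, so a 1-token dummy run would
# schedule zero requests under the modified construction.
num_tokens = 1
max_query_len = 2  # hypothetical: one speculative token per request
num_reqs = num_tokens // max_query_len
print(num_reqs)  # 0 -> num_scheduled_tokens_list would be empty
```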

I'm following up on this change. Overall, large areas of this code need to be refactored :/

@simon-mo added this to the v0.10.3 milestone on Sep 19, 2025
@njhill changed the title from "[DP/EP] Fix slow execution when BS <= DP" to "[BugFix] [DP/EP] Fix slow execution when BS <= DP" on Sep 19, 2025
@LucasWilkinson (Collaborator) commented:

Update: @sighingnow will create a PR to roll back:

*[screenshot of the code to be rolled back]*

as this is no longer needed.

With that rolled back, I think this PR should be as simple as `self.model_runner._dummy_run(1, uniform_decode=True)`.
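
For reference, a minimal sketch of what `execute_dummy_batch` could reduce to under that proposal (the surrounding signature is an assumption; the call is the one quoted above):

```python
# Sketch only: a 1-token uniform-decode dummy run; the cudagraph dispatcher
# then selects the matching mode (graph or eager) for this shape.
def execute_dummy_batch(self) -> None:
    self.model_runner._dummy_run(1, uniform_decode=True)
```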

@MatthewBonanni (Contributor) commented:

> Update: @sighingnow will create a PR to roll back [...] as this is no longer needed.
>
> With that rolled back, I think this PR should be as simple as `self.model_runner._dummy_run(1, uniform_decode=True)`.

I've created the PR to roll this back: #25407

@MatthewBonanni (Contributor) commented:
@Bam4d I converted #25407 into a duplicate of this (listing you as author) because this needed to be merged today. We should be able to close this one. Thanks for your contribution!

@tlrmchlsmth (Member) commented:

Closing in favor of #25407.

@Bam4d (Contributor, Author) commented Sep 23, 2025:
Thanks!
