
Conversation

@Bam4d (Contributor) commented Sep 16, 2025

When using DP with CUDA graphs, if a batch does not fill all DP ranks, `execute_dummy_batch` is used to give the idle ranks work so that the MoE layers stay synchronized.

Without CUDA graphs enabled on that dummy path, `execute_dummy_batch` is significantly slower than normal batch execution, and because of the EP synchronization it slows down all of the ranks.
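
For context, a rough sketch of the situation being described (illustrative only; the `dp_step` helper and `has_work` flag are hypothetical, not vLLM code):

```python
# Illustrative only: expert-parallel MoE layers run collectives (e.g.
# all-to-all) across every DP rank, so a rank with no real requests must
# still run a forward pass, or every rank blocks at the collective.
def dp_step(worker, scheduler_output, has_work: bool):
    if has_work:
        return worker.execute_model(scheduler_output)
    # Idle rank: run a dummy batch purely to participate in the EP
    # collectives. If this path runs eagerly while the busy ranks use CUDA
    # graphs, the eager rank is the slowest and sets the pace for everyone.
    worker.execute_dummy_batch()
```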

Results (ITL, all values in ms):

| Config   | Metric     | Before | After |
|----------|------------|--------|-------|
| BS1 DP8  | Mean ITL   | 137.62 | 31.70 |
| BS1 DP8  | Median ITL | 131.40 | 30.82 |
| BS1 DP8  | P99 ITL    | 194.54 | 49.60 |
| BS32 DP8 | Mean ITL   | 30.49  | 30.10 |
| BS32 DP8 | Median ITL | 30.09  | 29.82 |
| BS32 DP8 | P99 ITL    | 46.84  | 35.32 |

Mean ITL at BS1 is significantly faster after the change.

@Bam4d changed the title from "Bam4d/fix bs lte dp" to "Fix slow execution when BS <= DP" on Sep 16, 2025
@mergify bot added the v1 label on Sep 16, 2025
@gemini-code-assist bot left a comment:

Code Review

This pull request addresses a performance issue with data parallelism (DP) when a batch does not fill all DP ranks. The change modifies execute_dummy_batch to pass appropriate parameters (uniform_decode and cudagraph_runtime_mode) to the underlying _dummy_run method. This ensures that the dummy batch execution path aligns with whether CUDA graphs are enabled or not, preventing a significant slowdown on idle ranks. The logic appears sound and the benchmarks provided demonstrate a clear performance improvement. However, I found a potential crash scenario in the implementation.

Comment on lines +491 to +497:

```python
eager = self.model_config.enforce_eager
cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
    else CUDAGraphMode.FULL
uniform_decode = not eager
self.model_runner._dummy_run(1,
                             uniform_decode=uniform_decode,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```
gemini-code-assist bot (severity: high):

The current logic for determining whether to use CUDA graphs for the dummy batch is incomplete. It only checks `self.model_config.enforce_eager` but misses `self.vllm_config.compilation_config.cudagraph_mode`.

If `enforce_eager` is `False` but `cudagraph_mode` is `NONE`, the current code will incorrectly attempt to run the dummy batch with `cudagraph_runtime_mode=CUDAGraphMode.FULL`. This will lead to an `AssertionError` in `_dummy_run` because the CUDA graph dispatcher will not have been initialized, causing the worker to crash.

The logic should be updated to consider both flags to correctly determine if a CUDA graph path should be used.

Suggested change:

```diff
-eager = self.model_config.enforce_eager
-cudagraph_runtime_mode = CUDAGraphMode.NONE if eager \
-    else CUDAGraphMode.FULL
-uniform_decode = not eager
-self.model_runner._dummy_run(1,
-                             uniform_decode=uniform_decode,
-                             cudagraph_runtime_mode=cudagraph_runtime_mode)
+use_graph = (not self.model_config.enforce_eager and
+             self.vllm_config.compilation_config.cudagraph_mode !=
+             CUDAGraphMode.NONE)
+cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
+    else CUDAGraphMode.NONE
+uniform_decode = use_graph
+self.model_runner._dummy_run(1,
+                             uniform_decode=uniform_decode,
+                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```

Collaborator:

I think we can just use the `cudagraph_mode` flag here and not use `enforce_eager` at all.
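
A minimal sketch of that simplification, reusing the config attributes from the suggestion above (illustrative, not the merged change):

```python
# Sketch: derive the dummy-run mode from compilation_config.cudagraph_mode
# alone, without consulting enforce_eager.
use_graph = (self.vllm_config.compilation_config.cudagraph_mode !=
             CUDAGraphMode.NONE)
cudagraph_runtime_mode = CUDAGraphMode.FULL if use_graph \
    else CUDAGraphMode.NONE
self.model_runner._dummy_run(1,
                             uniform_decode=use_graph,
                             cudagraph_runtime_mode=cudagraph_runtime_mode)
```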

@robertgshaw2-redhat (Collaborator) commented:

Thank you for this PR!

@robertgshaw2-redhat changed the title from "Fix slow execution when BS <= DP" to "[DP/EP] Fix slow execution when BS <= DP" on Sep 16, 2025
@LucasWilkinson (Collaborator) left a comment:

Thank you for the contribution; this is a great catch! But we shouldn't assume `FULL` if not eager; `cudagraph_mode` controls this. I think the easiest would be to make `cudagraph_runtime_mode` optional and then let the dispatcher resolve the needed cudagraph mode:

```python
_cg_mode, batch_descriptor = \
    self.cudagraph_dispatcher.dispatch(
        BatchDescriptor(num_tokens=num_tokens,
                        uniform_decode=uniform_decode))
```
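
A rough sketch of how `_dummy_run` could resolve the mode itself when the caller omits it (making the parameter optional is the assumed change here, not code from this PR):

```python
# Inside _dummy_run (sketch): if no cudagraph_runtime_mode is passed, let
# the dispatcher pick the mode that matches this batch shape.
if cudagraph_runtime_mode is None:
    cudagraph_runtime_mode, batch_descriptor = \
        self.cudagraph_dispatcher.dispatch(
            BatchDescriptor(num_tokens=num_tokens,
                            uniform_decode=uniform_decode))
```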

@Bam4d (Contributor, Author) commented Sep 16, 2025:

> Thank you for the contribution; this is a great catch! But we shouldn't assume `FULL` if not eager; `cudagraph_mode` controls this. I think the easiest would be to make `cudagraph_runtime_mode` optional and then let the dispatcher resolve the needed cudagraph mode: [...]

So in the case of execute dummy, we set `force_attention=True` + `uniform_decode=True` and then let the dispatcher set `cudagraph_runtime_mode`?

I'm not sure what all the codepaths are that need to be taken care of here.

@LucasWilkinson (Collaborator) commented Sep 17, 2025:

> So in the case of execute dummy, we set `force_attention=True` + `uniform_decode=True` and then let the dispatcher set `cudagraph_runtime_mode`?
>
> I'm not sure what all the codepaths are that need to be taken care of here.

I would think the simplest change would be `self.model_runner._dummy_run(1, uniform_decode=True)`, to basically set it up to use a decode cudagraph; `uniform_decode` would result in the dummy batch being constructed via

```python
elif uniform_decode:
    num_reqs = num_tokens // max_query_len
    assert num_reqs <= max_num_reqs, \
        "Do not capture num_reqs > max_num_reqs for uniform batch"
    num_scheduled_tokens_list = [max_query_len] * num_reqs
    if num_tokens % max_query_len != 0:
        num_scheduled_tokens_list[-1] += num_tokens % max_query_len
```
and would map to a decode cudagraph naturally, but it seems #24526 modified that code:

*[screenshot of the modified code from #24526]*

so now it looks like we would end up with `num_reqs = 0` for spec decode (i.e. `max_query_len > 1`).
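
To make the arithmetic concrete (the values below are hypothetical, purely to illustrate the concern):

```python
# With spec decode, max_query_len exceeds 1, so a 1-token dummy run would
# schedule zero requests under the modified construction.
num_tokens = 1
max_query_len = 2  # hypothetical: one speculative token per request
num_reqs = num_tokens // max_query_len
print(num_reqs)  # 0 -> num_scheduled_tokens_list would be empty
```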

I'm following up on this change. Overall, large areas of this code need to be refactored :/

@simon-mo added this to the v0.10.3 milestone on Sep 19, 2025
@njhill changed the title from "[DP/EP] Fix slow execution when BS <= DP" to "[BugFix] [DP/EP] Fix slow execution when BS <= DP" on Sep 19, 2025
@LucasWilkinson (Collaborator) commented:

Update: @sighingnow will create a PR to roll back:

*[screenshot of the code to be rolled back]*

as this is no longer needed.

With that rolled back, I think this PR should be as simple as `self.model_runner._dummy_run(1, uniform_decode=True)`.
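
For reference, a minimal sketch of what `execute_dummy_batch` could reduce to under that proposal (the surrounding signature is an assumption; the call is the one quoted above):

```python
# Sketch only: a 1-token uniform-decode dummy run; the cudagraph dispatcher
# then selects the matching mode (graph or eager) for this shape.
def execute_dummy_batch(self) -> None:
    self.model_runner._dummy_run(1, uniform_decode=True)
```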

@MatthewBonanni (Contributor) commented:

> Update: @sighingnow will create a PR to roll back [...] as this is no longer needed.
>
> With that rolled back, I think this PR should be as simple as `self.model_runner._dummy_run(1, uniform_decode=True)`.

I've created the PR to roll this back: #25407

@MatthewBonanni (Contributor) commented:
@Bam4d I converted #25407 into a duplicate of this (listing you as author) because this needed to be merged today. We should be able to close this one. Thanks for your contribution!

@tlrmchlsmth (Member) commented:

Closing in favor of #25407.

@Bam4d (Contributor, Author) commented Sep 23, 2025:
Thanks!
