[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context #18935
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the smaller fastcheck CI runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hi @njhill @DarkLight1337 @jeejeelee (or anyone else who's available), would you mind taking a look at this PR? Thanks.
The branch was force-pushed from ca53fa8 to 9036074.
@tlrmchlsmth @varun-sundar-rabindranath can help take a look.
Thanks @izhuhaoran! Left some comments. PTAL. Thanks!
Thanks for the feedback, @varun-sundar-rabindranath. I've pushed updates addressing your comments. PTAL.
LGTM! Thanks @izhuhaoran 🚀
cc @tlrmchlsmth @bnellnm @simon-mo @youkaichao - Can you take another look and enable auto-merge please! Thanks 🙌
These changes look good (and a nice optimization, thanks @izhuhaoran!)
But I might be missing something. Are we padding num_tokens_after_padding
even when not using CUDA Graphs, @varun-sundar-rabindranath?
Thanks for the review, @tlrmchlsmth! That's a great point about the DP padding. From my perspective, it should indeed only apply when enforce_eager is False (CUDA graphs will be used). What do you think about refining the pad logic to something like this:

```python
# Padding for DP cuda graph only if enforce_eager is false
if not self.vllm_config.model_config.enforce_eager:
    num_pad, num_tokens_across_dp = self.get_dp_padding(num_input_tokens)
    num_input_tokens += num_pad
else:
    num_tokens_across_dp = None
```

Also cc @varun-sundar-rabindranath
I was talking to Varun over Slack about this as well -- I think this is a good temporary solution. Once added to the PR I think it will be ready to accept and merge. For P/D this will let us turn on eager for the prefill instance and use CUDA graphs for the decoder. We'll need to follow up later to make this more robust, but that can be a separate PR.
I agree that the padding isn't needed for the enforce_eager case. @izhuhaoran - your logic for skipping the padding looks good - but it'd be nice to make the if statement part of the early-exit logic in get_dp_padding.
Agreed, that's much cleaner. Thanks for the suggestion! I'll update this PR. |
I've updated the PR as suggested. DP padding is now correctly skipped when enforce_eager=True (via an early exit in get_dp_padding). PTAL. Thanks! @tlrmchlsmth @varun-sundar-rabindranath
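For readers following the thread, here is a rough, self-contained sketch of the early-exit shape being discussed. It is not the merged vLLM diff: the standalone function signature, the `dp_cpu_group` argument, and the return convention are assumptions made to keep the example runnable on its own.

```python
# Hypothetical sketch of the early-exit idea discussed above -- NOT the merged
# vLLM code. The explicit arguments stand in for the real model-runner state.
from typing import Optional

import torch
import torch.distributed as dist


def get_dp_padding(num_tokens: int,
                   dp_size: int,
                   dp_rank: int,
                   enforce_eager: bool,
                   dp_cpu_group) -> tuple[int, Optional[torch.Tensor]]:
    """Return (num_pad, num_tokens_across_dp) for this DP rank."""
    # Early exit: with a single DP rank, or in eager mode where CUDA-graph
    # padding is unnecessary, skip the cross-rank sync entirely.
    if dp_size == 1 or enforce_eager:
        return 0, None

    # One all-reduce to learn every DP rank's token count (sum of one-hot
    # contributions gives the per-rank counts).
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32)
    num_tokens_across_dp[dp_rank] = num_tokens
    dist.all_reduce(num_tokens_across_dp, group=dp_cpu_group)

    # Pad every rank up to the maximum so CUDA graphs see a uniform size.
    max_tokens = int(num_tokens_across_dp.max().item())
    num_tokens_after_padding = torch.full((dp_size,), max_tokens,
                                          dtype=torch.int32)

    # Returning the padded per-rank tensor lets the caller hand it to
    # set_forward_context, so DPMetadata does not need a second all-reduce.
    return max_tokens - num_tokens, num_tokens_after_padding
```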
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
LGTM, thank you!
Thanks for your time!
PR Description
This PR optimizes data parallel (DP) communication by reusing the `num_tokens_across_dp` tensor, thus avoiding a redundant all-reduce operation.

In the current implementation, both the `execute_model` and `_dummy_run` functions in `gpu_model_runner.py` call `get_dp_padding`. This method performs an all-reduce operation to gather the number of tokens on each DP rank (`num_tokens_across_dp`).

Subsequently, these functions call `set_forward_context`, which in turn calls `DPMetadata.make`. Inside `DPMetadata.make`, another all-reduce operation is performed for the exact same purpose: to determine the number of tokens on each DP rank.

This change modifies `get_dp_padding` to return the `num_tokens_across_dp` tensor it computes. This tensor is then passed to `DPMetadata.make` via `set_forward_context`. By doing so, we eliminate the redundant all-reduce in `DPMetadata.make`, reducing inter-GPU communication overhead.
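To illustrate the data flow described above, here is a hedged, self-contained sketch of how the precomputed tensor can be threaded through and reused. The helper name, the dataclass fields, and the `dp_cpu_group` argument are assumptions for illustration, not the literal vLLM implementation.

```python
# Hypothetical sketch of the reuse path described above; names and signatures
# are stand-ins, not the actual vLLM source.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.distributed as dist


def gather_num_tokens_across_dp(num_tokens: int, dp_size: int, dp_rank: int,
                                dp_cpu_group) -> torch.Tensor:
    # One all-reduce to learn every DP rank's token count.
    counts = torch.zeros(dp_size, dtype=torch.int32)
    counts[dp_rank] = num_tokens
    dist.all_reduce(counts, group=dp_cpu_group)
    return counts


@dataclass
class DPMetadata:
    num_tokens_across_dp: torch.Tensor


def make_dp_metadata(num_tokens: int, dp_size: int, dp_rank: int, dp_cpu_group,
                     num_tokens_across_dp: Optional[torch.Tensor] = None
                     ) -> DPMetadata:
    # Reuse the tensor already produced by get_dp_padding when it is given;
    # only fall back to a fresh all-reduce when None is passed (for example
    # on the eager path, where get_dp_padding skipped the gather).
    if num_tokens_across_dp is None:
        num_tokens_across_dp = gather_num_tokens_across_dp(
            num_tokens, dp_size, dp_rank, dp_cpu_group)
    return DPMetadata(num_tokens_across_dp)
```

With this shape, the CUDA-graph path pays for exactly one all-reduce per step, while the eager path still gets correct metadata through the fallback gather.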