[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context #18935


Merged

Conversation

@izhuhaoran (Contributor) commented May 30, 2025

PR Description

This PR optimizes data parallel (DP) communication by reusing the num_tokens_across_dp tensor, thus avoiding a redundant all-reduce operation.

In the current implementation, both execute_model and _dummy_run functions in gpu_model_runner.py call get_dp_padding. This method performs an all-reduce operation to gather the number of tokens on each DP rank (num_tokens_across_dp).

Subsequently, these functions call set_forward_context, which in turn calls DPMetadata.make. Inside DPMetadata.make, another all-reduce operation is performed for the exact same purpose – to determine the number of tokens on each DP rank.

This change modifies get_dp_padding to return the num_tokens_across_dp tensor it computes. This tensor is then passed to DPMetadata.make via set_forward_context. By doing so, we eliminate the redundant all-reduce in DPMetadata.make, reducing inter-GPU communication overhead.
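
For illustration, a minimal, self-contained sketch of the data flow described above. The function and class names mirror those mentioned in this PR, but the signatures are simplified and the all-reduce is stubbed out with a counter so the example runs standalone; it is not the actual vLLM implementation.

# Hypothetical, simplified sketch of the change; not the real vLLM code.
from dataclasses import dataclass
from typing import Optional

ALL_REDUCE_CALLS = 0


def gather_num_tokens_across_dp(num_tokens: int, dp_size: int) -> list[int]:
    # Stand-in for the cross-DP all-reduce that collects per-rank token counts.
    global ALL_REDUCE_CALLS
    ALL_REDUCE_CALLS += 1
    return [num_tokens] * dp_size  # pretend every rank has the same count


@dataclass
class DPMetadata:
    num_tokens_across_dp: list[int]

    @staticmethod
    def make(num_tokens: int, dp_size: int,
             num_tokens_across_dp: Optional[list[int]] = None) -> "DPMetadata":
        # After this PR: only issue the collective if the caller did not
        # already gather the per-rank token counts.
        if num_tokens_across_dp is None:
            num_tokens_across_dp = gather_num_tokens_across_dp(num_tokens, dp_size)
        return DPMetadata(num_tokens_across_dp)


def get_dp_padding(num_tokens: int, dp_size: int) -> tuple[int, list[int]]:
    # One collective to learn every rank's token count, then pad to the max.
    num_tokens_across_dp = gather_num_tokens_across_dp(num_tokens, dp_size)
    return max(num_tokens_across_dp) - num_tokens, num_tokens_across_dp


def execute_model(num_tokens: int, dp_size: int = 2) -> DPMetadata:
    num_pad, num_tokens_across_dp = get_dp_padding(num_tokens, dp_size)
    num_tokens += num_pad
    # set_forward_context would forward the gathered tensor to DPMetadata.make,
    # so the second collective is skipped.
    return DPMetadata.make(num_tokens, dp_size, num_tokens_across_dp)


if __name__ == "__main__":
    execute_model(16)
    print(f"collectives issued: {ALL_REDUCE_CALLS}")  # prints 1 instead of 2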


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of that by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@izhuhaoran (Contributor, Author) commented May 30, 2025

Hi @njhill @DarkLight1337 @jeejeelee (or anyone else who's available), would you mind taking a look at this PR? Thanks.

@izhuhaoran izhuhaoran marked this pull request as draft May 30, 2025 06:02
@izhuhaoran izhuhaoran force-pushed the remove-unnecessary-dp-allreduce branch from ca53fa8 to 9036074 Compare May 30, 2025 06:37
@izhuhaoran izhuhaoran marked this pull request as ready for review May 30, 2025 06:37
@youkaichao youkaichao requested a review from tlrmchlsmth May 30, 2025 09:07
@youkaichao (Member)

@tlrmchlsmth @varun-sundar-rabindranath can help take a look.

@varun-sundar-rabindranath (Contributor)

Thanks @izhuhaoran ! Left some comments. PTAL. Thanks!

@izhuhaoran (Contributor, Author)


Thanks for the feedback, @varun-sundar-rabindranath. I've pushed updates addressing your comments. PTAL.

@varun-sundar-rabindranath (Contributor) left a comment


LGTM! Thanks @izhuhaoran 🚀

@varun-sundar-rabindranath (Contributor)

cc @tlrmchlsmth @bnellnm @simon-mo @youkaichao, can you take another look and enable auto-merge please? Thanks 🙌

@tlrmchlsmth (Collaborator) left a comment


These changes look good (and a nice optimization, thanks @izhuhaoran!)

But I might be missing something. Are we padding num_tokens_after_padding even when not using CUDA Graphs, @varun-sundar-rabindranath?

@izhuhaoran (Contributor, Author)


Thanks for the review, @tlrmchlsmth !

That's a great point about the DP padding. From my perspective, it should indeed only apply when enforce_eager is False (i.e., when CUDA graphs will be used).

What do you think about refining the pad logic to something like this:

        # Padding for the DP CUDA graph only if enforce_eager is False
        if not self.vllm_config.model_config.enforce_eager:
            num_pad, num_tokens_across_dp = self.get_dp_padding(num_input_tokens)
            num_input_tokens += num_pad
        else:
            num_tokens_across_dp = None

Also cc @varun-sundar-rabindranath

@izhuhaoran izhuhaoran requested a review from tlrmchlsmth May 31, 2025 17:47
@tlrmchlsmth (Collaborator)


I was talking to Varun over Slack about this as well. I think this is a good temporary solution; once added to the PR, it will be ready to accept and merge. For P/D, this will let us turn on eager mode for the prefill instance and use CUDA graphs for the decoder. We'll need to follow up later to make this more robust, but that can be a separate PR.

@varun-sundar-rabindranath (Contributor)

I agree that the padding isn't needed for the enforce-eager case.

@izhuhaoran, your logic for skipping the padding looks good, but it'd be nice to make the if statement part of the early-exit logic in the get_dp_padding function, something like:

if dp_size == 1 or self.vllm_config.model_config.enforce_eager:
    return 0, None

@izhuhaoran (Contributor, Author)


Agreed, that's much cleaner. Thanks for the suggestion! I'll update this PR.

@izhuhaoran (Contributor, Author)

I've updated the PR as suggested. DP padding is now correctly skipped when enforce_eager=True (via an early exit in get_dp_padding). PTAL. Thanks! @tlrmchlsmth @varun-sundar-rabindranath
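
For reference, a runnable illustration of the early-exit behaviour discussed above. EagerishRunner and its config stand-ins are made up for the example and the cross-DP gather is stubbed out; only the early-exit condition mirrors the snippet quoted in this thread, so this is a sketch rather than the actual vLLM implementation.

# Hypothetical stand-in classes; not vLLM code.
from types import SimpleNamespace
from typing import Optional


class EagerishRunner:
    def __init__(self, dp_size: int, enforce_eager: bool):
        self.vllm_config = SimpleNamespace(
            model_config=SimpleNamespace(enforce_eager=enforce_eager))
        self.dp_size = dp_size

    def get_dp_padding(self, num_tokens: int) -> tuple[int, Optional[list[int]]]:
        # Early exit: with a single DP rank or in eager mode there is nothing
        # to pad and no collective to issue.
        if self.dp_size == 1 or self.vllm_config.model_config.enforce_eager:
            return 0, None
        # Otherwise gather every rank's token count (stubbed here) and pad to the max.
        num_tokens_across_dp = [num_tokens] * self.dp_size
        return max(num_tokens_across_dp) - num_tokens, num_tokens_across_dp


if __name__ == "__main__":
    print(EagerishRunner(dp_size=2, enforce_eager=True).get_dp_padding(16))   # (0, None)
    print(EagerishRunner(dp_size=2, enforce_eager=False).get_dp_padding(16))  # (0, [16, 16])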

Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
@tlrmchlsmth (Collaborator) left a comment


LGTM, thank you!

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 1, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 1, 2025 16:17
@izhuhaoran (Contributor, Author)


Thanks for your time!

@tlrmchlsmth tlrmchlsmth merged commit d6fd3a3 into vllm-project:main Jun 1, 2025
71 checks passed
Labels: ready, v1