[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context #18935
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only the smaller fastcheck CI runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hi @njhill @DarkLight1337 @jeejeelee (or anyone else who's available), would you mind taking a look at this PR? Thanks.
The branch was force-pushed from ca53fa8 to 9036074.
@tlrmchlsmth @varun-sundar-rabindranath can help take a look.
Thanks @izhuhaoran! Left some comments. PTAL. Thanks!
Thanks for the feedback, @varun-sundar-rabindranath. I've pushed updates addressing your comments. PTAL.
LGTM! Thanks @izhuhaoran 🚀
cc @tlrmchlsmth @bnellnm @simon-mo @youkaichao - Can you take another look and enable auto-merge please! Thanks 🙌
These changes look good (and a nice optimization, thanks @izhuhaoran!)
But I might be missing something. Are we padding num_tokens_after_padding
even when not using CUDA Graphs, @varun-sundar-rabindranath?
Thanks for the review, @tlrmchlsmth! That's a great point about the DP padding. From my perspective, it should indeed only apply when enforce_eager is False (CUDA graphs will be used). What do you think about refining the pad logic to something like this:

```python
# Padding for DP cuda graph only if enforce_eager is false
if not self.vllm_config.model_config.enforce_eager:
    num_pad, num_tokens_across_dp = self.get_dp_padding(num_input_tokens)
    num_input_tokens += num_pad
else:
    num_tokens_across_dp = None
```

Also cc @varun-sundar-rabindranath
I was talking to Varun over Slack about this as well -- I think this is a good temporary solution. Once added to the PR I think it will be ready to accept and merge. For P/D this will let us turn on eager for the prefill instance and use CUDA graphs for the decoder. We'll need to follow up later to make this more robust, but that can be a separate PR.
I agree that the padding isn't needed for the enforce_eager case. @izhuhaoran - your logic for skipping the padding looks good - but it'd be nice to make the if statement part of the early-exit logic in get_dp_padding.
Agreed, that's much cleaner. Thanks for the suggestion! I'll update this PR. |
I've updated the PR as suggested. DP padding is now correctly skipped when enforce_eager=True (via an early exit in get_dp_padding). PTAL. Thanks! @tlrmchlsmth @varun-sundar-rabindranath
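For readers following the thread, here is a rough, self-contained sketch of the early-exit shape being discussed. It is not the merged vLLM diff: the standalone function signature, the `dp_cpu_group` argument, and the return convention are assumptions made to keep the example runnable on its own.

```python
# Hypothetical sketch of the early-exit idea discussed above -- NOT the merged
# vLLM code. The explicit arguments stand in for the real model-runner state.
from typing import Optional

import torch
import torch.distributed as dist


def get_dp_padding(num_tokens: int,
                   dp_size: int,
                   dp_rank: int,
                   enforce_eager: bool,
                   dp_cpu_group) -> tuple[int, Optional[torch.Tensor]]:
    """Return (num_pad, num_tokens_across_dp) for this DP rank."""
    # Early exit: with a single DP rank, or in eager mode where CUDA-graph
    # padding is unnecessary, skip the cross-rank sync entirely.
    if dp_size == 1 or enforce_eager:
        return 0, None

    # One all-reduce to learn every DP rank's token count (sum of one-hot
    # contributions gives the per-rank counts).
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32)
    num_tokens_across_dp[dp_rank] = num_tokens
    dist.all_reduce(num_tokens_across_dp, group=dp_cpu_group)

    # Pad every rank up to the maximum so CUDA graphs see a uniform size.
    max_tokens = int(num_tokens_across_dp.max().item())
    num_tokens_after_padding = torch.full((dp_size,), max_tokens,
                                          dtype=torch.int32)

    # Returning the padded per-rank tensor lets the caller hand it to
    # set_forward_context, so DPMetadata does not need a second all-reduce.
    return max_tokens - num_tokens, num_tokens_after_padding
```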
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
LGTM, thank you!
Thanks for your time!
PR Description
This PR optimizes data parallel (DP) communication by reusing the `num_tokens_across_dp` tensor, thus avoiding a redundant all-reduce operation.

In the current implementation, both the `execute_model` and `_dummy_run` functions in `gpu_model_runner.py` call `get_dp_padding`. This method performs an all-reduce operation to gather the number of tokens on each DP rank (`num_tokens_across_dp`).

Subsequently, these functions call `set_forward_context`, which in turn calls `DPMetadata.make`. Inside `DPMetadata.make`, another all-reduce operation is performed for the exact same purpose: to determine the number of tokens on each DP rank.

This change modifies `get_dp_padding` to return the `num_tokens_across_dp` tensor it computes. This tensor is then passed to `DPMetadata.make` via `set_forward_context`. By doing so, we eliminate the redundant all-reduce in `DPMetadata.make`, reducing inter-GPU communication overhead.
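To illustrate the data flow described above, here is a hedged, self-contained sketch of how the precomputed tensor can be threaded through and reused. The helper name, the dataclass fields, and the `dp_cpu_group` argument are assumptions for illustration, not the literal vLLM implementation.

```python
# Hypothetical sketch of the reuse path described above; names and signatures
# are stand-ins, not the actual vLLM source.
from dataclasses import dataclass
from typing import Optional

import torch
import torch.distributed as dist


def gather_num_tokens_across_dp(num_tokens: int, dp_size: int, dp_rank: int,
                                dp_cpu_group) -> torch.Tensor:
    # One all-reduce to learn every DP rank's token count.
    counts = torch.zeros(dp_size, dtype=torch.int32)
    counts[dp_rank] = num_tokens
    dist.all_reduce(counts, group=dp_cpu_group)
    return counts


@dataclass
class DPMetadata:
    num_tokens_across_dp: torch.Tensor


def make_dp_metadata(num_tokens: int, dp_size: int, dp_rank: int, dp_cpu_group,
                     num_tokens_across_dp: Optional[torch.Tensor] = None
                     ) -> DPMetadata:
    # Reuse the tensor already produced by get_dp_padding when it is given;
    # only fall back to a fresh all-reduce when None is passed (for example
    # on the eager path, where get_dp_padding skipped the gather).
    if num_tokens_across_dp is None:
        num_tokens_across_dp = gather_num_tokens_across_dp(
            num_tokens, dp_size, dp_rank, dp_cpu_group)
    return DPMetadata(num_tokens_across_dp)
```

With this shape, the CUDA-graph path pays for exactly one all-reduce per step, while the eager path still gets correct metadata through the fallback gather.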