[Core] Enable CUDA graphs for DP + All2All kernels #18724
Conversation
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
@bnellnm @youkaichao @tlrmchlsmth PTAL! Thanks.
```python
self.batched_hidden_states: Optional[torch.Tensor] = None
self.batched_router_logits: Optional[torch.Tensor] = None
if self.moe_parallel_config.use_pplx_kernels:
    act_dtype = torch.get_default_dtype()
```
Is this always correct? Can we make this a part of the moe config (even if we get it from the same place).
I am not sure if it is always correct. @mgoin, is there a better way to get the activation dtype here? Any pointers? Thanks.
Switched to using dtype from model_config after speaking to Michael IRL.
@bnellnm About moving act_dtype to the moe config - act_dtype is only used here; do you see value in storing it in the moe config anyway?
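For reference, a minimal sketch of the resulting buffer setup, assuming the activation dtype is read from `model_config.dtype` as discussed above; the buffer shapes and parameter names below are illustrative assumptions, not the exact fields in `FusedMoE` or `gpu_model_runner.py`:

```python
from typing import Optional

import torch


def allocate_batched_moe_buffers(model_config, max_num_tokens: int,
                                 hidden_size: int, num_experts: int,
                                 device: torch.device):
    # Sketch only: pre-allocate fixed-shape buffers once, so CUDA graph
    # capture/replay always sees the same tensor shapes.
    # Activation dtype comes from the model config (e.g. torch.bfloat16)
    # rather than torch.get_default_dtype(), per the review discussion.
    act_dtype = model_config.dtype
    batched_hidden_states: Optional[torch.Tensor] = torch.zeros(
        (max_num_tokens, hidden_size), dtype=act_dtype, device=device)
    batched_router_logits: Optional[torch.Tensor] = torch.zeros(
        (max_num_tokens, num_experts), dtype=act_dtype, device=device)
    return batched_hidden_states, batched_router_logits
```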
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
# Gather num_tokens across dp rank
num_tokens_across_dp = [0] * dp_size
num_tokens_across_dp[dp_rank] = num_tokens
num_tokens_tensor = torch.tensor(num_tokens_across_dp,
                                 device="cpu",
                                 dtype=torch.int32)
from vllm.distributed.parallel_state import get_dp_group
torch.distributed.all_reduce(num_tokens_tensor,
                             group=get_dp_group().cpu_group)
max_tokens_across_dp_cpu = torch.max(num_tokens_tensor).item()
return max_tokens_across_dp_cpu - num_tokens
```
Can't you get this from the forward_context?
This is called before the forward_context is set.
Looks like each DP rank is padding out to the maximum number of tokens? That's definitely too much padding in the chunked prefill case. Might be OK for disagg P/D.
BTW, could you factor this bit of code out so we don't duplicate it both here and in the forward_context?
Yeah - the padding is pretty aggressive - I can come back to this in a follow-up PR (I tried a simpler approach and ran into some deadlock issues).

> BTW, could you factor this bit of code out so we don't duplicate it both here and in the forward_context?
Refactored the code a bit 👍
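For readers following along, here is a self-contained sketch of the padding computation being factored out (not the exact helper that landed in this PR): each DP rank publishes its token count, all-reduces over the DP CPU group, and pads up to the group-wide maximum.

```python
import torch
import torch.distributed as dist


def get_dp_padding_sketch(num_tokens: int, dp_rank: int, dp_size: int,
                          cpu_group: dist.ProcessGroup) -> int:
    """Return how many padding tokens this DP rank needs so that every
    rank runs the forward pass with the same (maximum) token count."""
    # Each rank contributes its own count at its slot; the sum
    # all-reduce then leaves rank i's count in slot i.
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32)
    num_tokens_across_dp[dp_rank] = num_tokens
    dist.all_reduce(num_tokens_across_dp, group=cpu_group)
    max_tokens_across_dp = int(num_tokens_across_dp.max().item())
    return max_tokens_across_dp - num_tokens
```

As in the snippet above, the exchange runs over the CPU process group, so the token counts are agreed on before any captured GPU work is replayed.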
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
nice cleanup
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: amit <amit.man@gmail.com>
Enable CUDA Graphs for DP + All2All kernels.
Fixes:
- `get_dp_padding` method in `gpu_model_runner.py`.

Tests:
- Verified correctness using lm_eval locally on 4xH100.
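As background for the headline change, a toy, standalone illustration (not vLLM code) of why the DP padding above matters for CUDA graphs: a captured graph replays with fixed tensor shapes, so each rank copies its real batch into a padded, pre-allocated buffer before replay.

```python
import torch

# Toy example only: capture a trivial op at a fixed, padded shape and
# replay it with a smaller "real" batch copied into the static buffer.
static_input = torch.zeros(64, 128, device="cuda")   # padded max shape
static_output = torch.empty_like(static_input)

# Warm up on a side stream before capture (as recommended by PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output.copy_(static_input * 2.0)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(static_input * 2.0)

# Replay with 7 "real" tokens: fill the static buffer, replay, then slice
# out the valid rows. The padded rows are computed but ignored.
real_batch = torch.randn(7, 128, device="cuda")
static_input.zero_()
static_input[:7].copy_(real_batch)
g.replay()
result = static_output[:7]
```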