[Core] Enable CUDA graphs for DP + All2All kernels #18724
Conversation
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
@bnellnm @youkaichao @tlrmchlsmth PTAL! Thanks.
```python
self.batched_hidden_states: Optional[torch.Tensor] = None
self.batched_router_logits: Optional[torch.Tensor] = None
if self.moe_parallel_config.use_pplx_kernels:
    act_dtype = torch.get_default_dtype()
```
Is this always correct? Can we make this a part of the moe config (even if we get it from the same place).
I am not sure if it is always correct. @mgoin, is there a better way to get the activation dtype here? Any pointers? Thanks.
Switched to using dtype from model_config after speaking to Michael IRL.
@bnellnm About moving act_dtype to the moe config - act_dtype is only used here; do you see value in storing it in the moe config anyway?
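For reference, a minimal sketch of the resulting buffer setup, assuming the activation dtype is read from `model_config.dtype` as discussed above; the buffer shapes and parameter names below are illustrative assumptions, not the exact fields in `FusedMoE` or `gpu_model_runner.py`:

```python
from typing import Optional

import torch


def allocate_batched_moe_buffers(model_config, max_num_tokens: int,
                                 hidden_size: int, num_experts: int,
                                 device: torch.device):
    # Sketch only: pre-allocate fixed-shape buffers once, so CUDA graph
    # capture/replay always sees the same tensor shapes.
    # Activation dtype comes from the model config (e.g. torch.bfloat16)
    # rather than torch.get_default_dtype(), per the review discussion.
    act_dtype = model_config.dtype
    batched_hidden_states: Optional[torch.Tensor] = torch.zeros(
        (max_num_tokens, hidden_size), dtype=act_dtype, device=device)
    batched_router_logits: Optional[torch.Tensor] = torch.zeros(
        (max_num_tokens, num_experts), dtype=act_dtype, device=device)
    return batched_hidden_states, batched_router_logits
```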
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
# Gather num_tokens across dp rank
num_tokens_across_dp = [0] * dp_size
num_tokens_across_dp[dp_rank] = num_tokens
num_tokens_tensor = torch.tensor(num_tokens_across_dp,
                                 device="cpu",
                                 dtype=torch.int32)
from vllm.distributed.parallel_state import get_dp_group
torch.distributed.all_reduce(num_tokens_tensor,
                             group=get_dp_group().cpu_group)
max_tokens_across_dp_cpu = torch.max(num_tokens_tensor).item()
return max_tokens_across_dp_cpu - num_tokens
```
Can't you get this from the forward_context?
This is called before the forward_context is set.
Looks like each DP rank is padding out to the maximum number of tokens? That's definitely too much padding in the chunked prefill case. Might be OK for disagg P/D.
BTW, could you factor this bit of code out so we don't duplicate it both here and in the forward_context?
Yeah - the padding is pretty aggressive - I can come back to this in a follow-up PR (I tried a simpler approach and ran into some deadlock issues).

> BTW, could you factor this bit of code out so we don't duplicate it both here and in the forward_context?
Refactored the code a bit 👍
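For readers following along, here is a self-contained sketch of the padding computation being factored out (not the exact helper that landed in this PR): each DP rank publishes its token count, all-reduces over the DP CPU group, and pads up to the group-wide maximum.

```python
import torch
import torch.distributed as dist


def get_dp_padding_sketch(num_tokens: int, dp_rank: int, dp_size: int,
                          cpu_group: dist.ProcessGroup) -> int:
    """Return how many padding tokens this DP rank needs so that every
    rank runs the forward pass with the same (maximum) token count."""
    # Each rank contributes its own count at its slot; the sum
    # all-reduce then leaves rank i's count in slot i.
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32)
    num_tokens_across_dp[dp_rank] = num_tokens
    dist.all_reduce(num_tokens_across_dp, group=cpu_group)
    max_tokens_across_dp = int(num_tokens_across_dp.max().item())
    return max_tokens_across_dp - num_tokens
```

As in the snippet above, the exchange runs over the CPU process group, so the token counts are agreed on before any captured GPU work is replayed.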
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
nice cleanup
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: amit <amit.man@gmail.com>
Enable CUDA Graphs for DP + All2All kernels.
Fixes:
- `get_dp_padding` method in `gpu_model_runner.py`.

Tests:
- Verified correctness using lm_eval locally on 4xH100.
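As background for the headline change, a toy, standalone illustration (not vLLM code) of why the DP padding above matters for CUDA graphs: a captured graph replays with fixed tensor shapes, so each rank copies its real batch into a padded, pre-allocated buffer before replay.

```python
import torch

# Toy example only: capture a trivial op at a fixed, padded shape and
# replay it with a smaller "real" batch copied into the static buffer.
static_input = torch.zeros(64, 128, device="cuda")   # padded max shape
static_output = torch.empty_like(static_input)

# Warm up on a side stream before capture (as recommended by PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output.copy_(static_input * 2.0)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output.copy_(static_input * 2.0)

# Replay with 7 "real" tokens: fill the static buffer, replay, then slice
# out the valid rows. The padded rows are computed but ignored.
real_batch = torch.randn(7, 128, device="cuda")
static_input.zero_()
static_input[:7].copy_(real_batch)
g.replay()
result = static_output[:7]
```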