[DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY #23849
Conversation
Code Review
This pull request refactors the data parallel placement group creation logic in Ray to ensure that dp_size_local ranks are strictly packed onto the same node. This is an important change for use cases like DeepEP. While the overall direction is correct, I've identified a critical bug in the implementation that prevents scheduling on any node other than the master node, which would break multi-node data parallelism. My review includes a suggested fix for this issue.
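For context, here is a minimal sketch (not this PR's actual code) of how per-node strict packing can be expressed with Ray placement groups; `dp_size_local`, `num_nodes`, and the bundle shape are assumed placeholder values.

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

dp_size_local = 8   # DP ranks that must share a node (placeholder)
num_nodes = 2       # number of nodes to pick (placeholder)

# One placement group per node. STRICT_PACK forces every bundle in a group
# onto a single node, so DP ranks [0, dp_size_local) land on one node,
# [dp_size_local, 2 * dp_size_local) on another, and so on.
placement_groups = []
for _ in range(num_nodes):
    pg = placement_group(
        bundles=[{"GPU": 1, "CPU": 1}] * dp_size_local,
        strategy="STRICT_PACK",
    )
    ray.get(pg.ready())  # block until Ray can satisfy the strict packing
    placement_groups.append(pg)
```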
cc @njhill
vllm/envs.py
Outdated
| # - "strict": | ||
| # allocate exactly data-parallel-size-local DP ranks to each picked node; | ||
| # This environment variable is ignored if data-parallel-backend is not Ray. | ||
| "VLLM_RAY_DP_PACK_STRATEGY": lambda: os.getenv("VLLM_RAY_DP_PACK_STRATEGY", "fill"), |
shouldn't the default be strict?
updated
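As a usage note (an illustration, not taken from this PR), the strategy can be selected by setting the environment variable before vLLM builds its Ray data-parallel placement; assuming the usual lazy attribute lookup in `vllm.envs`, the chosen value is then visible there:

```python
import os

# Opt into strict packing before vLLM creates its Ray placement groups.
os.environ["VLLM_RAY_DP_PACK_STRATEGY"] = "strict"

import vllm.envs as envs

# With the definition above, the value is read lazily from the environment.
print(envs.VLLM_RAY_DP_PACK_STRATEGY)  # -> "strict"
```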
kouroshHakha
left a comment
needs one change
LGTM
njhill
left a comment
Thanks @ruisearch42 @kouroshHakha
Purpose
Currently we only strictly pack dp_size_local ranks onto the master node. However, DeepEP assumes that EP ranks [0, 7] are on the same node (likewise [8, 15], etc.) and uses CUDA IPC for communication among them. If this assumption is not satisfied, a runtime error is raised because CUDA IPC does not work across nodes. This PR fixes the issue by restricting the placement so that each picked node hosts exactly dp_size_local DP ranks.
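To make the failure mode concrete, here is a toy illustration (hypothetical, not vLLM's actual scheduling code) that assumes "fill" packs as many ranks as each node has free GPUs, while "strict" places exactly dp_size_local ranks on each picked node:

```python
def assign_ranks(free_gpus_per_node, dp_size_local, strategy):
    """Toy model of DP-rank-to-node assignment (illustrative only)."""
    assignment, next_rank = {}, 0
    for node, free in enumerate(free_gpus_per_node):
        if strategy == "fill":
            take = free
        else:  # "strict": a node is used only if it can host a full group
            take = dp_size_local if free >= dp_size_local else 0
        assignment[node] = list(range(next_rank, next_rank + take))
        next_rank += take
    return assignment

# Node 0 has only 6 free GPUs, node 1 has 8, and dp_size_local is 8.
print(assign_ranks([6, 8], 8, "fill"))
# fill: node 0 gets ranks 0-5, node 1 gets ranks 6-13 -> ranks 6 and 7 are
# separated from ranks 0-5, violating DeepEP's same-node assumption for [0, 7].
print(assign_ranks([6, 8], 8, "strict"))
# strict: node 0 is skipped, node 1 hosts ranks 0-7 together.
```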
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.