[BugFix] Fix data parallel #940
Conversation
Force-pushed d45d021 to e2b4a7f
Force-pushed 48dd5cf to dee8fc6
```diff
-from vllm.config import ParallelConfig
+from vllm.config import ParallelConfig, VllmConfig
+from vllm.v1.engine.core import DPEngineCoreProc
+from vllm.utils import stateless_init_torch_distributed_process_group
```
Do you mean vllm.distributed.utils ?
Fixed.
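For reference, the corrected import would look like the sketch below; only the module path changes, following the reviewer's pointer to `vllm.distributed.utils`:

```python
# Import the stateless process-group helper from vllm.distributed.utils,
# as suggested in the review comment above (not from vllm.utils).
from vllm.distributed.utils import stateless_init_torch_distributed_process_group
```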
Force-pushed ea92048 to f0a39a2
| return "vllm_ascend.compilation.piecewise_backend.NPUPiecewiseBackend" # noqa | ||
|
|
||
| @classmethod | ||
| def stateless_init_device_torch_dist_pg( |
where is this func called?
This function is called by vLLM itself. Please refer to this PR: vllm-project/vllm#18763
Makes sense, it's a change for the main branch.
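For readers following along: the hook is looked up on the active platform by vLLM's data-parallel engine when it builds the cross-DP process group. A rough sketch of the shape such an override takes is below; the class name, signature, and types are assumptions inferred from vllm-project/vllm#18763, not copied from this PR.

```python
# Hypothetical sketch of the platform-level process-group hook; the exact
# signature is an assumption based on vllm-project/vllm#18763.
from datetime import timedelta

from torch.distributed import PrefixStore, ProcessGroup


class NPUPlatformSketch:

    @classmethod
    def stateless_init_device_torch_dist_pg(
        cls,
        backend: str,
        prefix_store: PrefixStore,
        group_rank: int,
        group_size: int,
        timeout: timedelta,
    ) -> ProcessGroup:
        # vLLM calls this when it needs a device-side process group that is
        # not registered in the global torch.distributed state (e.g. the
        # group spanning data-parallel ranks). A real implementation would
        # wire up an HCCL backend here; this sketch only shows the shape.
        raise NotImplementedError("illustrative sketch only")
```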
```diff
@@ -0,0 +1,116 @@
+import torch
```
This file should be imported in patch_0_9_0/__init__.py.
Fixed.
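For context, wiring the new patch in is a one-line side-effect import in that package's `__init__.py`; a minimal sketch follows (the module name is hypothetical, since the thread does not show the new file's name):

```python
# patch_0_9_0/__init__.py
# Importing the module applies its monkey patches as a side effect.
# NOTE: "patch_data_parallel" is a hypothetical name for the new 116-line
# file added in this PR; the actual filename is not shown in the thread.
from . import patch_data_parallel  # noqa: F401
```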
…UWorker and DPEngineCoreProc
Co-authored-by: rjg-lyh <rjg-lyh@users.noreply.github.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```python
# NOTE(Yizhou): Since we do not set ASCEND_RT_VISIBLE_DEVICES in
# vllm_ascend, we need to set the device id manually.
local_dp_rank = self.vllm_config.parallel_config.data_parallel_rank_local
world_size = self.vllm_config.parallel_config.world_size
self.local_rank_across_dp = local_dp_rank * world_size + self.local_rank
```
I don't understand why you're doing this. vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly. @wangxiyuan
Correct; however, due to unresolved issues with this environment variable, it cannot be configured at runtime, which leads to multiple processes running on a single device.
You can revert this PR and run vllm/examples/offline_inference/data_parallel.py to reproduce the bug.
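To illustrate the workaround being discussed: with ASCEND_RT_VISIBLE_DEVICES left unset, every data-parallel rank sees all NPUs, so each worker must bind itself to its own card using the flat index computed in the snippet above. Below is a minimal sketch of that idea, not the PR's exact code; it assumes torch_npu is installed and uses example rank values.

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu backend

# Example values standing in for the config fields used in the PR snippet:
local_dp_rank = 1   # parallel_config.data_parallel_rank_local
world_size = 2      # parallel_config.world_size (ranks per DP replica)
local_rank = 0      # this worker's rank within its DP replica

# Each DP replica owns a contiguous block of `world_size` NPUs, so the flat
# device index is dp_rank * world_size + local_rank.
local_rank_across_dp = local_dp_rank * world_size + local_rank
torch.npu.set_device(torch.device(f"npu:{local_rank_across_dp}"))
```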
It's better to file an issue to record this. I remember a newer CANN and PyTorch version is required?
Same problem when using Ray with DP > 1 and TP > 1 (TP = 1 is fine): vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly, but processes with TP rank > 1 run on the wrong device.
### What this PR does / why we need it?

With this PR, we can migrate to the native `data_parallel.py` in vllm examples and remove the version in vllm-ascend.

At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable difficulties; therefore, we must employ a temporary workaround and manually specify the device.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

### Does this PR introduce any user-facing change?

### How was this patch tested?