Conversation

@yiz-liu (Collaborator) commented May 23, 2025

What this PR does / why we need it?

With this PR, we can migrate to the native `data_parallel.py` in the vLLM examples and remove the version in vllm-ascend.

At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable difficulties; therefore, we must employ a temporary workaround and manually specify the device.

Does this PR introduce any user-facing change?

How was this patch tested?

@yiz-liu yiz-liu force-pushed the fix-dp branch 2 times, most recently from d45d021 to e2b4a7f Compare May 26, 2025 02:43
@yiz-liu yiz-liu force-pushed the fix-dp branch 2 times, most recently from 48dd5cf to dee8fc6 Compare June 6, 2025 09:22
- from vllm.config import ParallelConfig
+ from vllm.config import ParallelConfig, VllmConfig
+ from vllm.v1.engine.core import DPEngineCoreProc
+ from vllm.utils import stateless_init_torch_distributed_process_group
Collaborator

Do you mean vllm.distributed.utils?

Collaborator Author

Fixed.
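Presumably the fix was to import the helper from where it actually lives; a one-line sketch based on the reviewer's suggestion (the exact import in the merged code may differ):

```python
# Assumed correction: the helper lives in vllm.distributed.utils, not vllm.utils.
from vllm.distributed.utils import stateless_init_torch_distributed_process_group
```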

@yiz-liu yiz-liu force-pushed the fix-dp branch 2 times, most recently from ea92048 to f0a39a2 Compare June 6, 2025 12:28
@wangxiyuan wangxiyuan added the `ready` (read for review) label Jun 6, 2025
return "vllm_ascend.compilation.piecewise_backend.NPUPiecewiseBackend" # noqa

@classmethod
def stateless_init_device_torch_dist_pg(
@wangxiyuan (Collaborator) Jun 7, 2025

where is this func called?

Collaborator

This function is called by vLLM. Please refer to this PR: vllm-project/vllm#18763
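For context, vllm-project/vllm#18763 lets each platform supply its own stateless process-group constructor, which vLLM then calls during data-parallel setup. A heavily simplified sketch of the shape of such a hook; the class name, parameter list, and the delegation to vLLM's helper are illustrative assumptions, not the merged signature:

```python
from vllm.distributed.utils import stateless_init_torch_distributed_process_group


class NPUPlatform:  # illustrative stand-in for the real vllm_ascend platform class
    @classmethod
    def stateless_init_device_torch_dist_pg(cls, host, port, rank, world_size):
        # Assumption: delegate to vLLM's stateless helper, requesting HCCL
        # (the Ascend collective backend) instead of NCCL.
        return stateless_init_torch_distributed_process_group(
            host, port, rank, world_size, backend="hccl")
```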

Collaborator

Makes sense, it's a change for the main branch.

@@ -0,0 +1,116 @@
import torch
Collaborator

this file should be imported in patch_0_9_0/__init__.py

Collaborator

Fixed.
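For reference, registering such a patch usually amounts to importing the module for its side effects in the package's `__init__.py`; a minimal sketch, where `patch_data_parallel` is a hypothetical name standing in for the new file:

```python
# patch_0_9_0/__init__.py (sketch)
# Importing the patch module applies its monkey patches as a side effect;
# "patch_data_parallel" is a hypothetical module name, not the file's actual name.
from . import patch_data_parallel  # noqa: F401
```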

…UWorker and DPEngineCoreProc

Co-authored-by: rjg-lyh <rjg-lyh@users.noreply.github.com>

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@wangxiyuan wangxiyuan merged commit 6003afa into vllm-project:main Jun 9, 2025
22 of 23 checks passed
@yiz-liu yiz-liu deleted the fix-dp branch June 10, 2025 09:32
Comment on lines +79 to +85

# NOTE(Yizhou): Since we do not set ASCEND_RT_VISIBLE_DEVICES in
# vllm_ascend, we need to set the device id manually.
local_dp_rank = self.vllm_config.parallel_config.data_parallel_rank_local
world_size = self.vllm_config.parallel_config.world_size
self.local_rank_across_dp = local_dp_rank * world_size + self.local_rank
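A small worked example of what this mapping yields (illustrative values only; `world_size` here is the size of one DP replica, i.e. TP × PP):

```python
# Sketch: how local_rank_across_dp maps ranks to NPU device ids on one node,
# assuming 2 data-parallel replicas with world_size=2 (e.g. TP=2, PP=1) each.
world_size = 2
for local_dp_rank in range(2):            # DP replicas on this node
    for local_rank in range(world_size):  # ranks inside one replica
        device_id = local_dp_rank * world_size + local_rank
        print(f"DP {local_dp_rank}, rank {local_rank} -> NPU {device_id}")
# DP 0, rank 0 -> NPU 0
# DP 0, rank 1 -> NPU 1
# DP 1, rank 0 -> NPU 2
# DP 1, rank 1 -> NPU 3
```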

Collaborator

I don't understand why you're doing this. vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly. @wangxiyuan

Collaborator Author

Correct; however, due to unresolved issues with this environment variable, it cannot be set at runtime, which leads to multiple processes running on a single device.

You can try reverting this PR and running vllm/examples/offline_inference/data_parallel.py; it should reproduce the bug.

Collaborator

It's better to file an issue to record this. I remember a newer version of CANN and PyTorch is required?

@1StepForever commented Jun 19, 2025

Same problem when using Ray with DP > 1 and TP > 1 (TP = 1 is fine): vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly, but processes with TP rank > 1 run on the wrong device.

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Labels: module:core, ready (read for review)

6 participants