[BugFix] Fix data parallel #940
Conversation
Force-pushed d45d021 to e2b4a7f
Force-pushed 48dd5cf to dee8fc6
```diff
-from vllm.config import ParallelConfig
+from vllm.config import ParallelConfig, VllmConfig
+from vllm.v1.engine.core import DPEngineCoreProc
+from vllm.utils import stateless_init_torch_distributed_process_group
```
Do you mean vllm.distributed.utils ?
Fixed.
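For reference, the corrected import would look like the sketch below; only the module path changes, following the reviewer's pointer to `vllm.distributed.utils`:

```python
# Import the stateless process-group helper from vllm.distributed.utils,
# as suggested in the review comment above (not from vllm.utils).
from vllm.distributed.utils import stateless_init_torch_distributed_process_group
```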
Force-pushed ea92048 to f0a39a2
| return "vllm_ascend.compilation.piecewise_backend.NPUPiecewiseBackend" # noqa | ||
|
|
||
| @classmethod | ||
| def stateless_init_device_torch_dist_pg( |
where is this func called?
This function is called by vLLM itself. Please refer to this PR: vllm-project/vllm#18763
Makes sense, it's a change for the main branch.
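For readers following along: the hook is looked up on the active platform by vLLM's data-parallel engine when it builds the cross-DP process group. A rough sketch of the shape such an override takes is below; the class name, signature, and types are assumptions inferred from vllm-project/vllm#18763, not copied from this PR.

```python
# Hypothetical sketch of the platform-level process-group hook; the exact
# signature is an assumption based on vllm-project/vllm#18763.
from datetime import timedelta

from torch.distributed import PrefixStore, ProcessGroup


class NPUPlatformSketch:

    @classmethod
    def stateless_init_device_torch_dist_pg(
        cls,
        backend: str,
        prefix_store: PrefixStore,
        group_rank: int,
        group_size: int,
        timeout: timedelta,
    ) -> ProcessGroup:
        # vLLM calls this when it needs a device-side process group that is
        # not registered in the global torch.distributed state (e.g. the
        # group spanning data-parallel ranks). A real implementation would
        # wire up an HCCL backend here; this sketch only shows the shape.
        raise NotImplementedError("illustrative sketch only")
```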
```diff
@@ -0,0 +1,116 @@
+import torch
```
This file should be imported in patch_0_9_0/__init__.py.
Fixed.
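For context, wiring the new patch in is a one-line side-effect import in that package's `__init__.py`; a minimal sketch follows (the module name is hypothetical, since the thread does not show the new file's name):

```python
# patch_0_9_0/__init__.py
# Importing the module applies its monkey patches as a side effect.
# NOTE: "patch_data_parallel" is a hypothetical name for the new 116-line
# file added in this PR; the actual filename is not shown in the thread.
from . import patch_data_parallel  # noqa: F401
```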
…UWorker and DPEngineCoreProc
Co-authored-by: rjg-lyh <rjg-lyh@users.noreply.github.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```python
# NOTE(Yizhou): Since we do not set ASCEND_RT_VISIBLE_DEVICES in
# vllm_ascend, we need to set the device id manually.
local_dp_rank = self.vllm_config.parallel_config.data_parallel_rank_local
world_size = self.vllm_config.parallel_config.world_size
self.local_rank_across_dp = local_dp_rank * world_size + self.local_rank
```
I don't understand why you're doing this. vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly. @wangxiyuan
Correct; however, due to unresolved issues with this environment variable, it cannot be configured at runtime, which leads to multiple processes running on a single device.
You can revert this PR and run vllm/examples/offline_inference/data_parallel.py to reproduce the bug.
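To illustrate the workaround being discussed: with ASCEND_RT_VISIBLE_DEVICES left unset, every data-parallel rank sees all NPUs, so each worker must bind itself to its own card using the flat index computed in the snippet above. Below is a minimal sketch of that idea, not the PR's exact code; it assumes torch_npu is installed and uses example rank values.

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu backend

# Example values standing in for the config fields used in the PR snippet:
local_dp_rank = 1   # parallel_config.data_parallel_rank_local
world_size = 2      # parallel_config.world_size (ranks per DP replica)
local_rank = 0      # this worker's rank within its DP replica

# Each DP replica owns a contiguous block of `world_size` NPUs, so the flat
# device index is dp_rank * world_size + local_rank.
local_rank_across_dp = local_dp_rank * world_size + local_rank
torch.npu.set_device(torch.device(f"npu:{local_rank_across_dp}"))
```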
It's better to file an issue to record this. I remember a newer CANN and PyTorch version is required?
Same problem when using Ray with DP > 1 and TP > 1 (TP = 1 is fine): vLLM sets ASCEND_RT_VISIBLE_DEVICES correctly, but processes with TP rank > 1 run on the wrong device.
### What this PR does / why we need it?

With this PR, we can migrate to the native `data_parallel.py` in vllm examples and remove the version in vllm-ascend.

At present, `ASCEND_RT_VISIBLE_DEVICES` introduces considerable difficulties; therefore, we must employ a temporary workaround and manually specify the device.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

### Does this PR introduce any user-facing change?

### How was this patch tested?