[core][refactor] Move accelerator-specific environment variables to ray_constants.py to avoid redefining them #51026
Conversation
```python
TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
NPU_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"
```
This is the same as `ASCEND_RT_VISIBLE_DEVICES_ENV_VAR`.
Hm, I think it's preferable to keep accelerator-specific logic isolated in the relevant accelerator files. If we want to avoid redefining commonly-used ones, we can import them into `ray_constants.py`. @jjyao WDYT?
Note that some environment variables are used not only by the accelerator modules.
In my opinion, it's better not to import any Ray-related dependencies into `ray_constants.py`.
Another way is to create a dedicated module for these accelerator env var constants.
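To make the two directions concrete, here is a hypothetical sketch (the module path and the `accelerator_env_vars.py` file name are illustrative assumptions, not the actual layout):

```python
# Option A: re-export in ray_constants.py by importing from the accelerator
# module -- the direction argued against above, since ray_constants.py would
# then depend on other Ray modules:
from ray._private.accelerators.nvidia_gpu import CUDA_VISIBLE_DEVICES_ENV_VAR  # noqa: F401

# Option B: a dedicated, dependency-free constants module (hypothetical file
# accelerator_env_vars.py) that both ray_constants.py and the accelerator
# files could import from:
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
ASCEND_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"
```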
```diff
@@ -55,7 +55,7 @@ def get_devices(self) -> List[torch.device]:
         if len(npu_ids) > 0:
             npu_visible_str = os.environ.get(
-                ray_constants.NPU_RT_VISIBLE_DEVICES_ENV_VAR, ""
+                ray_constants.ASCEND_RT_VISIBLE_DEVICES_ENV_VAR, ""
```
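For context, a hedged sketch of how a comma-separated visible-devices string like this is typically parsed (`visible_device_indices` is a hypothetical helper, not the actual `get_devices` implementation):

```python
import os

ASCEND_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"

def visible_device_indices(env_var: str = ASCEND_RT_VISIBLE_DEVICES_ENV_VAR) -> list:
    # "0,2,3" -> [0, 2, 3]; an unset or empty variable yields an empty list here.
    raw = os.environ.get(env_var, "")
    return [int(part) for part in raw.split(",") if part.strip()]

os.environ[ASCEND_RT_VISIBLE_DEVICES_ENV_VAR] = "0,2"
assert visible_device_indices() == [0, 2]
```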
To avoid duplication, we can remove these from `ray_constants.py` and just import the relevant accelerator file to get the corresponding constant?
This also makes sense. @edoakes does this make sense to you?
yea this sounds better to me
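A minimal sketch of the agreed direction (the `nvidia_gpu` module path is an assumption based on the files named in the PR description; the consumer import is shown for illustration):

```python
# The accelerator module stays the single source of truth:
# ray/_private/accelerators/nvidia_gpu.py
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"

# Consumers import the constant from there instead of from ray_constants.py:
from ray._private.accelerators.nvidia_gpu import CUDA_VISIBLE_DEVICES_ENV_VAR
```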
Force-pushed 3179e4c to ad1e135. Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Nice! 🚀
```diff
@@ -63,7 +64,7 @@ def _share_cuda_visible_devices(worker_group: WorkerGroup):
     - Worker2: "0,1"
     """
     _share_accelerator_ids(
-        worker_group, ray_constants.GPU, ray_constants.CUDA_VISIBLE_DEVICES_ENV_VAR
+        worker_group, ray_constants.GPU, CUDA_VISIBLE_DEVICES_ENV_VAR
```
@justinvyu PTAL for codeowner approval, and also can you help me understand why we depend on this env var directly?
We basically need to overwrite the Ray Core behavior of restricting `CUDA_VISIBLE_DEVICES=[ray_gpu_ids]` (where `ray_gpu_ids` is the list of devices assigned to the actor) and instead set the env var to the set of ALL devices on a node that are used by training workers in the group. NCCL needs this in order to do cross-device communication during training.
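In other words, roughly the following (a hypothetical, self-contained sketch of the described behavior, not the actual `_share_accelerator_ids` implementation):

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FakeWorker:
    node_id: str
    device_ids: List[int]  # devices Ray assigned to this worker
    env: Dict[str, str] = field(default_factory=dict)

def share_visible_devices(workers: List[FakeWorker],
                          env_var: str = "CUDA_VISIBLE_DEVICES") -> None:
    # Union the device IDs of all workers colocated on the same node...
    per_node = defaultdict(set)
    for w in workers:
        per_node[w.node_id].update(w.device_ids)
    # ...then expose the full union to every worker on that node, so NCCL
    # can open channels between all local devices, not just the worker's own.
    for w in workers:
        w.env[env_var] = ",".join(str(d) for d in sorted(per_node[w.node_id]))

workers = [FakeWorker("node-1", [0]), FakeWorker("node-1", [1])]
share_visible_devices(workers)
assert workers[0].env["CUDA_VISIBLE_DEVICES"] == "0,1"  # matches the docstring example
```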
Thanks!
Why are these changes needed?

Env vars like `CUDA_VISIBLE_DEVICES_ENV_VAR` are defined in both `ray_constants.py` and `nvidia_gpu.py`. This PR moves the related env vars to `ray_constants.py` to avoid redefining them.

Related issue number
Checks

- I've signed off every commit (by using the option `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I'm introducing a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.