
[core][refactor] Move accelerator-specific environment variables to ray_constants.py to avoid redefining them #51026

Merged
merged 6 commits into ray-project:master on Mar 11, 2025

Conversation

@kevin85421 (Member) commented Mar 3, 2025

Why are these changes needed?

Env vars like CUDA_VISIBLE_DEVICES_ENV_VAR are defined in both ray_constants.py and nvidia_gpu.py. This PR moves related env vars to ray_constants.py to avoid redefining them.
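For context, a minimal sketch of the duplication described above (file paths are approximate and the contents are abbreviated; the real modules define many more constants):

```python
# python/ray/_private/ray_constants.py (before this PR)
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"

# python/ray/_private/accelerators/nvidia_gpu.py (before this PR)
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
```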

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
NPU_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"
@kevin85421 (Member, Author):

This is the same as ASCEND_RT_VISIBLE_DEVICES_ENV_VAR.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 3, 2025
@kevin85421 kevin85421 marked this pull request as ready for review March 3, 2025 09:40
@edoakes (Collaborator) commented Mar 3, 2025

Hm, I think it's preferable to keep accelerator-specific logic isolated in the relevant accelerators/ file to avoid polluting the top-level Ray core namespace.

If we want to avoid redefining commonly-used ones, we can import them into ray_constants as well.

@jjyao WDYT?

@kevin85421 (Member, Author) commented Mar 3, 2025

> Hm, I think it's preferable to keep accelerator-specific logic isolated in the relevant accelerators/ file to avoid polluting the top-level Ray core namespace.

Note that some environment variables are used not only by the accelerator modules (e.g., nvidia_gpu.py) but also by other files, which import them from ray_constants.py. For example, search for CUDA_VISIBLE_DEVICES_ENV_VAR.

> If we want to avoid redefining commonly-used ones, we can import them into ray_constants as well.

In my opinion, it's better not to import any Ray-related dependencies into ray_constants.py to avoid potential circular dependency issues in the future. We should aim to keep it as a top-level Python module in Ray.

@kevin85421 (Member, Author) commented:

Another option is to create accelerator_constants.py, which wouldn't import any other Ray Python modules. All files that want to use these env vars would need to import it.
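A rough sketch of that alternative (the accelerator_constants.py module and its exact contents are hypothetical, drawn only from the suggestion above):

```python
# accelerator_constants.py (hypothetical): a leaf module with no Ray imports,
# so it can never be part of an import cycle.
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
ASCEND_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"

# ray_constants.py, nvidia_gpu.py, tpu.py, npu.py, etc. would all import
# from this one module instead of redefining the strings.
```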

@@ -55,7 +55,7 @@ def get_devices(self) -> List[torch.device]:

     if len(npu_ids) > 0:
         npu_visible_str = os.environ.get(
-            ray_constants.NPU_RT_VISIBLE_DEVICES_ENV_VAR, ""
+            ray_constants.ASCEND_RT_VISIBLE_DEVICES_ENV_VAR, ""
Collaborator:

To avoid duplication, we can remove it from ray_constants.py and just import the relevant accelerator file to get the corresponding constant?

@kevin85421 (Member, Author):

This also makes sense. @edoakes, does this make sense to you?

Collaborator:

Yeah, this sounds better to me.
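A minimal sketch of the approach agreed on here: the accelerator module stays the single owner of the constant, and consumers import it directly instead of going through ray_constants.py (module path is approximate):

```python
import os

# The accelerator module remains the single source of truth for the env var
# name; other code imports it directly rather than from ray_constants.py.
from ray._private.accelerators.nvidia_gpu import CUDA_VISIBLE_DEVICES_ENV_VAR

# Example consumer: read the GPU ids made visible to the current worker.
visible_gpus = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, "")
print(visible_gpus)
```

This keeps ray_constants.py free of accelerator-specific names while still avoiding duplicate string literals.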

@kevin85421 kevin85421 force-pushed the 20250216-devbox-2-tmux-7-ray4 branch from 3179e4c to ad1e135 Compare March 9, 2025 06:20
@kevin85421 kevin85421 requested a review from edoakes March 9, 2025 21:52
@edoakes (Collaborator) left a comment:

Nice! 🚀

@@ -63,7 +64,7 @@ def _share_cuda_visible_devices(worker_group: WorkerGroup):
     - Worker2: "0,1"
     """
     _share_accelerator_ids(
-        worker_group, ray_constants.GPU, ray_constants.CUDA_VISIBLE_DEVICES_ENV_VAR
+        worker_group, ray_constants.GPU, CUDA_VISIBLE_DEVICES_ENV_VAR
Collaborator:

@justinvyu PTAL for codeowner approval, and also can you help me understand why we depend on this env var directly?

@justinvyu (Contributor):

We basically need to override the Ray Core behavior of restricting CUDA_VISIBLE_DEVICES to ray_gpu_ids (the list of devices assigned to the actor) and instead set the env var to the set of ALL devices on the node that are used by training workers in the group. NCCL needs this in order to do cross-device communication during training.
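A simplified sketch of what that override amounts to (the Worker class and set_env_var API here are illustrative stand-ins, not Ray Train's actual worker-group implementation):

```python
from collections import defaultdict
from typing import Dict, List


class Worker:
    """Illustrative stand-in for a Ray Train worker handle."""

    def __init__(self, node_id: str, gpu_ids: List[int]):
        self.node_id = node_id
        self.gpu_ids = gpu_ids
        self.env: Dict[str, str] = {}

    def set_env_var(self, name: str, value: str) -> None:
        # In Ray Train this would be set in os.environ on the remote worker.
        self.env[name] = value


def share_cuda_visible_devices(workers: List[Worker]) -> None:
    """Set CUDA_VISIBLE_DEVICES on every worker to the union of all GPU ids
    used by workers co-located on the same node, so NCCL can communicate
    across devices owned by different workers during training."""
    gpu_ids_per_node = defaultdict(set)
    for w in workers:
        gpu_ids_per_node[w.node_id].update(w.gpu_ids)

    for w in workers:
        all_ids = sorted(gpu_ids_per_node[w.node_id])
        # Override Ray Core's default of exposing only the worker's own GPUs.
        w.set_env_var("CUDA_VISIBLE_DEVICES", ",".join(str(i) for i in all_ids))


# Two single-GPU workers on one node: both see "0,1" afterwards,
# matching the docstring example visible in the diff above.
w0, w1 = Worker("node-A", [0]), Worker("node-A", [1])
share_cuda_visible_devices([w0, w1])
print(w0.env["CUDA_VISIBLE_DEVICES"])  # -> "0,1"
```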

@edoakes edoakes enabled auto-merge (squash) March 10, 2025 13:01
@justinvyu (Contributor) left a comment:

Thanks!


@edoakes edoakes merged commit 75ce52a into ray-project:master Mar 11, 2025
6 checks passed