
[core][refactor] Move accelerator-specific environment variables to ray_constants.py to avoid redefining them #51026

Merged
merged 6 commits into ray-project:master on Mar 11, 2025

Conversation

@kevin85421 (Member) commented Mar 3, 2025

Why are these changes needed?

Env vars like CUDA_VISIBLE_DEVICES_ENV_VAR are defined in both ray_constants.py and nvidia_gpu.py. This PR moves related env vars to ray_constants.py to avoid redefining them.
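For context, a minimal sketch of the duplication described above (file paths are approximate and the contents are abbreviated; the real modules define many more constants):

```python
# python/ray/_private/ray_constants.py (before this PR)
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"

# python/ray/_private/accelerators/nvidia_gpu.py (before this PR)
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
```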

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
NPU_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"
@kevin85421 (Member, Author):

This is the same as ASCEND_RT_VISIBLE_DEVICES_ENV_VAR.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 3, 2025
@kevin85421 kevin85421 marked this pull request as ready for review March 3, 2025 09:40
@edoakes (Collaborator) commented Mar 3, 2025

Hm, I think it's preferable to keep accelerator-specific logic isolated in the relevant accelerators/ file to avoid polluting the top-level Ray core namespace.

If we want to avoid redefining commonly-used ones, we can import them into ray_constants as well.

@jjyao WDYT?

@kevin85421 (Member, Author) commented Mar 3, 2025

> Hm, I think it's preferable to keep accelerator-specific logic isolated in the relevant accelerators/ file to avoid polluting the top-level Ray core namespace.

Note that some environment variables are used not only by the accelerator modules (e.g., nvidia_gpu.py) but also by other files, which import them from ray_constants.py. For example, search for CUDA_VISIBLE_DEVICES_ENV_VAR.

> If we want to avoid redefining commonly-used ones, we can import them into ray_constants as well.

In my opinion, it's better not to import any Ray-related dependencies into ray_constants.py to avoid potential circular dependency issues in the future. We should aim to keep it as a top-level Python module in Ray.

@kevin85421 (Member, Author) commented:

Another option is to create accelerator_constants.py, which wouldn't import any other Ray Python modules. All files that want to use these env vars would need to import it.
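A rough sketch of that alternative (the accelerator_constants.py module and its exact contents are hypothetical, drawn only from the suggestion above):

```python
# accelerator_constants.py (hypothetical): a leaf module with no Ray imports,
# so it can never be part of an import cycle.
CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
TPU_VISIBLE_CHIPS_ENV_VAR = "TPU_VISIBLE_CHIPS"
ASCEND_RT_VISIBLE_DEVICES_ENV_VAR = "ASCEND_RT_VISIBLE_DEVICES"

# ray_constants.py, nvidia_gpu.py, tpu.py, npu.py, etc. would all import
# from this one module instead of redefining the strings.
```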

@@ -55,7 +55,7 @@ def get_devices(self) -> List[torch.device]:

     if len(npu_ids) > 0:
         npu_visible_str = os.environ.get(
-            ray_constants.NPU_RT_VISIBLE_DEVICES_ENV_VAR, ""
+            ray_constants.ASCEND_RT_VISIBLE_DEVICES_ENV_VAR, ""
Collaborator:

To avoid duplication, we can remove it from ray_constants.py and just import the relevant accelerator file to get the corresponding constant?

@kevin85421 (Member, Author):

This also makes sense. @edoakes, does this make sense to you?

Collaborator:

Yeah, this sounds better to me.
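A minimal sketch of the approach agreed on here: the accelerator module stays the single owner of the constant, and consumers import it directly instead of going through ray_constants.py (module path is approximate):

```python
import os

# The accelerator module remains the single source of truth for the env var
# name; other code imports it directly rather than from ray_constants.py.
from ray._private.accelerators.nvidia_gpu import CUDA_VISIBLE_DEVICES_ENV_VAR

# Example consumer: read the GPU ids made visible to the current worker.
visible_gpus = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, "")
print(visible_gpus)
```

This keeps ray_constants.py free of accelerator-specific names while still avoiding duplicate string literals.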

@kevin85421 kevin85421 force-pushed the 20250216-devbox-2-tmux-7-ray4 branch from 3179e4c to ad1e135 Compare March 9, 2025 06:20
@kevin85421 kevin85421 requested a review from edoakes March 9, 2025 21:52
@edoakes (Collaborator) left a comment:

Nice! 🚀

@@ -63,7 +64,7 @@ def _share_cuda_visible_devices(worker_group: WorkerGroup):
     - Worker2: "0,1"
     """
     _share_accelerator_ids(
-        worker_group, ray_constants.GPU, ray_constants.CUDA_VISIBLE_DEVICES_ENV_VAR
+        worker_group, ray_constants.GPU, CUDA_VISIBLE_DEVICES_ENV_VAR
Collaborator:

@justinvyu PTAL for codeowner approval, and also can you help me understand why we depend on this env var directly?

@justinvyu (Contributor):

We basically need to override the Ray Core behavior of restricting CUDA_VISIBLE_DEVICES to ray_gpu_ids (the list of devices assigned to the actor) and instead set the env var to the set of ALL devices on the node that are used by training workers in the group. NCCL needs this in order to do cross-device communication during training.
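A simplified sketch of what that override amounts to (the Worker class and set_env_var API here are illustrative stand-ins, not Ray Train's actual worker-group implementation):

```python
from collections import defaultdict
from typing import Dict, List


class Worker:
    """Illustrative stand-in for a Ray Train worker handle."""

    def __init__(self, node_id: str, gpu_ids: List[int]):
        self.node_id = node_id
        self.gpu_ids = gpu_ids
        self.env: Dict[str, str] = {}

    def set_env_var(self, name: str, value: str) -> None:
        # In Ray Train this would be set in os.environ on the remote worker.
        self.env[name] = value


def share_cuda_visible_devices(workers: List[Worker]) -> None:
    """Set CUDA_VISIBLE_DEVICES on every worker to the union of all GPU ids
    used by workers co-located on the same node, so NCCL can communicate
    across devices owned by different workers during training."""
    gpu_ids_per_node = defaultdict(set)
    for w in workers:
        gpu_ids_per_node[w.node_id].update(w.gpu_ids)

    for w in workers:
        all_ids = sorted(gpu_ids_per_node[w.node_id])
        # Override Ray Core's default of exposing only the worker's own GPUs.
        w.set_env_var("CUDA_VISIBLE_DEVICES", ",".join(str(i) for i in all_ids))


# Two single-GPU workers on one node: both see "0,1" afterwards,
# matching the docstring example visible in the diff above.
w0, w1 = Worker("node-A", [0]), Worker("node-A", [1])
share_cuda_visible_devices([w0, w1])
print(w0.env["CUDA_VISIBLE_DEVICES"])  # -> "0,1"
```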

@edoakes edoakes enabled auto-merge (squash) March 10, 2025 13:01
@justinvyu (Contributor) left a comment:

Thanks!


@edoakes edoakes merged commit 75ce52a into ray-project:master Mar 11, 2025
6 checks passed