@israbbani israbbani commented Oct 6, 2025

For more details about the resource isolation project see #54703.

This PR introduces two public Bazel targets from the `//src/ray/common/cgroup2` subsystem.

  • `CgroupManagerFactory` is a cross-platform target that exports a working `CgroupManager` on Linux when resource isolation is enabled. It exports a no-op implementation on non-Linux platforms, or on Linux when resource isolation is disabled.
  • `CgroupManagerInterface` is the public API of `CgroupManager`.
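The selective-compilation pattern behind these two targets can be sketched roughly as follows. This is a minimal, hypothetical sketch: the interface, class names, and factory signature below are simplified stand-ins, not the real `cgroup2` API.

```cpp
#include <memory>
#include <string>

// Hypothetical, much-reduced stand-in for the public CgroupManagerInterface.
class CgroupManagerInterface {
 public:
  virtual ~CgroupManagerInterface() = default;
  virtual bool IsNoop() const = 0;
};

// No-op implementation used on non-Linux platforms or when resource
// isolation is disabled.
class NoopCgroupManager : public CgroupManagerInterface {
 public:
  bool IsNoop() const override { return true; }
};

// Sketch of the factory: a single cross-platform entry point, with the
// Linux-only implementation selected at compile time.
std::unique_ptr<CgroupManagerInterface> CreateCgroupManager(
    bool enable_resource_isolation, const std::string &cgroup_path) {
#if defined(__linux__)
  if (enable_resource_isolation) {
    // The real factory would validate the config and construct a working
    // CgroupManager here; omitted in this sketch, so we fall through.
  }
#endif
  // Non-Linux platforms, or resource isolation disabled.
  return std::make_unique<NoopCgroupManager>();
}
```

Callers depend only on the interface target; whether they get a working manager or the no-op is decided inside the factory.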

It also introduces a few other changes:

  1. All resource-isolation-related configuration parsing and input validation has been moved into `CgroupManagerFactory`.
  2. `NodeManager` now controls the lifecycle (and destruction) of `CgroupManager`.
  3. `SysFsCgroupDriver` uses a Linux header file to find the path of the mount file instead of hardcoding it, because different Linux distributions can use different files.
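To illustrate item 3: on Linux, `<paths.h>` defines `_PATH_MOUNTED`, so the mount table can be scanned with `getmntent` instead of hardcoding `/etc/mtab` or `/proc/mounts`. The helper below is a hypothetical sketch, not the actual `SysFsCgroupDriver` code.

```cpp
#include <mntent.h>  // setmntent, getmntent, endmntent
#include <paths.h>   // _PATH_MOUNTED, instead of a hardcoded mount-file path
#include <cstdio>
#include <string>

// Hypothetical helper: return the mount point of the cgroup2 filesystem,
// or an empty string if none is found.
std::string FindCgroupV2MountPoint() {
  FILE *mounts = setmntent(_PATH_MOUNTED, "r");
  if (mounts == nullptr) {
    return "";
  }
  std::string result;
  // Walk the mount table looking for an entry of type "cgroup2".
  while (struct mntent *entry = getmntent(mounts)) {
    if (std::string(entry->mnt_type) == "cgroup2") {
      result = entry->mnt_dir;
      break;
    }
  }
  endmntent(mounts);
  return result;
}
```

On most modern distributions `_PATH_MOUNTED` resolves (directly or via symlink) to the kernel's view of mounts, which is why reading it beats baking in one distribution's file path.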

@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Oct 6, 2025
israbbani and others added 6 commits October 6, 2025 23:22
@israbbani israbbani changed the title [wip] [core] (cgroups 14/n) Clean up bazel targets and support cross-platform build. [core] (cgroups 14/n) Clean up bazel targets and support cross-platform build. Oct 7, 2025
israbbani and others added 4 commits October 7, 2025 04:32
@israbbani israbbani marked this pull request as ready for review October 7, 2025 18:58
@israbbani israbbani requested a review from a team as a code owner October 7, 2025 18:58
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 7, 2025
@edoakes edoakes left a comment

Looks great!

Comment on lines +23 to +24
// TODO(54703): Refactor the configs into a struct called CgroupManagerConfig
// and delegate input validation and error messages to it.
yeah!

Comment on lines +44 to +46
RAY_CHECK(!cgroup_path.empty())
<< "Failed to start CgroupManager. If enable_resource_isolation is set to true, "
"cgroup_path cannot be empty.";
in the general case, should we structure the factories so that they RAY_CHECK on initialization failure or return a status code? the paranoid/perfectionist version would be the latter. practically I'm not sure that it makes much of a difference.

Contributor Author

Capturing our discussion for posterity.

If there's a panic or a fatal error, I would keep the actual RAY_CHECK as close to the error as possible. You're not worried about clean up and you can dump as much context as possible.

I would return a Status if there's a chance that the caller can either recover or provide more useful information.
However, this has the unpleasant side-effect of burying landmines deep in the call stack, which makes me sad.

The antidote to this sadness is to have as few FATAL errors as possible and most failures should be recoverable.

Contributor Author

Invariant checking on component startup or detecting misconfigurations is a valid use of RAY_CHECK.
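For contrast with the RAY_CHECK approach, the Status-returning alternative discussed above might look like the sketch below. The `Status` type here is a hypothetical stand-in for `ray::Status`, and the factory signature is invented for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal Status type standing in for ray::Status.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status InvalidArgument(std::string msg) {
    return {false, std::move(msg)};
  }
};

struct CgroupManager {};  // placeholder for the real manager

// Status-returning variant of the factory: the caller decides whether a
// bad config is fatal, instead of the factory RAY_CHECK-ing internally.
Status CreateCgroupManagerOrError(bool enable_resource_isolation,
                                  const std::string &cgroup_path,
                                  std::unique_ptr<CgroupManager> *out) {
  if (enable_resource_isolation && cgroup_path.empty()) {
    return Status::InvalidArgument(
        "If enable_resource_isolation is set to true, cgroup_path cannot "
        "be empty.");
  }
  *out = std::make_unique<CgroupManager>();
  return Status::OK();
}
```

The trade-off is exactly the one captured in the discussion: the Status version gives the caller a chance to recover or add context, at the cost of error-handling paths buried in the call stack.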

edoakes commented Oct 7, 2025

minor comments; ping when ready to merge

israbbani and others added 2 commits October 7, 2025 15:19
israbbani commented Oct 7, 2025

Gonna trigger 1 more set of macOS and Windows tests. I'll let you know when they pass.

Contributor Author

@edoakes the macOS failure looks unrelated to the change or to core:

```
(11:07:33) INFO: Running command line: bazel-bin/ci/ray_ci/automation/test_db_bot core /tmp/bazel_event_logs
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/io_ray/ci/ray_ci/automation/test_db_bot.py", line 41, in <module>
    main()
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/py_deps_buildkite_click/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/py_deps_buildkite_click/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/py_deps_buildkite_click/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/py_deps_buildkite_click/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/io_ray/ci/ray_ci/automation/test_db_bot.py", line 36, in main
    TesterContainer.upload_test_results(team, bazel_log_dir)
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/io_ray/ci/ray_ci/tester_container.py", line 158, in upload_test_results
    for test, result in cls.get_test_and_results(team, bazel_log_dir):
  File "/private/var/tmp/_bazel_ec2-user/b625f2b2abd7a5bdbf5aca7c03f4a916/execroot/io_ray/bazel-out/darwin-opt/bin/ci/ray_ci/automation/test_db_bot.runfiles/io_ray/ci/ray_ci/tester_container.py", line 191, in get_test_and_results
    for file in listdir(bazel_log_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/bazel_event_logs'
```

@edoakes edoakes enabled auto-merge (squash) October 10, 2025 20:41
@edoakes edoakes merged commit 7164809 into master Oct 10, 2025
6 of 7 checks passed
@edoakes edoakes deleted the irabbani/cgroups-14 branch October 10, 2025 22:45
joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
edoakes added a commit that referenced this pull request Oct 13, 2025
…ses. (#57269)

This PR stacks on #57244.

For more details about the resource isolation project see
#54703.

In the previous ray cgroup hierarchy, all processes that were in the
path `--cgroup-path` were moved into the system cgroup. This changes the
hierarchy to now have a separate cgroup for all non-ray processes.

The new cgroup hierarchy looks like
```
      cgroup_path (e.g. /sys/fs/cgroup)
            |
    ray-node_<node_id>
    |                 |
  system             user
    |               |    |
  leaf        workers  non-ray
```

The cgroups contain the following processes:
* system/leaf (all ray non-worker processes, e.g. raylet,
runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path)

Note: If you're running ray inside a container, all non-ray processes
running in the container will be migrated to `user/non-ray`.

The following controllers will be enabled
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied
* system (cpu.weight, memory.min)
* user (cpu.weight)

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
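The hierarchy above can be sketched as plain path construction. This is only an illustration of the layout described in the PR; the helper name and node-id formatting are hypothetical, and the real code also enables controllers and applies the constraints listed above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: given the base cgroup path and a node id, build the
// leaf cgroup paths of the hierarchy described above.
std::vector<std::string> RayCgroupLeafPaths(const std::string &cgroup_path,
                                            const std::string &node_id) {
  const std::string base = cgroup_path + "/ray-node_" + node_id;
  return {
      base + "/system/leaf",   // raylet, gcs_server, runtime_env_agent, ...
      base + "/user/workers",  // ray worker processes
      base + "/user/non-ray",  // non-ray processes migrated from cgroup_path
  };
}
```

For example, with `cgroup_path=/sys/fs/cgroup` and a node id of `abc`, the worker cgroup would be `/sys/fs/cgroup/ray-node_abc/user/workers`.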
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Oct 13, 2025
harshit-anyscale pushed a commit that referenced this pull request Oct 15, 2025
harshit-anyscale pushed a commit that referenced this pull request Oct 15, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025