@israbbani israbbani commented Oct 7, 2025

This PR stacks on #57244.

For more details about the resource isolation project see #54703.

In the previous Ray cgroup hierarchy, all processes found in the cgroup at --cgroup-path were moved into the system cgroup. This PR changes the hierarchy so that non-Ray processes get a separate cgroup of their own.

The new cgroup hierarchy looks like this:

      cgroup_path (e.g. /sys/fs/cgroup)
            |
    ray-node_<node_id>
    |                 |
  system             user
    |               |    |
  leaf        workers  non-ray
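
As a rough illustration of the diagram above, here is a minimal Python sketch that derives the cgroup directories from a base path and a node id. The function name `ray_cgroup_layout` and the node id `abc123` are made up for this example and are not part of Ray's API; only the directory names come from the diagram.

```python
from pathlib import PurePosixPath


def ray_cgroup_layout(cgroup_path: str, node_id: str) -> dict[str, PurePosixPath]:
    """Return the cgroup directories implied by the diagram (hypothetical helper)."""
    node = PurePosixPath(cgroup_path) / f"ray-node_{node_id}"
    system = node / "system"
    user = node / "user"
    return {
        "node": node,
        "system": system,
        "system_leaf": system / "leaf",
        "user": user,
        "workers": user / "workers",
        "non_ray": user / "non-ray",
    }


# Example values only: "abc123" is a placeholder node id.
layout = ray_cgroup_layout("/sys/fs/cgroup", "abc123")
for name, path in layout.items():
    print(f"{name:12} {path}")
```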

The cgroups contain the following processes:

  • system/leaf (all Ray non-worker processes, e.g. raylet, runtime_env_agent, gcs_server, ...)
  • user/workers (all Ray worker processes)
  • user/non-ray (all non-Ray processes migrated from cgroup_path)

Note: If you're running Ray inside a container, all non-Ray processes running in the container will be migrated to user/non-ray.
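
To make the migration step concrete: in cgroup v2, a process is moved by writing its PID to the destination's `cgroup.procs` file, one PID per write. The sketch below imitates that with ordinary files in a temporary directory so it can run anywhere. The helper names (`read_pids`, `migrate_all`, `append_pid`) and the PIDs are hypothetical, and this is not how Ray's C++ code is structured.

```python
import tempfile
from pathlib import Path
from typing import Callable


def read_pids(cgroup_dir: Path) -> list[int]:
    """Parse cgroup.procs, which lists one PID per line."""
    return [int(tok) for tok in (cgroup_dir / "cgroup.procs").read_text().split()]


def migrate_all(src: Path, dst: Path, write_pid: Callable[[Path, int], None]) -> list[int]:
    """Write every PID listed in src into dst's cgroup.procs, one write per PID."""
    moved = []
    for pid in read_pids(src):
        write_pid(dst / "cgroup.procs", pid)
        moved.append(pid)
    return moved


def append_pid(procs_file: Path, pid: int) -> None:
    # Stand-in writer for plain files. On a real cgroup v2 mount the kernel
    # moves the process on each write and removes it from the source cgroup;
    # plain files do not do that, so src keeps its contents in this demo.
    with procs_file.open("a") as f:
        f.write(f"{pid}\n")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)  # plays the role of cgroup_path
    dst = src / "ray-node_demo" / "user" / "non-ray"
    dst.mkdir(parents=True)
    (src / "cgroup.procs").write_text("101\n102\n")  # made-up PIDs
    moved = migrate_all(src, dst, append_pid)
    migrated = read_pids(dst)

print(moved, migrated)
```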

The following controllers will be enabled:

  • cgroup_path (cpu, memory)
  • ray-node_<node_id> (cpu, memory)
  • system (memory)
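
In cgroup v2, enabling a controller on a cgroup means writing `+<controller>` to that cgroup's `cgroup.subtree_control`, and a controller can only be enabled there once the parent has enabled it, so the writes have to go top-down. A small sketch of the list above, assuming that is what "enabled" refers to here (the names `ENABLED_CONTROLLERS` and `subtree_control_writes` are hypothetical):

```python
# Top-down order matters: a child can only enable what its parent passes down.
ENABLED_CONTROLLERS = {
    "cgroup_path": ["cpu", "memory"],
    "ray-node_<node_id>": ["cpu", "memory"],
    "system": ["memory"],
}


def subtree_control_writes(enabled: dict[str, list[str]]) -> dict[str, str]:
    """Map each cgroup to the string written to its cgroup.subtree_control."""
    return {
        cgroup: " ".join(f"+{controller}" for controller in controllers)
        for cgroup, controllers in enabled.items()
    }


writes = subtree_control_writes(ENABLED_CONTROLLERS)
for cgroup, value in writes.items():
    print(f"{cgroup}/cgroup.subtree_control <- {value!r}")
```

Note that cgroup v2 does not allow a non-root cgroup to both hold processes and enable domain controllers such as memory for its children, which is presumably why the system processes live in system/leaf rather than in system itself.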

The following constraints are applied:

  • system (cpu.weight, memory.min)
  • user (cpu.weight)

israbbani and others added 13 commits September 30, 2025 17:17
CgroupManagerFactory which constructs a cross-platform cgroup manager
with selective compilation

Signed-off-by: irabbani <israbbani@gmail.com>
and CgroupManagerFactory are the only public targets.
CgroupManagerFactory will delegate to the appropriate implementation for
each platform.

Signed-off-by: irabbani <israbbani@gmail.com>
build. The Cgroup subsystem only exposes CgroupManagerInterface and
CgroupManagerFactory as public targets.

Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14

Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14

Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: israbbani <israbbani@gmail.com>
subtrees:
- the system cgroup has all ray system processes.
- the workers cgroup has all ray worker processes.
- the user cgroup has all other non-ray processes on the system (usually
  used with containers).

Updated the integration tests.

Signed-off-by: irabbani <israbbani@gmail.com>
@israbbani israbbani added the go (add ONLY when ready to merge, run all tests) label Oct 7, 2025
@israbbani israbbani changed the base branch from master to irabbani/cgroups-14 October 7, 2025 17:47
israbbani and others added 8 commits October 7, 2025 11:59
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14

Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: israbbani <israbbani@gmail.com>
Base automatically changed from irabbani/cgroups-14 to master October 10, 2025 22:45
israbbani and others added 5 commits October 10, 2025 23:35
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
… irabbani/cgroups-15

Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
israbbani and others added 3 commits October 10, 2025 23:50
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
@israbbani israbbani marked this pull request as ready for review October 11, 2025 15:39
@israbbani israbbani requested a review from a team as a code owner October 11, 2025 15:39
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) label Oct 11, 2025
@edoakes edoakes merged commit 798b85a into master Oct 13, 2025
6 checks passed
@edoakes edoakes deleted the irabbani/cgroups-15 branch October 13, 2025 13:29
harshit-anyscale pushed a commit that referenced this pull request Oct 15, 2025
…ses. (#57269)

edoakes pushed a commit that referenced this pull request Oct 15, 2025
…int (#57731)

In #57269, I introduced a bug:
* The `cpu` controller is not enabled on the system and user cgroups
because it only needs to be enabled on their parent.
* However, `CgroupDriver::AddConstraint` checked, for every constraint,
that the matching controller was enabled on the cgroup itself.

This was not caught in CI because the Python integration tests were not
running. For the time being, I've removed the manual tag and excluded
`cgroup` from all other Python tests.

We need to fix this properly: test targets shouldn't have to exclude
other tests, they should include the right ones.
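
To illustrate the distinction behind the bug (this is not the actual change in this commit, only a sketch): in cgroup v2, `cgroup.controllers` lists the controllers available in a cgroup, i.e. the ones its parent enabled, while `cgroup.subtree_control` lists the ones the cgroup itself passes on to its children. An interface file such as `cpu.weight` only needs the former. The helper names and the temporary-directory layout below are hypothetical stand-ins for a real cgroup mount.

```python
import tempfile
from pathlib import Path


def read_set(path: Path) -> set[str]:
    """Read a space-separated controller list into a set."""
    return set(path.read_text().split())


def enabled_for_children(cgroup: Path, controller: str) -> bool:
    """True if the cgroup itself enables the controller for its children."""
    return controller in read_set(cgroup / "cgroup.subtree_control")


def available(cgroup: Path, controller: str) -> bool:
    """True if the parent enabled the controller, so its files exist here."""
    return controller in read_set(cgroup / "cgroup.controllers")


with tempfile.TemporaryDirectory() as tmp:
    node = Path(tmp) / "ray-node_demo"
    system = node / "system"
    system.mkdir(parents=True)
    # The parent enables cpu and memory for its children...
    (node / "cgroup.subtree_control").write_text("cpu memory\n")
    # ...so both are available in system, which only passes memory further down.
    (system / "cgroup.controllers").write_text("cpu memory\n")
    (system / "cgroup.subtree_control").write_text("memory\n")

    strict_check = enabled_for_children(system, "cpu")  # rejects cpu.weight
    relaxed_check = available(system, "cpu")            # accepts cpu.weight

print(strict_check, relaxed_check)
```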

---------

Signed-off-by: irabbani <israbbani@gmail.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ses. (ray-project#57269)

justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…int (ray-project#57731)

xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…ses. (ray-project#57269)

xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…int (ray-project#57731)

elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…ses. (#57269)

elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…int (#57731)

landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ses. (ray-project#57269)

landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…int (ray-project#57731)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ses. (ray-project#57269)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…int (ray-project#57731)
