[core] (cgroups 15/n) Adding a user cgroup subtree for non-ray processes. #57269
Merged
CgroupManagerFactory which constructs a cross-platform cgroup manager with selective compilation Signed-off-by: irabbani <israbbani@gmail.com>
and CgroupManagerFactory are the only public targets. CgroupManagerFactory will delegate to the appropriate implementation for each platform. Signed-off-by: irabbani <israbbani@gmail.com>
build. The Cgroup subsystem only exposes CgroupManagerInterface and CgroupManagerFactory as public targets. Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14 Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14 Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: israbbani <israbbani@gmail.com>
subtrees: - the system cgroup has all ray system processes. - the workers cgroup has all ray worker processes. - the user cgroup has all other non-ray processes on the system (usually used with containers). Updated the integration tests. Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
… irabbani/cgroups-14 Signed-off-by: irabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
… irabbani/cgroups-15 Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
Signed-off-by: irabbani <israbbani@gmail.com>
edoakes
approved these changes
Oct 13, 2025
harshit-anyscale
pushed a commit
that referenced
this pull request
Oct 15, 2025
…ses. (#57269)

This PR stacks on #57244.

For more details about the resource isolation project see #54703.

In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes.

The new cgroup hierarchy looks like

```
        cgroup_path (e.g. /sys/fs/cgroup)
                    |
            ray-node_<node_id>
            |                |
          system            user
            |            |        |
          leaf        workers  non-ray
```

The cgroups contain the following processes
* system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path).

Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray`

The following controllers will be enabled
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied
* system (cpu.weight, memory.min)
* user (cpu.weight)

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
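The layout in this commit message can also be written down as data, which makes the paths, controllers, and constraints easier to check at a glance. What follows is a minimal Python sketch and not code from this PR: `CgroupNode`, `ray_cgroup_layout`, the dictionary keys, and the example node id are hypothetical names chosen for illustration. Only the directory names, controllers, and constraints are taken from the description.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Tuple


@dataclass(frozen=True)
class CgroupNode:
    """One cgroup from the hierarchy described in the commit message."""

    path: PurePosixPath
    controllers: Tuple[str, ...] = ()  # controllers listed as enabled on this cgroup
    constraints: Tuple[str, ...] = ()  # constraint files listed as applied here
    holds: str = ""                    # processes that live here (leaf cgroups only)


def ray_cgroup_layout(cgroup_path: str, node_id: str) -> Dict[str, CgroupNode]:
    """Return the described hierarchy as a mapping (hypothetical helper)."""
    base = PurePosixPath(cgroup_path)
    node = base / f"ray-node_{node_id}"
    system = node / "system"
    user = node / "user"
    return {
        "cgroup_path": CgroupNode(base, controllers=("cpu", "memory")),
        "ray-node": CgroupNode(node, controllers=("cpu", "memory")),
        "system": CgroupNode(
            system,
            controllers=("memory",),
            constraints=("cpu.weight", "memory.min"),
        ),
        "user": CgroupNode(user, constraints=("cpu.weight",)),
        "system/leaf": CgroupNode(system / "leaf", holds="ray non-worker processes"),
        "user/workers": CgroupNode(user / "workers", holds="ray worker processes"),
        "user/non-ray": CgroupNode(user / "non-ray", holds="non-ray processes"),
    }


if __name__ == "__main__":
    # "abc123" is a made-up node id used only for this example.
    for name, cg in ray_cgroup_layout("/sys/fs/cgroup", "abc123").items():
        print(f"{name:13} {cg.path}  controllers={cg.controllers} constraints={cg.constraints}")
```

Keeping the three process-holding cgroups (`system/leaf`, `user/workers`, `user/non-ray`) as separate entries mirrors the description, where processes are only listed for the bottom level of the tree.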
edoakes
pushed a commit
that referenced
this pull request
Oct 15, 2025
…int (#57731)

In #57269, I introduced a bug:
* the `cpu` controller is not enabled on the system and user cgroups because it only needs to be enabled on the parent.
* however, CgroupDriver::AddConstraint checked that for every constraint, the matching controller was enabled.

This was not caught in CI because the python integration tests were not running. For the time being, I've removed the manual tag and excluded `cgroup` from all other python tests. We need to fix this properly because tests shouldn't exclude other tests but rather include the right targets.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
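The mismatch described in this commit message follows from how cgroup v2 exposes controllers: a cgroup's interface files such as `cpu.weight` are present when the controller is listed in the parent's `cgroup.subtree_control`, not in the cgroup's own. The Python sketch below is one reading of the bug using an in-memory model. It is not Ray's `CgroupDriver` code, and the class and function names (`Cgroup`, `check_on_cgroup_itself`, `check_via_parent`, `build_pr_hierarchy`) are hypothetical. The assumption that the root offers exactly `cpu` and `memory` is also mine.

```python
from typing import Dict, Optional, Set

# Assumption for this sketch: the top-level cgroup offers these controllers.
ROOT_CONTROLLERS = frozenset({"cpu", "memory"})


class Cgroup:
    """In-memory stand-in for a cgroup v2 directory (not real cgroupfs)."""

    def __init__(self, name: str, parent: Optional["Cgroup"] = None) -> None:
        self.name = name
        self.parent = parent
        self.subtree_control: Set[str] = set()  # models cgroup.subtree_control

    def available(self) -> Set[str]:
        """Models cgroup.controllers: what the parent delegates to this cgroup."""
        if self.parent is None:
            return set(ROOT_CONTROLLERS)
        return set(self.parent.subtree_control)

    def enable(self, *controllers: str) -> None:
        """Enable controllers for this cgroup's children."""
        missing = set(controllers) - self.available()
        if missing:
            raise ValueError(f"{self.name}: not available: {sorted(missing)}")
        self.subtree_control.update(controllers)


def controller_of(constraint: str) -> str:
    """'cpu.weight' -> 'cpu', 'memory.min' -> 'memory'."""
    return constraint.split(".", 1)[0]


def check_on_cgroup_itself(cgroup: Cgroup, constraint: str) -> bool:
    """The overly strict style of check: look at the cgroup's own subtree_control."""
    return controller_of(constraint) in cgroup.subtree_control


def check_via_parent(cgroup: Cgroup, constraint: str) -> bool:
    """A check consistent with cgroup v2: the parent must have enabled the controller."""
    return controller_of(constraint) in cgroup.available()


def build_pr_hierarchy() -> Dict[str, Cgroup]:
    """Enable controllers as listed in #57269."""
    base = Cgroup("cgroup_path")
    base.enable("cpu", "memory")
    node = Cgroup("ray-node", base)
    node.enable("cpu", "memory")
    system = Cgroup("system", node)
    system.enable("memory")
    user = Cgroup("user", node)
    return {"system": system, "user": user}
```

With this model, `check_on_cgroup_itself(system, "cpu.weight")` is false even though `cpu.weight` can be set on `system`, while `check_via_parent` accepts it because `ray-node_<node_id>` enables `cpu`.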
justinyeh1995
pushed a commit
to justinyeh1995/ray
that referenced
this pull request
Oct 20, 2025
…ses. (ray-project#57269)
justinyeh1995
pushed a commit
to justinyeh1995/ray
that referenced
this pull request
Oct 20, 2025
…int (ray-project#57731)
xinyuangui2
pushed a commit
to xinyuangui2/ray
that referenced
this pull request
Oct 22, 2025
…ses. (ray-project#57269) Signed-off-by: xgui <xgui@anyscale.com>
xinyuangui2
pushed a commit
to xinyuangui2/ray
that referenced
this pull request
Oct 22, 2025
…int (ray-project#57731) Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn
pushed a commit
that referenced
this pull request
Oct 23, 2025
…ses. (#57269) Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn
pushed a commit
that referenced
this pull request
Oct 23, 2025
…int (#57731) Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter
pushed a commit
to landscapepainter/ray
that referenced
this pull request
Nov 17, 2025
…ses. (ray-project#57269)
landscapepainter
pushed a commit
to landscapepainter/ray
that referenced
this pull request
Nov 17, 2025
…int (ray-project#57731)
Aydin-ab
pushed a commit
to Aydin-ab/ray-aydin
that referenced
this pull request
Nov 19, 2025
…ses. (ray-project#57269) Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Aydin-ab
pushed a commit
to Aydin-ab/ray-aydin
that referenced
this pull request
Nov 19, 2025
…int (ray-project#57731) Signed-off-by: Aydin Abiar <aydin@anyscale.com>
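This PR stacks on #57244.
For more details about the resource isolation project see #54703.
In the previous ray cgroup hierarchy, all processes that were in the path `--cgroup-path` were moved into the system cgroup. This changes the hierarchy to now have a separate cgroup for all non-ray processes.
The new cgroup hierarchy looks like
* cgroup_path (e.g. /sys/fs/cgroup)
  * ray-node_<node_id>
    * system
      * leaf
    * user
      * workers
      * non-ray

The cgroups contain the following processes
* system/leaf (all ray non-worker processes e.g. raylet, runtime_env_agent, gcs_server, ...)
* user/workers (all ray worker processes)
* user/non-ray (all non-ray processes migrated from cgroup_path).

Note: If you're running ray inside a container, all non-ray processes running in the container will be migrated to `user/non-ray`.
The following controllers will be enabled
* cgroup_path (cpu, memory)
* ray-node_<node_id> (cpu, memory)
* system (memory)

The following constraints are applied
* system (cpu.weight, memory.min)
* user (cpu.weight)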
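The migration of non-ray processes mentioned in this description relies on a standard cgroup v2 mechanism: reading PIDs from one cgroup's `cgroup.procs` and writing each PID, one per write, to the destination's `cgroup.procs`. The Python sketch below illustrates that mechanism only. It is not the implementation in this PR, and `read_pids`, `write_pid`, and `migrate_all` are hypothetical helper names. Real use needs a cgroup v2 mount and sufficient permissions.

```python
from pathlib import Path
from typing import Callable, List, Union

PathLike = Union[str, Path]


def read_pids(cgroup_dir: PathLike) -> List[int]:
    """Read the PIDs currently listed in <cgroup_dir>/cgroup.procs."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return [int(token) for token in text.split()]


def write_pid(cgroup_dir: PathLike, pid: int) -> None:
    """Move one process by writing its PID to <cgroup_dir>/cgroup.procs."""
    with open(Path(cgroup_dir) / "cgroup.procs", "w") as f:
        f.write(str(pid))


def migrate_all(
    src_dir: PathLike,
    dst_dir: PathLike,
    writer: Callable[[PathLike, int], None] = write_pid,
) -> List[int]:
    """Move every process in src_dir to dst_dir and return the PIDs that moved."""
    moved = []
    for pid in read_pids(src_dir):
        try:
            writer(dst_dir, pid)
        except ProcessLookupError:
            # The process exited between the read and the write; skip it.
            continue
        moved.append(pid)
    return moved
```

In the hierarchy from this PR, `src_dir` would correspond to `cgroup_path` and `dst_dir` to `ray-node_<node_id>/user/non-ray`. The `writer` parameter exists only so the loop can be exercised without a real cgroup filesystem.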