@israbbani israbbani commented Jul 30, 2025

This PR adds integration tests for the SysFsCgroupDriver.
This PR stacks on [#54898]. For more details about the resource isolation project see #54703.

The design goals of the tests are:

  1. they are fast and repeatable (i.e. not flaky)
  2. they don't leak any state (cgroups, directories, processes)
  3. they can be run inside a VM (for local development) or in a container (with escalated privileges in CI)

The tests make the following assumptions:

  1. cgroup v2 is mounted in unified mode, i.e. cgroup v1 is disabled.
  2. The user that starts the entrypoint script sysfs_cgroup_driver_integration_test_entrypoint.sh has read-write access to the ROOT_CGROUP (which is defined by a variable inside sysfs_cgroup_driver_integration_test_entrypoint.sh).
  3. The user has permissions to create a new user (called cgroup-tester) so that tests can run with different privilege levels. This is useful when, for example, you run the setup/teardown as the root user but want to test code paths that check permissions and need an unprivileged user to exercise the unhappy path (i.e. the user does not have permissions on a cgroup and a write fails).

The tests create a scoped cgroup hierarchy to simplify cleanup and isolate test runs from each other. It looks like this:

#                        ROOT_CGROUP
#                             |
#                        BASE_CGROUP
#                       /           \
#                 TEST_CGROUP   LEAF_CGROUP

Where

  • ROOT_CGROUP is provided by the environment running the test (for CI it's /sys/fs/cgroup)
  • BASE_CGROUP is a random cgroup created for the test suite (e.g. /sys/fs/cgroup/testing.axUYw)
  • TEST_CGROUP is the cgroup used by the integration tests
  • LEAF_CGROUP receives all processes migrated out of the ROOT_CGROUP (if it has any), so that the ROOT_CGROUP can have child cgroups and still satisfy the no internal processes constraint.
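
The migration into LEAF_CGROUP can be sketched roughly as follows. This is a minimal Python illustration, not the shell or C++ code from this PR, and `move_all_procs` is a hypothetical name. It assumes the documented cgroup v2 behaviour that a single write to `cgroup.procs` migrates one process, so each PID is written separately.

```python
from pathlib import Path


def move_all_procs(src_cgroup: Path, dst_cgroup: Path) -> int:
    """Write every PID listed in src_cgroup/cgroup.procs to dst_cgroup.

    Hypothetical helper for illustration only. Returns the number of PIDs
    written to the destination.
    """
    pids = (src_cgroup / "cgroup.procs").read_text().split()
    for pid in pids:
        # cgroup v2 migrates one process per write, so each PID gets its own
        # open/write/close cycle instead of one bulk write.
        with open(dst_cgroup / "cgroup.procs", "a") as procs:
            procs.write(pid + "\n")
    return len(pids)
```

On a real cgroup filesystem the kernel removes each PID from the source as it is migrated. The sketch does not handle processes that exit or fail to migrate in the middle of the loop.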

The tests follow a peculiar CI pattern:

  • The entrypoint is triggered directly from the Buildkite CI script (instead of through Bazel). This is because if you run the entrypoint as a Bazel sh_test, no other user has access to the Bazel workspace, so we cannot run the setup/teardown as a privileged user and the C++ tests as an unprivileged user.
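
The overall shape of that entrypoint (setup, then tests, then teardown on every path) can be sketched like this. It is an illustrative Python outline only: the real entrypoint is a shell script, `run_with_fixture` and its parameters are hypothetical names, and the switch between the privileged and unprivileged user is not modelled here.

```python
from typing import Callable


def run_with_fixture(
    setup: Callable[[], None],
    run_tests: Callable[[], int],
    teardown: Callable[[], None],
) -> int:
    """Run setup, then the tests, and always run teardown afterwards.

    Returns the exit code reported by run_tests.
    """
    setup()
    try:
        return run_tests()
    finally:
        # Runs whether the tests passed, failed, or raised, so cgroups and
        # the temporary user are cleaned up on every path.
        teardown()
```

Note that the sketch only runs teardown after a successful setup. A setup that fails halfway would need its own cleanup.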

Example output from a test run:

Starting Cgroupv2 Integration Tests as user root
ROOT_CGROUP is /sys/fs/cgroup.
ROOT_CGROUP is /sys/fs/cgroup.
BASE_CGROUP for the test suite is /sys/fs/cgroup/testing.ydOxJ.
TEST_CGROUP for the test suite is /sys/fs/cgroup/testing.ydOxJ/test.
LEAF_CGROUP for the test suite is /sys/fs/cgroup/testing.ydOxJ/leaf.
Starting integration test fixture with:
  ACTION=setup
  ROOT_CGROUP=/sys/fs/cgroup
  BASE_CGROUP=/sys/fs/cgroup/testing.ydOxJ
  TEST_CGROUP=/sys/fs/cgroup/testing.ydOxJ/test
  UNPRIV_USER=cgroup-tester
Running ACTION: setup
Created LEAF_CGROUP at /sys/fs/cgroup/testing.ydOxJ/leaf.
Created TEST_CGROUP at /sys/fs/cgroup/testing.ydOxJ/test.
Moved 3 procs from /sys/fs/cgroup/cgroup.procs to /sys/fs/cgroup/testing.ydOxJ/leaf/cgroup.procs.
Updated +cpu +memory controllers for /sys/fs/cgroup/cgroup.subtree_control
Updated +cpu +memory controllers for /sys/fs/cgroup/testing.ydOxJ/cgroup.subtree_control
Updated +cpu +memory controllers for /sys/fs/cgroup/testing.ydOxJ/test/cgroup.subtree_control
Created unprivilged user cgroup-tester.
cgroup-tester is the owner the cgroup subtree starting at /sys/fs/cgroup/testing.ydOxJ
exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //src/ray/common/cgroup2/tests:sysfs_cgroup_driver_integration_test
-----------------------------------------------------------------------------
Running main() from gmock_main.cc
[==========] Running 36 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 36 tests from SysFsCgroupDriverIntegrationTest
[ RUN      ] SysFsCgroupDriverIntegrationTest.SysFsCgroupDriverIntegrationTestFailsIfNoCgroupTestPathSpecified
[       OK ] SysFsCgroupDriverIntegrationTest.SysFsCgroupDriverIntegrationTestFailsIfNoCgroupTestPathSpecified (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoReadPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoReadPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoWritePermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoWritePermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoExecPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CheckCgroupFailsIfCgroupv2PathButNoExecPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CheckCgroupSucceedsIfCgroupv2PathAndReadWriteExecPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CheckCgroupSucceedsIfCgroupv2PathAndReadWriteExecPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfAlreadyExists
[       OK ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfAlreadyExists (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfAncestorCgroupDoesNotExist
[       OK ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfAncestorCgroupDoesNotExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfOnlyReadPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfOnlyReadPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfOnlyReadWritePermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CreateCgroupFailsIfOnlyReadWritePermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.CreateCgroupSucceedsIfParentExistsAndReadWriteExecPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.CreateCgroupSucceedsIfParentExistsAndReadWriteExecPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersFailsIfCgroupDoesNotExist
[       OK ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersFailsIfCgroupDoesNotExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersFailsIfReadWriteButNotExecutePermissions
[       OK ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersFailsIfReadWriteButNotExecutePermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersSucceedsWithCPUAndMemoryControllersOnBaseCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersSucceedsWithCPUAndMemoryControllersOnBaseCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersSucceedsWithNoAvailableControllers
[       OK ] SysFsCgroupDriverIntegrationTest.GetAvailableControllersSucceedsWithNoAvailableControllers (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfSourceDoesntExist
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfSourceDoesntExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfDestDoesntExist
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfDestDoesntExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForSource
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForSource (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForDest
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForDest (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForAncestor
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesFailsIfNotReadWriteExecPermissionsForAncestor (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesSucceedsWithCorrectPermissionsAndValidCgroups
[       OK ] SysFsCgroupDriverIntegrationTest.MoveAllProcessesSucceedsWithCorrectPermissionsAndValidCgroups (15 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfReadOnlyPermissionsForCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfReadOnlyPermissionsForCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfReadWriteOnlyPermissionsForCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfReadWriteOnlyPermissionsForCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfCgroupDoesNotExist
[       OK ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfCgroupDoesNotExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfControllerNotAvailableForCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.EnableControllerFailsIfControllerNotAvailableForCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfControllerNotEnabled
[       OK ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfControllerNotEnabled (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfReadOnlyPermissionsForCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfReadOnlyPermissionsForCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfReadWriteOnlyPermissionsForCgroup
[       OK ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfReadWriteOnlyPermissionsForCgroup (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfCgroupDoesNotExist
[       OK ] SysFsCgroupDriverIntegrationTest.DisableControllerFailsIfCgroupDoesNotExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.EnableAndDisableControllerSucceedWithCorrectInputAndPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.EnableAndDisableControllerSucceedWithCorrectInputAndPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfCgroupDoesntExist
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfCgroupDoesntExist (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfReadOnlyPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfReadOnlyPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfReadWriteOnlyPermissions
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfReadWriteOnlyPermissions (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfConstraintNotSupported
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfConstraintNotSupported (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfControllerNotEnabled
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfControllerNotEnabled (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfInvalidConstraintValue
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintFailsIfInvalidConstraintValue (0 ms)
[ RUN      ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintSucceeds
[       OK ] SysFsCgroupDriverIntegrationTest.AddResourceConstraintSucceeds (0 ms)
[----------] 36 tests from SysFsCgroupDriverIntegrationTest (21 ms total)

[----------] Global test environment tear-down
[==========] 36 tests from 1 test suite ran. (21 ms total)
[  PASSED  ] 36 tests.
Starting integration test fixture with:
  ACTION=teardown
  ROOT_CGROUP=/sys/fs/cgroup
  BASE_CGROUP=/sys/fs/cgroup/testing.ydOxJ
  TEST_CGROUP=/sys/fs/cgroup/testing.ydOxJ/test
  UNPRIV_USER=cgroup-tester
Running ACTION: teardown
Looking for files to backup/remove ...
Removing files ...
Removing user `cgroup-tester' ...
Warning: group `cgroup-tester' has no more members.
Done.
Deleted unprivilged user cgroup-tester.
Updated -cpu -memory controllers for /sys/fs/cgroup/testing.ydOxJ/test/cgroup.subtree_control
Updated -cpu -memory controllers for /sys/fs/cgroup/testing.ydOxJ/cgroup.subtree_control
Updated -cpu -memory controllers for /sys/fs/cgroup/cgroup.subtree_control
Moved 4 procs from /sys/fs/cgroup/testing.ydOxJ/leaf/cgroup.procs to /sys/fs/cgroup/cgroup.procs.
Deleted /sys/fs/cgroup/testing.ydOxJ/test
Deleted /sys/fs/cgroup/testing.ydOxJ/leaf
Deleted /sys/fs/cgroup/testing.ydOxJ
Teardown successful.
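
The "Updated +cpu +memory" and "Updated -cpu -memory" lines above show the order in which controllers are toggled: enabled from the root downwards and disabled from the deepest cgroup upwards. cgroup v2 requires this, because a controller can only be enabled in a cgroup's `cgroup.subtree_control` if the parent has it enabled, and it cannot be disabled while a child still has it enabled. A minimal sketch of that ordering, where `controller_updates` is a hypothetical helper and not part of the PR:

```python
from typing import List, Sequence, Tuple


def controller_updates(
    cgroups_top_down: Sequence[str],
    controllers: Sequence[str],
    enable: bool,
) -> List[Tuple[str, str]]:
    """Return (file, payload) pairs in the order the writes should happen.

    cgroups_top_down lists cgroup paths from the root to the deepest cgroup.
    """
    sign = "+" if enable else "-"
    payload = " ".join(sign + name for name in controllers)
    # Enable parents before children; disable children before parents.
    ordered = list(cgroups_top_down)
    if not enable:
        ordered.reverse()
    return [(f"{cgroup}/cgroup.subtree_control", payload) for cgroup in ordered]
```

With the paths from the sample run, enabling starts at /sys/fs/cgroup/cgroup.subtree_control and disabling starts at /sys/fs/cgroup/testing.ydOxJ/test/cgroup.subtree_control, which matches the log.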

israbbani and others added 25 commits July 24, 2025 20:39
to perform cgroup operations.

Signed-off-by: irabbani <irabbani@anyscale.com>
instead of clone for older kernel headers < 5.7 (which is what we have
in CI)

Signed-off-by: irabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @israbbani, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've added a comprehensive suite of integration tests for the cgroup sysfs driver. These tests validate the driver's ability to interact correctly with the Linux cgroupv2 filesystem, covering essential operations like cgroup creation, process management within cgroups, and the application of resource constraints. The new tests ensure the robustness and reliability of our cgroup management functionalities.

Highlights

  • New Integration Test Target: A new ray_cc_test target, sysfs_cgroup_driver_integration_test, has been added to the build system, enabling the execution of the new integration test suite.
  • Enhanced Cgroup Test Utilities: Introduced TempCgroupDirectory for creating and managing temporary cgroup directories, along with StartChildProcessInCgroup and TerminateChildProcessAndWaitForTimeout utilities for robust process management within these test cgroups.
  • Comprehensive Driver Functionality Testing: The new integration tests provide extensive coverage for the SysFsCgroupDriver, validating its core functionalities such as CheckCgroupv2Enabled, CheckCgroup permissions, CreateCgroup, GetAvailableControllers, MoveAllProcesses, EnableController, DisableController, and AddResourceConstraint under various success and failure conditions.

@israbbani israbbani changed the title from "[core] Adding integration tests for the cgroup sysfs driver." to "[core] (cgroups 2/n) adding integration tests for the cgroup sysfs driver." on Jul 30, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a comprehensive suite of integration tests for the cgroup sysfs driver. The tests are well-structured and cover a wide range of scenarios, including permission checks, controller management, and resource constraints. The test setup, which involves migrating processes to a leaf cgroup to satisfy the 'no internal processes' constraint, is particularly well-thought-out.

My review includes suggestions for improving code quality and maintainability, such as avoiding throwing exceptions from destructors, improving the random string generation utility, and using more robust methods for conditional compilation and test configuration instead of preprocessor macros. I've also pointed out a minor inconsistency in buffer sizes.

fix CI.

Signed-off-by: irabbani <irabbani@anyscale.com>
@israbbani israbbani added the "go" label (add ONLY when ready to merge, run all tests) on Jul 30, 2025
@ray-gardener ray-gardener bot added the "core" label (Issues that should be addressed in Ray Core) on Aug 28, 2025

@edoakes edoakes left a comment

Looks good. Please write instructions somewhere discoverable for how someone should run these locally.

- integration tests get their own directory
- using bazel supported platforms instead of a no_windows tag (which we
  can probably deprecate with this flag)

Signed-off-by: irabbani <irabbani@anyscale.com>
@edoakes edoakes enabled auto-merge (squash) September 3, 2025 21:53
Signed-off-by: irabbani <irabbani@anyscale.com>
@jjyao jjyao merged commit edbbc94 into master Sep 4, 2025
5 checks passed
@jjyao jjyao deleted the irabbani/cgroups-2 branch September 4, 2025 04:54
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
…iver. (ray-project#55063)

Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: sampan <sampan@anyscale.com>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
…iver. (ray-project#55063)
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
…iver. (ray-project#55063)
edoakes added a commit that referenced this pull request Sep 24, 2025
…cation cgroup (#56549)

This PR stacks on #56522 .

For more details about the resource isolation project see
#54703.

This PR makes the raylet move runtime_env and dashboard agents into
the system cgroup. Workers are now spawned inside the application
cgroup.

It introduces the following:
* I've added a new target `raylet_cgroup_types` which defines the type
used by all functions that need to add a process to a cgroup.
* A new parameter is added to `NodeManager`, `WorkerPool`,
`AgentManager`, and `Process` constructors. The parameter is a callback
that will use the CgroupManager to add a process to the respective
cgroup.
* The callback is created in `main.cc`.
* `main.cc` owns CgroupManager because it needs to outlive the
`WorkerPool`.
* `process.c` calls the callback after fork() in the child process so
nothing else can happen in the forked process before it's moved into the
correct cgroup.
* Integration tests in python for end-to-end testing of cgroups with
system and application processes moved into their respective cgroups.
The tests are inside
`python/ray/tests/resource_isolation/test_resource_isolation_integration.py`
and have similar setup/teardown to the C++ integration tests introduced
in #55063.
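
The fork-then-callback ordering described in the list above can be illustrated with a small POSIX-only sketch. This is Python rather than the C++ in the PR, and `spawn_in_cgroup` and `add_to_cgroup` are hypothetical names. In the real code the callback would use the CgroupManager, while here it is any callable that takes a PID.

```python
import os
from typing import Callable, Sequence


def spawn_in_cgroup(argv: Sequence[str], add_to_cgroup: Callable[[int], None]) -> int:
    """Fork, run the cgroup callback in the child, then exec argv.

    Returns the child's PID in the parent. The callback runs before exec, so
    the target program never executes outside its intended cgroup.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: join the cgroup first, then replace the image.
        try:
            add_to_cgroup(os.getpid())
            os.execvp(argv[0], list(argv))
        finally:
            # Only reached if the callback or exec raised.
            os._exit(127)
    return pid
```

Running the callback in the child, rather than having the parent move the child after the fork, avoids a window in which the new process runs in the parent's cgroup.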

---------

Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
…cation cgroup (ray-project#56549)
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…cation cgroup (#56549)
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…iver. (#55063)
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…cation cgroup (#56549)
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…cation cgroup (ray-project#56549)
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…iver. (ray-project#55063)
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…cation cgroup (ray-project#56549)