-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] (cgroups 2/n) adding integration tests for the cgroup sysfs driver. #55063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
to perform cgroup operations. Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
instead of clone for older kernel headers < 5.7 (which is what we have in CI) Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
…irabbani/cgroups-1
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
…irabbani/cgroups-1
Signed-off-by: irabbani <irabbani@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary of Changes
Hello @israbbani, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've added a comprehensive suite of integration tests for the cgroup sysfs driver. These tests validate the driver's ability to interact correctly with the Linux cgroupv2 filesystem, covering essential operations like cgroup creation, process management within cgroups, and the application of resource constraints. The new tests ensure the robustness and reliability of our cgroup management functionalities.
Highlights
- New Integration Test Target: A new
ray_cc_testtarget,sysfs_cgroup_driver_integration_test, has been added to the build system, enabling the execution of the new integration test suite. - Enhanced Cgroup Test Utilities: Introduced
TempCgroupDirectoryfor creating and managing temporary cgroup directories, along withStartChildProcessInCgroupandTerminateChildProcessAndWaitForTimeoututilities for robust process management within these test cgroups. - Comprehensive Driver Functionality Testing: The new integration tests provide extensive coverage for the
SysFsCgroupDriver, validating its core functionalities such asCheckCgroupv2Enabled,CheckCgrouppermissions,CreateCgroup,GetAvailableControllers,MoveAllProcesses,EnableController,DisableController, andAddResourceConstraintunder various success and failure conditions.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a comprehensive suite of integration tests for the cgroup sysfs driver. The tests are well-structured and cover a wide range of scenarios, including permission checks, controller management, and resource constraints. The test setup, which involves migrating processes to a leaf cgroup to satisfy the 'no internal processes' constraint, is particularly well-thought-out.
My review includes suggestions for improving code quality and maintainability, such as avoiding throwing exceptions from destructors, improving the random string generation utility, and using more robust methods for conditional compilation and test configuration instead of preprocessor macros. I've also pointed out a minor inconsistency in buffer sizes.
src/ray/common/cgroup2/test/sysfs_cgroup_driver_integration_test.cc
Outdated
Show resolved
Hide resolved
src/ray/common/cgroup2/test/sysfs_cgroup_driver_integration_test.cc
Outdated
Show resolved
Hide resolved
fix CI. Signed-off-by: irabbani <irabbani@anyscale.com>
edoakes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Please write instructions somewhere discoverable for how someone should run these locally.
- integration tests get their own directory - using bazel supported platforms instead of a no_windows tag (which we can probably deprecate with this flag) Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>
…irabbani/cgroups-2
…iver. (ray-project#55063) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: sampan <sampan@anyscale.com>
…iver. (ray-project#55063) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
…iver. (ray-project#55063) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
…cation cgroup (#56549) This PR stacks on #56522 . For more details about the resource isolation project see #54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in #55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…cation cgroup (ray-project#56549) This PR stacks on ray-project#56522 . For more details about the resource isolation project see ray-project#54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in ray-project#55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Marco Stephan <marco@magic.dev>
…cation cgroup (#56549) This PR stacks on #56522 . For more details about the resource isolation project see #54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in #55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…iver. (#55063) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…cation cgroup (#56549) This PR stacks on #56522 . For more details about the resource isolation project see #54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in #55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…cation cgroup (ray-project#56549) This PR stacks on ray-project#56522 . For more details about the resource isolation project see ray-project#54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in ray-project#55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…iver. (ray-project#55063) Signed-off-by: irabbani <irabbani@anyscale.com> Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com> Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…cation cgroup (ray-project#56549) This PR stacks on ray-project#56522 . For more details about the resource isolation project see ray-project#54703. This PR the makes the raylet move runtime_env and dashboard agents into the system cgroup. Workers are now spawned inside the application cgroup. It introduces the following: * I've added a new target `raylet_cgroup_types` which defines the type used all functions that need to add a process to a cgroup. * A new parameter is added to `NodeManager`, `WorkerPool`, `AgentManager`, and `Process` constructors. The parameter is a callback that will use the CgroupManager to add a process to the respective cgroup. * The callback is created in `main.cc`. * `main.cc` owns CgroupManager because it needs to outlive the `WorkerPool`. * `process.c` calls the callback after fork() in the child process so nothing else can happen in the forked process before it's moved into the correct cgroup. * Integration tests in python for end-to-end testing of cgroups with system and application processes moved into their respective cgroups. The tests are inside `python/ray/tests/resource_isolation/test_resource_isolation_integration.py` and have similar setup/teardown to the C++ integration tests introduced in ray-project#55063. --------- Signed-off-by: Ibrahim Rabbani <irabbani@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR adds integration tests for the
SysFsCgroupDriver.This PR stacks on [#54898]. For more details about the resource isolation project see #54703.
The design goals of the tests are:
The tests make the following assumptions:
sysfs_cgroup_driver_integration_test_entrypoint.shhas read-write access to the ROOT_CGROUP (which is defined by a variable insidesysfs_cgroup_driver_integration_test_entrypoint.sh)The tests created scoped cgroup hierarchy to simplify cleanup and isolate test runs from each other. This looks like:
Where
The tests follow a peculiar CI pattern:
sh_testthen, no other user has access to the bazel workspace and we cannot run the setup/teardown as a privileged user and the cpp tests as an unprivileged user.Example output from a test run: