-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Cgroups] Extended cgroup support #21322
[Cgroups] Extended cgroup support #21322
Conversation
Opened this as a draft to gather feedback before I go ahead and rebase/push tests. We've been using this change for a while now and have a couple more follow prs relative to resource utilization collection, which I'll open if this gets enough traction. cc @larsrc-google and @zhengwei143 who have worked in this area in the past. |
I learned that Lars is no longer working in this area, so cc @wilwell as well. |
@@ -234,7 +234,8 @@ static void ParseCommandLine(unique_ptr<vector<char *>> args) { | |||
break; | |||
case 'C': | |||
ValidateIsAbsolutePath(optarg, args->front(), static_cast<char>(c)); | |||
opt.cgroups_dir.assign(optarg); | |||
opt.cgroups_dirs.emplace_back(optarg); | |||
opt.writable_files.emplace_back(optarg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to make the cgroups dir writable? We wouldn't want spawns to be able to write to the cgroups dir, circumventing any existing cgroup limits placed on the spawn by Bazel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, I believe this comes from the original implementation, I wasn't sure why it was added to the writable dirs so I kept it around. Will remove as I rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, this was removed recently in da8c080 which clamps down on the write access to the cgroup dir (previously the spawn could actually modify its own limits).
// We cannot leave the cgroups around and delete them only when we delete the sandboxes | ||
// because linux has a hard limit of 65535 memory controllers. | ||
// Ref. https://github.com/torvalds/linux/blob/58d4e450a490d5f02183f6834c12550ba26d3b47/include/linux/memcontrol.h#L69 | ||
cgroups.remove(context.getId()).ifPresent(VirtualCGroup::delete); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting, I don't think we've encountered this in our internal builds presumably because we've not hit this limit for the number of sandboxed spawns created during a build. (I think this is fine though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, took some time for this one to manifest.
reporter.handle(Event.warn("Found non-writable cgroup v1 at " + cgroup)); | ||
continue; | ||
} | ||
cgroup = cgroup.resolve("bazel_" + ProcessHandle.current().pid() + ".slice"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong - I believe this comes from systemd
with the concepts of scopes and slices. From what I checked, I don't think its mandatory for either v1 / v2 to use systemd; but we don't we don't do a good job in CgroupsInfo
distinguishing this atm either (the current implementation assumes systemd for v2 and without for v1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this comes from systemd. In general the name can be arbitrary, regardless of it being using systemd or not, so I kept the syntax. I don't have a strong opinion about that, so let me know if you have any preference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I think we should follow convention to the best of our ability - I think we can infer this from Mount.path() == "/sys/fs/cgroup"
(for systemd
),
return !getPath().resolve("cgroup.controllers").toFile().exists(); | ||
} | ||
|
||
static <T extends Controller> T getDefault(Class<T> clazz) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to implement a simple NO-OP class to handle the case when cgroups are not available.
See:
public static class InvalidCgroupsInfo extends CgroupsInfo { |
Reason being, we do want some control in the code when working with controllers where cgroups are not available (either due to permissions or setup) to handle this gracefully - e.g. ignore the flag entirely with a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I was considering adding a flag for the user to decide (e.g. --cgroup_fail_if_unavailable
), but can be discussed/added separately
Overall I do like many aspects in the structuring of this code w.r.t cgroups, but I would like this to eventually be upstreamed into I appreciate the effort here in modelling the classes to reflect the different implementations in hierarchy between V1 and V2. Some overall directions to work towards if you're attempting to upstream this into
Left some side-comments throughout the code just to start a discussion first (don't need to spend time working on them atm). |
@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) { | |||
public boolean sandboxHermeticTmp; | |||
|
|||
@Option( | |||
name = "experimental_sandbox_memory_limit_mb", | |||
defaultValue = "0", | |||
name = "experimental_sandbox_limits", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing a separate mechanism for sandboxed spawns, I wonder if it's better to unify this with the specified resource_set
of the sandboxed action. Then just use --experimental_sandbox_enforce_resources_regexp
to control for which actions to enforce this on.
Is there a reason to have this as a separate mechanism?
cc @wilwell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of reasons why I find this flag very useful:
- You cannot override the resource set of arbitrary actions, only tests (see Consistent resource management across actions and resource types #19679) and for most actions the default resource set is underestimated
- I found it particularly useful to pass
--experimental_sandbox_enforce_resources_regexp=DO_NOT_MATCH --experimental_sandbox_limits=cpu=<n> <target>
to test<target>
with a different limit other then the one specified at the target level. Without this flag, I'd have to go and change the BUILD file, which makes this harder to iterate on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You cannot override the resource set of arbitrary actions
By this I assume you mean actions outside of your control - e.g. in build rules not owned by your project / default starlark resources.
for most actions the default resource set is underestimated
While I don't think fast testing / iteration is a good reason to introduce this (whatever value specified here should be relatively stable without users having to tinker frequently), I can see a legitimate case for this in the event that resources are underestimated. Since the ResourceManager
makes reservations purely off of estimated values (and kind-of makes a claim that the action will have some minimum amount of resources1), this serves as a mechanism to upper bound resources.
1 Not entirely true since Bazel assumes it can use all the resources available to the machine and doesn't actively monitor actually available resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this as a separate flag that overrides --experimental_sandbox_memory_limit_mb
(and document that behavior) so we don't make this a breaking change. Internally I don't believe we use this flag, but I'm not sure about external users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- A single CgroupsInfo should encapsulate all controllers regardless of the implementation (v1 or v2), even if they have different paths.
Yes, this is opaque to all the users of VirtualCgroup
, which see all the controller as if they come from the same (virtual) cgroup :)
- Try to separate as much v1 / v2 cgroups creation logic as possible for future maintainability.
I think this is already the case, with the exception of the VirtualCgroup class, which has to expose both v1 and v2 cgroups transparently so has to deal a bit with their differences. I can try and break out some if/else, but I don't think it will be possible to break it out much more than that.
- We want to be able to handle failures gracefully, ignore errors with a warning; this is quite susceptible to permissions issues, e.g. if the Bazel process were to start up in a root (user) cgroup, we won't have the proper write permissions to create any sub-cgroups. Since this is mainly build performance related, I would rather not break the build due to such issues.
Makes sense. I've also faced similar challenges in setting up the right enviroment for this feature to work. I've been considering adding a new startup option that will have the client start the server in the right cgroup (something on the lines of bazel --cgroup-parent
/foo/bar` build ...), so that you just need to worry about ensuring that user-writable cgroup exists. WDYT?
Left some side-comments throughout the code just to start a discussion first (don't need to spend time working on them atm).
Thank you! I'll try rebasing and updating the code to match the current state of the cgroup feature to make the conversation easier!
// We cannot leave the cgroups around and delete them only when we delete the sandboxes | ||
// because linux has a hard limit of 65535 memory controllers. | ||
// Ref. https://github.com/torvalds/linux/blob/58d4e450a490d5f02183f6834c12550ba26d3b47/include/linux/memcontrol.h#L69 | ||
cgroups.remove(context.getId()).ifPresent(VirtualCGroup::delete); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, took some time for this one to manifest.
@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) { | |||
public boolean sandboxHermeticTmp; | |||
|
|||
@Option( | |||
name = "experimental_sandbox_memory_limit_mb", | |||
defaultValue = "0", | |||
name = "experimental_sandbox_limits", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of reasons why I find this flag very useful:
- You cannot override the resource set of arbitrary actions, only tests (see Consistent resource management across actions and resource types #19679) and for most actions the default resource set is underestimated
- I found it particularly useful to pass
--experimental_sandbox_enforce_resources_regexp=DO_NOT_MATCH --experimental_sandbox_limits=cpu=<n> <target>
to test<target>
with a different limit other then the one specified at the target level. Without this flag, I'd have to go and change the BUILD file, which makes this harder to iterate on.
reporter.handle(Event.warn("Found non-writable cgroup v1 at " + cgroup)); | ||
continue; | ||
} | ||
cgroup = cgroup.resolve("bazel_" + ProcessHandle.current().pid() + ".slice"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this comes from systemd. In general the name can be arbitrary, regardless of it being using systemd or not, so I kept the syntax. I don't have a strong opinion about that, so let me know if you have any preference
@@ -234,7 +234,8 @@ static void ParseCommandLine(unique_ptr<vector<char *>> args) { | |||
break; | |||
case 'C': | |||
ValidateIsAbsolutePath(optarg, args->front(), static_cast<char>(c)); | |||
opt.cgroups_dir.assign(optarg); | |||
opt.cgroups_dirs.emplace_back(optarg); | |||
opt.writable_files.emplace_back(optarg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, I believe this comes from the original implementation, I wasn't sure why it was added to the writable dirs so I kept it around. Will remove as I rebase.
return !getPath().resolve("cgroup.controllers").toFile().exists(); | ||
} | ||
|
||
static <T extends Controller> T getDefault(Class<T> clazz) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I was considering adding a flag for the user to decide (e.g. --cgroup_fail_if_unavailable
), but can be discussed/added separately
That's what I love about your solution :) Really abstracts away the nitty gritty of cgroups from the API.
W.r.t cgroups creation I think that I think moving forward, we could work the concept of
I've actually been wanting to do that as well! One idea was using |
37fda84
to
c668e0e
Compare
I tried to get it working with |
7e45094
to
dbf3b51
Compare
Will review this next week - have more pressing work upgrading Bazel's Java language level to 21 (which will be very soon!). Will also need to get this tested internally since we are relying on cgroups features in production but this refactor completely revamps the implementation - so it might take a while. Also considering whether we need to keep the original implementation and guard this behind a flag, will make a decision after I have a closer look at the code. |
@@ -306,4 +306,7 @@ public enum WorkerProtocolFormat { | |||
* segment from the paths of all inputs and outputs. | |||
*/ | |||
public static final String SUPPORTS_PATH_MAPPING = "supports-path-mapping"; | |||
|
|||
/** Disables cgroups for a sandbox spawn */ | |||
public static final String NO_SUPPORTS_CGROUPS = "no-supports-cgroups"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for having a per-spawn disabling of cgroups? (Not saying that it isn't a good idea, I'm just interested to know).
Can we split this out into another PR? It is diverging from the original PR of adding a new cgroups implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for having a per-spawn disabling of cgroups? (Not saying that it isn't a good idea, I'm just interested to know).
Gives a bit more control over where this feature is applied. This is not only per-spawn, but also per-action (rules owners can apply this in the rule implementation).
Can we split this out into another PR? It is diverging from the original PR of adding a new cgroups implementation.
Done
@@ -35,15 +36,20 @@ public static CgroupsInfoCollector instance() { | |||
return instance; | |||
} | |||
|
|||
public ResourceSnapshot collectResourceUsage(Map<Long, CgroupsInfo> pidToCgroups, Clock clock) { | |||
public ResourceSnapshot collectResourceUsage(Map<Long, VirtualCGroup> pidToCgroups, Clock clock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace VirtualCgroup
with some interface that implements methods to retrieve the memory usage from a cgroup object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not remove these files first and introduce a flag to flip between the two cgroups implementations. We have internal prod systems relying on the current code and I would rather flip the behavior with a flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) { | |||
public boolean sandboxHermeticTmp; | |||
|
|||
@Option( | |||
name = "experimental_sandbox_memory_limit_mb", | |||
defaultValue = "0", | |||
name = "experimental_sandbox_limits", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add this as a separate flag that overrides --experimental_sandbox_memory_limit_mb
(and document that behavior) so we don't make this a breaking change. Internally I don't believe we use this flag, but I'm not sure about external users.
converter = RegexPatternConverter.class, | ||
help = | ||
"If true, actions whose mnemonic matches the input regex" | ||
+ " will have their resources request enforced as limits, ovverriding" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/ovverriding/overriding/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
static public VirtualCGroup NULL = new AutoValue_VirtualCGroup(null, null, ImmutableSet.of()); | ||
|
||
public static VirtualCGroup create() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename this to something like createRootCgroup
so that we can distinguish what its actually creating? Also, use package private access since this is the exposed API here should be getInstance
. (same for the methods below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
this.paths().stream().map(Path::toFile).filter(File::exists).forEach(File::delete); | ||
} | ||
|
||
public VirtualCGroup child(String name) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/child/createChild/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/main/java/com/google/devtools/build/lib/sandbox/cgroups/VirtualCGroup.java
Show resolved
Hide resolved
|
||
public interface Controller { | ||
default boolean isLegacy() throws IOException { | ||
return !getPath().resolve("cgroup.controllers").toFile().exists(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to make a filesystem call here everytime we want to check this method? Since the controllers are already specialized into Legacy
and Unified
, can we just override this there with true
and false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out, updated
@@ -176,7 +184,7 @@ public void buildStarting(BuildStartingEvent event) { | |||
} | |||
|
|||
// Override the flag value if we can't actually use cgroups so that we at least fallback to ps. | |||
boolean useCgroupsOnLinux = options.useCgroupsOnLinux && CgroupsInfo.isSupported(); | |||
boolean useCgroupsOnLinux = options.useCgroupsOnLinux && rootCgroup.memory() != null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's use some method like hasMemoryController()
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to change anyways if we want to start collecting cpu usage as well
dbf3b51
to
77a542a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, with a few more minor changes.
I will be on PTO for the next 2-3 weeks, so will be unable to test internally and merge these changes until then. If this is urgent, perhaps @wilwell can help to merge this.
src/main/java/com/google/devtools/build/lib/sandbox/SandboxOptions.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/worker/SingleplexWorker.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/worker/SandboxedWorker.java
Outdated
Show resolved
Hide resolved
(I'm working on cleaning this up internally now). |
a424845
to
c5f0158
Compare
@zhengwei143 do you have an update? |
I managed to clean most things up and am still going through internal review (will need to do more refactoring), that has stalled due to other more urgent work but I hope to pick it up soon. Sorry this has taken so long... |
Update: Will push this through at the start of next week. I was working on replacing |
Thank you @zhengwei143! |
This extends the cgroup support in bazel to the cpu controller. Limits can be specified with the flag
--experimental_sandbox_limits=<resource>=<limit>
. Additionally, it is possible to get finer grained resource limits with--experimental_sandbox_enforce_resources=<regex>
, which will cause all actions whose mnemonic matches the regex to use their allocated resource as limits, particularly useful for tests where resource allocation is controlled by theresources:*:*
tags.