runc run/create: refuse non-empty cgroup; runc exec: refuse frozen cgroup #3131
Conversation
It does not make sense to calculate slice and unit 10+ times. Move those out of the loop. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As ExpandSlice("system.slice") returns "/system.slice", there is no need to call it for such paths (and the slash will be added by path.Join anyway). The same optimization was already done for v2 as part of commit bf15cc9. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return an error (so they can do more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is supplied (rather than checking that other paths, if supplied, are the same).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
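For illustration, here is a minimal Go sketch of how the simplified entry point could be used after this change. The import paths, field names, and the Exists call are assumptions based on this commit message rather than a verified copy of the runc API:

```go
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups/manager"
	"github.com/opencontainers/runc/libcontainer/configs"
)

func main() {
	cg := &configs.Cgroup{
		Name:   "mycontainer",
		Parent: "system",
		// Rootless and Systemd now live in the cgroup config itself.
		Systemd:   false,
		Rootless:  false,
		Resources: &configs.Resources{}, // assumed to be required (non-nil)
	}

	// One call replaces the old per-backend instantiation logic, and it can
	// fail early instead of deferring errors to Apply.
	m, err := manager.New(cg)
	if err != nil {
		fmt.Println("cannot create cgroup manager:", err)
		return
	}

	// A freshly created manager can be used for queries such as Exists.
	fmt.Println("cgroup exists:", m.Exists())
}
```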
Now fs.go is not very readable as its public API functions are intermixed with internal stuff about getting cgroup paths. Move that out to paths.go, without changing any code. Same for the tests -- move paths-related tests to paths_test.go. This commit is separate to make the review easier. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case c.Path is set, c.Name and c.Parent are not used, and so calls to utils.CleanPath are entirely unnecessary. Move them inside the "if" statement body. Get rid of the intermediate cgPath variable; it is not needed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As this is called from the Apply() method, it's a natural name. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Dismantle and remove struct cgroupData. It contained three unrelated entities (cgroup paths, pid, and resources), and made the code harder to read. Most importantly, though, it is not needed. Now, subsystems' Apply methods take path, resources, and pid. To a reviewer -- the core of the changes is in fs.go and paths.go; the rest of it is adapting to the new signatures and related test changes.
2. Dismantle and remove struct cgroupTestUtil. This is a followup to the previous item -- since cgroupData is gone, there is nothing to hold in cgroupTestUtil. The change itself is very small (see util_test.go), but this patch is big because of it -- mostly because we had to replace helper.cgroup.Resources with &config.Resources{}.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
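Hedged illustration of the shape this gives the per-controller API: instead of one cgroupData bundle, each controller receives exactly what it needs. The interface below is an assumption reconstructed from this commit message (the real interface in libcontainer/cgroups/fs is unexported and may differ in details):

```go
package fs // illustrative only

import "github.com/opencontainers/runc/libcontainer/configs"

// Assumed post-refactoring controller interface: Apply takes the cgroup path,
// the resources, and the pid directly, with no intermediate cgroupData struct.
type subsystem interface {
	// Name returns the controller name, e.g. "cpu" or "memory".
	Name() string
	// Apply creates the cgroup at path (if needed) and adds pid to it.
	Apply(path string, r *configs.Resources, pid int) error
	// Set writes the limits from r into the cgroup at path.
	Set(path string, r *configs.Resources) error
}
```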
1. Separate path initialization logic from Apply to initPaths, and call initPaths from NewManager, so:
   - we can error out early (in NewManager rather than Apply);
   - we always have m.paths available (e.g. in Destroy or Exists);
   - we do not unnecessarily call subsysPath from Apply in case the paths were already provided.
2. Add a check for non-nil cgroups.Resources to NewManager, since initPaths, as well as some controllers' Apply methods, need it.
3. Move the cgroups.Resources.Unified check from Apply to NewManager, so we can error out early (the same check exists in Set).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is already documented but I guess more explanations won't hurt. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This way we:
- won't re-initialize the paths if they were provided;
- will always have paths ready for every method.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
cgName and cgParent are only used when cgPath is empty, so move their cleaning to the body of the appropriate "if" statement. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes the same issue as e.g. commit 4f8ccc5 but in a more universal way. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Many operations require fsMgr, so let's create it right in NewUnifiedManager and reuse. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Cgroup controllers should never panic, and yet sometimes they do. Add a unit test to check that controllers never panic when called with nil arguments and/or resources, and fix a few cases that were found. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
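A sketch of what such a regression test could look like; the subsystems list and the interface it iterates over are assumptions carried over from the sketch above, not the exact test added by this commit:

```go
package fs

import (
	"testing"

	"github.com/opencontainers/runc/libcontainer/configs"
)

// TestControllersDoNotPanic feeds nil/empty arguments to every controller and
// fails if any of them panics (errors are acceptable, panics are not).
func TestControllersDoNotPanic(t *testing.T) {
	for _, s := range subsystems { // assumes a package-level slice of controllers
		s := s
		t.Run(s.Name(), func(t *testing.T) {
			defer func() {
				if r := recover(); r != nil {
					t.Fatalf("%s panicked: %v", s.Name(), r)
				}
			}()
			_ = s.Apply(t.TempDir(), nil, -1)
			_ = s.Set(t.TempDir(), &configs.Resources{})
		})
	}
}
```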
runc delete -f is not working for a paused container, since in cgroup v1 SIGKILL does nothing if a process is frozen (unlike cgroup v2, in which you can kill a frozen process with a fatal signal). Theoretically, we only need this for v1, but doing it for v2 as well is OK. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
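The gist: on cgroup v1 a SIGKILL queued for a frozen task is only acted on after the task is thawed, so a forced delete must thaw the cgroup after signalling. A rough sketch of that order of operations (the function is made up for illustration; the Manager methods used are assumptions, not quoted from the actual fix):

```go
package main // illustrative sketch only

import (
	"github.com/opencontainers/runc/libcontainer/cgroups"
	"github.com/opencontainers/runc/libcontainer/configs"
	"golang.org/x/sys/unix"
)

// killPaused sends SIGKILL to every process in the cgroup and then thaws it,
// so that the pending signals are actually delivered on cgroup v1.
func killPaused(m cgroups.Manager) error {
	pids, err := m.GetAllPids()
	if err != nil {
		return err
	}
	for _, pid := range pids {
		_ = unix.Kill(pid, unix.SIGKILL) // stays pending while the task is frozen
	}
	// Thawing lets the frozen tasks run just long enough to die from SIGKILL.
	return m.Freeze(configs.Thawed)
}
```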
Currently, if a container is paused (or its cgroup is frozen), runc exec just hangs, and it is not obvious why. Refuse to exec in a frozen container. Add a test case. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
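A rough sketch of the kind of pre-exec check this describes; the helper name is made up, and the real change lives in runc's exec path and works off the container status:

```go
package main // illustrative sketch only

import (
	"errors"

	"github.com/opencontainers/runc/libcontainer/cgroups"
	"github.com/opencontainers/runc/libcontainer/configs"
)

// checkNotFrozen refuses to add a new process to a frozen cgroup, turning a
// silent hang into an explicit error.
func checkNotFrozen(m cgroups.Manager) error {
	state, err := m.GetFreezerState()
	if err != nil {
		return err
	}
	if state == configs.Frozen {
		return errors.New("cannot exec in a paused container, unpause it first")
	}
	return nil
}
```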
Currently runc allows multiple containers to share the same cgroup (for example, by having the same cgroupPath in config.json). While such a shared configuration might be OK, there are some issues:
- When each container has its own resource limits, the order in which the containers start determines whose limits are effectively applied.
- When one of the containers is paused, all the others are paused, too.
- When a container is paused, any attempt to do runc create/run/exec ends up with runc init stuck inside a frozen cgroup.
- When the systemd cgroup manager is used, this becomes even worse: stopping (or even a failed start of) any container results in a "stopTransientUnit" command being sent to systemd, so (depending on unit properties) other containers can receive SIGTERM, be killed after a timeout, etc.
Any of the above may lead to various hard-to-debug situations in production (runc init stuck, cgroup removal errors, wrong resource limits, init not reaping zombies, etc.). One obvious solution is to refuse a non-empty cgroup when starting a new container. This is exactly what this commit implements.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
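A minimal sketch of an emptiness check along these lines, reading the pids under the container's cgroup path; the helper name is invented and the real check in this commit may be implemented differently:

```go
package main // illustrative sketch only

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups"
)

// requireEmptyCgroup fails if any process already lives in the given cgroup,
// which is what allows refusing a shared (non-empty) cgroup at create/run time.
func requireEmptyCgroup(path string) error {
	pids, err := cgroups.GetAllPids(path)
	if err != nil {
		return err
	}
	if len(pids) > 0 {
		return fmt.Errorf("cgroup %q is not empty: %d process(es) found", path, len(pids))
	}
	return nil
}
```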
Sometimes a container cgroup already exists but is frozen. When this happens, runc init hangs, and it's not clear what is going on. Refuse to run in a frozen cgroup; add a test case. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
OK, it seems that I have to split this one into a few smaller PRs. Alas, the biggest part is the cgroup manager refactoring, which seemingly can't be split.
Regarding this part:
A long time ago there were discussions of adding …
I've found traces of this when working on #3136, and yes, this seems not to be implemented. Looking into moby/moby#18654, it seems that it was suggested to use a common cgroup parent for containers instead (ultimately this is what the "pod" concept, introduced later, does). Yet I don't know if anyone actually shares a cgroup between multiple containers in the wild, and if so, this can be a breaking change. I still think it is needed, as such setups introduce a number of issues, outlined above as well as in #3132, and are therefore broken. I guess unless we try this out we'll never know whether this is used in the wild, and how often. If it is used often, we can revert; if it's used rarely, we can add a mechanism to skip the check (an env var, a flag file, or something).
Yeah, I agree; the fact that we haven't gotten any bug reports about how broken it is leads me to believe nobody is using it.
OK, this is being split into a few more digestible PRs:
I. cgroup manager refactoring
Very high level overview:
- Change all cgroup managers' constructors (NewManager) to return errors (and now we can use a new instance for e.g. Destroy or Exists, which was not possible before).
- Replace the complicated logic of choosing a cgroup manager with a simple manager.New() call.
- Error out early in NewManager on obviously bad configuration (e.g. Resources == nil).
This helps further commits, and makes the cgroup manager easier to use from e.g. Kubernetes.
Fixes: #3177
Fixes: #3178
II. runc run/create: refuse non-empty cgroup
Currently runc allows multiple containers to share the same cgroup (for example, by having the same cgroupPath in config.json). While such a shared configuration might be OK, there are some issues:
- When each container has its own resource limits, the order in which the containers start determines whose limits are effectively applied.
- When one of the containers is paused, all the others are paused, too.
- When a container is paused, any attempt to do runc create/run/exec ends up with runc init stuck inside a frozen cgroup.
- When the systemd cgroup manager is used, this becomes even worse: stopping (or even a failed start of) any container results in a "stopTransientUnit" command being sent to systemd, so (depending on unit properties) other containers can receive SIGTERM, be killed after a timeout, etc.
All this may lead to various hard-to-debug situations in production (runc init stuck, cgroup removal errors, wrong resource limits, container init not working as a zombie reaper, etc.).
One obvious solution is to require a non-existent or empty cgroup for the new container, and fail otherwise. This is exactly what this PR implements.
III. runc delete -f: don't hang on a paused container
runc delete -f used to hang if a container is paused (on cgroup v1).
Fix this, add a test case.
Fixes: #3134
IV. runc exec: refuse paused container
This bugged me a few times during runc development. A new container is
run, and suddenly runc init is stuck, 🎶 and nothing ever happens, and
I wonder... 🎶
Figuring out that the cause is a (pre-created) frozen cgroup is not very obvious, to say the least.
The fix is to add a check that refuses to exec in a paused container.
A (very bad) alternative to that would be to thaw the cgroup.
Implement the fix, add a test case.
V. runc run: refuse cgroup if frozen
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.
Refuse to run in a frozen cgroup; add a test case.
Fixes: #3132
Proposed changelog entry