Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cgroup2 mount for rootless case #2818

Merged
merged 6 commits into from
Apr 23, 2021

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Feb 24, 2021

In case of rootless, cgroup2 mount is not possible (see #2158
for more details), so since commit 9c81440 runc bind-mounts the whole
/sys/fs/cgroup into container.

Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.

The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.

Before:

        $ ./runc run aaa
        # find /sys/fs/cgroup/ -type d
        /sys/fs/cgroup
        /sys/fs/cgroup/user.slice
        /sys/fs/cgroup/user.slice/user-1000.slice
        /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
        ...
        # ls -l /sys/fs/cgroup/cgroup.controllers
        -r--r--r--    1 nobody   nogroup          0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
        # wc -w /sys/fs/cgroup/cgroup.procs
        142 /sys/fs/cgroup/cgroup.procs
        # cat /sys/fs/cgroup/memory.current
        cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory

After:

        # find /sys/fs/cgroup/ -type d
        /sys/fs/cgroup/
        # ls -l /sys/fs/cgroup/cgroup.controllers
        -r--r--r--    1 root     root             0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
        # wc -w /sys/fs/cgroup/cgroup.procs
        2 /sys/fs/cgroup/cgroup.procs
        # cat /sys/fs/cgroup/memory.current
        577536

PS found while trying to enable rootless for the test case from #2786

Suggested changelog entry

  • fixed rootless containers to not expose host's /sys/fs/cgroup hierarchy

@kolyshkin
Copy link
Contributor Author

Not sure if this is rc94 of post-1.0 material -- WDYT @cyphar @mrunalp @AkihiroSuda ?

@AkihiroSuda AkihiroSuda added this to the 1.0.0-rc94 milestone Feb 24, 2021
@AkihiroSuda
Copy link
Member

Good fit for rc94 IMO

@kolyshkin kolyshkin force-pushed the rootless-cgroup2-mount branch 2 times, most recently from df16ed8 to 08d657e Compare February 26, 2021 02:47
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Feb 26, 2021

Checked that the added test case is failing without the fix:

$ sudo ROOTLESS_TESTPATH=/cgroups.bats ./tests/rootless.sh
...
not ok 13 runc run (cgroupv2 mount inside container)
# (in test file tests/integration/cgroups.bats, line 296)
#   `[ "$(wc -l <<<"$output")" -eq 1 ]' failed
# runc spec --rootless (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-3210676/runc.4mlpW2/tty/sock test_cgroups_unified (status=0):
# 
# runc exec test_cgroups_unified find /sys/fs/cgroup/ -type d (status=0):
# /sys/fs/cgroup/
# /sys/fs/cgroup/user.slice
# /sys/fs/cgroup/user.slice/user-1000.slice
# /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
# /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/vte-spawn-98b64ba6-eca3-41f7-b7e4-2f94a0613230.scope
# /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/gvfs-goa-volume-monitor.service
# /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/vte-spawn-a5d79e49-117d-4728-b261-ccba710e136d.scope
...

@kolyshkin
Copy link
Contributor Author

OK, I have totally missed the fact that rootless spec removes the /sys/fs/cgroup mount and therefore my code doesn't work (it worked locally because I have /sys/fs/cgroup mount in my spec).

This needs more work.

@kolyshkin kolyshkin force-pushed the rootless-cgroup2-mount branch from 9accc2f to c0b2be6 Compare February 28, 2021 18:57
@kolyshkin
Copy link
Contributor Author

Failure in centos 7 rootful fs test:

not ok 9 runc run (cgroup v1 + unified resources should fail)
(from function teardown_bundle' in file tests/integration/helpers.bash, line 481, from function teardown' in test file tests/integration/cgroups.bats, line 6)
`teardown_bundle' failed
runc spec (status=0):

runc run -d --console-socket /tmp/bats-run-7694/runc.Ujskgr/tty/sock test_cgroups_unified (status=1):
time="2021-02-28T19:06:43Z" level=warning msg="unable to get oom kill count" error="no directory specified for memory.oom_control"
time="2021-02-28T19:06:43Z" level=error msg="unlinkat /tmp/bats-run-7694/runc.Ujskgr/state/test_cgroups_unified/runc.JCDSLu: device or resource busy"
time="2021-02-28T19:06:43Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:365: applying cgroup configuration for process caused: invalid configuration: cannot use unified on cgroup v1"
load container test_cgroups_unified: container "test_cgroups_unified" does not exist
rm: cannot remove ‘/tmp/bats-run-7694/runc.Ujskgr/state/test_cgroups_unified/runc.JCDSLu’: Device or resource busy

Looks unrelated but worrisome.

@kolyshkin
Copy link
Contributor Author

Fedora VM fails to run dnf transaction:

default: Errors during downloading metadata for repository 'fedora-modular':
default: - Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=fedora-modular-33&arch=x86_64 (IP: 209.132.190.2)
default: Error: Failed to download metadata for repo 'fedora-modular': Cannot prepare internal mirrorlist: Status code: 503 for https://mirrors.fedoraproject.org/metalink?repo=fedora-modular-33&arch=x86_64 (IP: 209.132.190.2)

CI restarted

@kolyshkin kolyshkin requested a review from AkihiroSuda March 2, 2021 22:34
@kolyshkin
Copy link
Contributor Author

@cyphar @AkihiroSuda @mrunalp PTAL

@AkihiroSuda
Copy link
Member

I can't repro the issue with the current master (497cd0c).

<ENGINE> run --runtime=runc --cgroupns=private --memory 42m alpine cat /sys/fs/cgroup/memory.max shows 44040192 as expected. (<ENGINE> = rootless docker, podman, nerdctl)

@AkihiroSuda
Copy link
Member

@kolyshkin PTAL↑

@kolyshkin kolyshkin removed this from the 1.0.0-rc94 milestone Apr 4, 2021
@kolyshkin
Copy link
Contributor Author

Need more time to look into this. Removed rc94 milestone for now.

@kolyshkin kolyshkin marked this pull request as draft April 12, 2021 18:40
@kolyshkin
Copy link
Contributor Author

Need more time to have another look into this; converting to draft for now.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Apr 14, 2021

@AkihiroSuda I guess checked the upper layer tools (podman/docker/nerdctl) generate the correct config.json for this case, while runc spec does not does not add userns, while runc does, this is why I am seeing this with bare runc and you don't see it with podman/docker/nerdctl.

There are other similar cases (for example, in case we use host pidns, the /proc mounts needs to be a bind mount, not the fstype=proc one), and ultimately the question is -- do we want runc to handle this (modify/interpret the mounts accordingly), or do we want the upper layers to provide the correct OCI runtime spec for these cases (the third way, I guess, is to overload runc spec with kludges).

I think this was discussed before and @crosbymichael was for literal spec interpretation (i.e. runc should not adopt mounts), and yet in some cases runc already does that (don't have an example from the top of my mind, will look it up later).

@kolyshkin
Copy link
Contributor Author

@AkihiroSuda is it crucial that we use upper-level tools to repro? I don't know how to correctly make them add userns into OCI spec (so we can hit the codepath added by 9c81440).

I can easily reproduce this with bare runc (see last two commits).

Here is a full manual repro (Fedora 33, cgroup v2):

[kir@kir-rhat runc]$ mkdir foo
[kir@kir-rhat runc]$ cd foo
[kir@kir-rhat foo]$ mkdir rootfs
[kir@kir-rhat foo]$ eval `../tests/integration/get-images.sh`
[kir@kir-rhat foo]$ tar -C rootfs -xf $BUSYBOX_IMAGE
[kir@kir-rhat foo]$ ../runc spec --rootless
[kir@kir-rhat foo]$ ../runc run xx4
/ # ls -l /sys/fs/cgroup
total 0
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 cgroup.controllers
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 cgroup.max.depth
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 cgroup.max.descendants
-rw-rw-r--    1 nobody   nobody           0 Apr  6 22:02 cgroup.procs
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 cgroup.stat
-rw-r--r--    1 nobody   nobody           0 Apr 15 00:13 cgroup.subtree_control
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 cgroup.threads
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 cpu.pressure
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 cpu.stat
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 cpuset.cpus.effective
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 cpuset.mems.effective
drwxr-xr-x    2 nobody   nobody           0 Mar  5 00:53 init.scope
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 io.cost.model
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 io.cost.qos
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 io.pressure
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 io.stat
drwxr-xr-x    2 nobody   nobody           0 Apr  6 20:35 machine.slice
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 memory.numa_stat
-rw-r--r--    1 nobody   nobody           0 Apr 19 23:56 memory.pressure
-r--r--r--    1 nobody   nobody           0 Apr 19 23:56 memory.stat
drwxr-xr-x    3 nobody   nobody           0 Mar 19 21:03 pod_123.slice
drwxr-xr-x    2 nobody   nobody           0 Mar 10 23:20 sdet35
drwxr-xr-x  138 nobody   nobody           0 Apr 19 23:51 system.slice
drwxr-xr-x    2 nobody   nobody           0 Apr 14 23:31 test
drwxr-xr-x    3 nobody   nobody           0 Apr 15 17:45 user.slice
drwxr-xr-x    3 nobody   nobody           0 Mar  9 03:04 woo
/ #

We see all the cgroups, while we should only see ours.

Another repro is via our integration tests -- just modify this test case (from c3ffd2e#diff-df7d610d8608d0748bfcbaab39894c9e189211cddf54df5a955811287d199d5d) so it can be run as rootless (same as in one of this PR commits):

diff --git a/tests/integration/cgroups.bats b/tests/integration/cgroups.bats
index c5a927bd..2c49dfef 100644
--- a/tests/integration/cgroups.bats
+++ b/tests/integration/cgroups.bats
@@ -138,7 +138,8 @@ function setup() {
 }
 
 @test "runc run (blkio weight)" {
-       requires root cgroups_v2
+       requires cgroups_v2
+       [[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup
 
        set_cgroups_path
        update_config '.linux.resources.blockIO |= {"weight": 750}'

and run it in a rootless mode:

[kir@kir-rhat runc]$ sudo make localrootlessintegration ROOTLESS_TESTPATH=/cgroups.bats
....
[02] run rootless tests ... (cgroup)
++ cat /sys/fs/cgroup/cgroup.controllers
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +cpuset
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +cpu
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +io
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +memory
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +hugetlb
+ for f in $(cat "$CGROUP_MOUNT/cgroup.controllers")
+ echo +pids
+ set +x
path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/var/lib/snapd/snap/bin
1..10
ok 1 runc create (no limits + no cgrouppath + no permission) succeeds
ok 2 runc create (rootless + no limits + cgrouppath + no permission) fails with permission error # skip test requires rootless_no_cgroup
ok 3 runc create (rootless + limits + no cgrouppath + no permission) fails with informative error # skip test requires rootless_no_cgroup
ok 4 runc create (limits + cgrouppath + permission on the cgroup dir) succeeds
ok 5 runc exec (limits + cgrouppath + permission on the cgroup dir) succeeds
ok 6 runc exec (cgroup v2 + init process in non-root cgroup) succeeds # skip test requires root
ok 7 runc run (cgroup v1 + unified resources should fail) # skip test requires root
not ok 8 runc run (blkio weight)
# (in test file tests/integration/cgroups.bats, line 155)
#   `[ "$output" = 'default 7475' ]' failed
# runc spec --rootless (status=0):
# 
# runc run -d --console-socket /tmp/bats-run-1256838/runc.ZUIMwn/tty/sock test_cgroups_unified (status=0):
# 
# runc exec test_cgroups_unified sh -c cat /sys/fs/cgroup/io.bfq.weight (status=1):
# cat: can't open '/sys/fs/cgroup/io.bfq.weight': No such file or directory
# runc exec test_cgroups_unified sh -c cat /sys/fs/cgroup/io.weight (status=1):
# cat: can't open '/sys/fs/cgroup/io.weight': No such file or directory

This PR fixes the above test case, and also adds a separate test case to make sure we don't see the full cgroup hierarchy.

@kolyshkin kolyshkin marked this pull request as ready for review April 20, 2021 00:12
AkihiroSuda
AkihiroSuda previously approved these changes Apr 20, 2021
@AkihiroSuda AkihiroSuda requested a review from mrunalp April 20, 2021 18:38
Move the initialization of Console* fields as they are unconditional.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The code is already passing three parameters around from
mountToRootfs to mountCgroupV* to mountToRootfs again.

I am about to add another parameter, so let's introduce and
use struct mountConfig to pass around.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. s/cgroupPath/dest/

2. don't hardcode /sys/fs/cgroup

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440 runc bind-mounts the whole
/sys/fs/cgroup into container.

Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.

The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.

Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.

Before:

	$ ./runc run aaa
	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup
	/sys/fs/cgroup/user.slice
	/sys/fs/cgroup/user.slice/user-1000.slice
	/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
	...
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 nobody   nogroup          0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	142 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory

After:

	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup/
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 root     root             0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	2 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	577536

[1] opencontainers#2158

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As the rootless cgroup2 mount is now fixed, we can enable this test
for rootless as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This checks that in-container view of /sys/fs/cgroup does not
contain any extra cgroups (which was the case for rootless
before the previous commit).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@mrunalp mrunalp merged commit 42a18e7 into opencontainers:master Apr 23, 2021
@kolyshkin kolyshkin added this to the 1.0.0-rc94 milestone May 3, 2021
pothos added a commit to flatcar-archive/coreos-overlay that referenced this pull request May 19, 2021
The upstream patch (itself backported for rc93) was backported to rc92.
Original cover letter:
	From 14faf1c20948688a48edb9b41367ab07ac11ca91 Mon Sep 17 00:00:00 2001
	From: Aleksa Sarai <cyphar@cyphar.com>
	Date: Wed, 28 Apr 2021 15:44:36 +1000
	Subject: [PATCH 0/5] rootfs: add mount destination validation

	This is a backport of the fix for CVE-2021-30465 to the v1.0.0-rc93
	release. However, because the patch does not apply cleanly it was
	necessary to backport the following commits (from [1]):

	 * deb8a8dd7767 ("libct/newInitConfig: nit")
	 * 1e476578b6cd ("libct/rootfs: introduce and use mountConfig")
	 * 3826db196d59 ("libct/rootfs/mountCgroupV2: minor refactor")
	 * ff692f289b60 ("Fix cgroup2 mount for rootless case")

	And the patch itself was modified to remove hardenings for code which
	didn't exist in v1.0.0-rc93 (in particular, the mount changes in [2]).

	[1]: opencontainers/runc#2818
	[2]: opencontainers/runc#2798

	Aleksa Sarai (1):
	  rootfs: add mount destination validation

	Kir Kolyshkin (4):
	  libct/newInitConfig: nit
	  libct/rootfs: introduce and use mountConfig
	  libct/rootfs/mountCgroupV2: minor refactor
	  Fix cgroup2 mount for rootless case

	 libcontainer/container_linux.go  |  11 +-
	 libcontainer/init_linux.go       |   1 +
	 libcontainer/rootfs_linux.go     | 291 +++++++++++++++++--------------
	 libcontainer/specconv/example.go |  18 +-
	 libcontainer/utils/utils.go      |  54 ++++++
	 libcontainer/utils/utils_test.go |  35 ++++
	 6 files changed, 263 insertions(+), 147 deletions(-)

	--
	2.31.1
pothos added a commit to flatcar-archive/coreos-overlay that referenced this pull request May 19, 2021
The upstream patch (itself backported for rc93) was backported to rc92.
Original cover letter:
	From 14faf1c20948688a48edb9b41367ab07ac11ca91 Mon Sep 17 00:00:00 2001
	From: Aleksa Sarai <cyphar@cyphar.com>
	Date: Wed, 28 Apr 2021 15:44:36 +1000
	Subject: [PATCH 0/5] rootfs: add mount destination validation

	This is a backport of the fix for CVE-2021-30465 to the v1.0.0-rc93
	release. However, because the patch does not apply cleanly it was
	necessary to backport the following commits (from [1]):

	 * deb8a8dd7767 ("libct/newInitConfig: nit")
	 * 1e476578b6cd ("libct/rootfs: introduce and use mountConfig")
	 * 3826db196d59 ("libct/rootfs/mountCgroupV2: minor refactor")
	 * ff692f289b60 ("Fix cgroup2 mount for rootless case")

	And the patch itself was modified to remove hardenings for code which
	didn't exist in v1.0.0-rc93 (in particular, the mount changes in [2]).

	[1]: opencontainers/runc#2818
	[2]: opencontainers/runc#2798

	Aleksa Sarai (1):
	  rootfs: add mount destination validation

	Kir Kolyshkin (4):
	  libct/newInitConfig: nit
	  libct/rootfs: introduce and use mountConfig
	  libct/rootfs/mountCgroupV2: minor refactor
	  Fix cgroup2 mount for rootless case

	 libcontainer/container_linux.go  |  11 +-
	 libcontainer/init_linux.go       |   1 +
	 libcontainer/rootfs_linux.go     | 291 +++++++++++++++++--------------
	 libcontainer/specconv/example.go |  18 +-
	 libcontainer/utils/utils.go      |  54 ++++++
	 libcontainer/utils/utils_test.go |  35 ++++
	 6 files changed, 263 insertions(+), 147 deletions(-)

	--
	2.31.1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants