kill command fails with read cgroup.procs: operation not supported #3821

Closed
amurzeau opened this issue Apr 8, 2023 · 12 comments

Comments

@amurzeau

amurzeau commented Apr 8, 2023

Description

Hi,

I am testing buildkit inside a Docker container, and its tests use runc.
When trying to kill a runc container, runc errors out with a message like this:
read /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.procs: operation not supported
and then returns exit code 1.
The command line is this one:
runc --root /run/containerd/runc/buildkit --log /tmp/bktest_containerd1141985211/state/io.containerd.runtime.v2.task/buildkit/mxv4shz9kwdm0p5u49mw971ft/log.json --log-format json kill --all mxv4shz9kwdm0p5u49mw971ft 9

Steps to reproduce the issue

  1. Run the buildkit tests on Debian Unstable with rootful Docker (from the docker.io package), using the dev-env target of the Dockerfile at the root of the buildkit git repository.

Describe the results you received and expected

Several tests using containerd fail with this error:

time="2023-04-08T18:18:07Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = process \"sh -c cat /dev/urandom | head -c 100 | sha256sum > /randomfile\" did not complete successfully: failed to delete task rkpae5w5k41u61w4vqnuoh6gy: unknown error after kill: runc did not terminate successfully: exit status 1: read /sys/fs/cgroup/buildkit/rkpae5w5k41u61w4vqnuoh6gy/cgroup.procs: operation not supported\n: unknown"
process "sh -c cat /dev/urandom | head -c 100 | sha256sum > /randomfile" did not complete successfully: failed to delete task rkpae5w5k41u61w4vqnuoh6gy: unknown error after kill: runc did not terminate successfully: exit status 1: read /sys/fs/cgroup/buildkit/rkpae5w5k41u61w4vqnuoh6gy/cgroup.procs: operation not supported
: unknown

What version of runc are you using?

runc version v1.1.5
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4

Host OS information

Host:

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Container (running the dev-env target of the Dockerfile in the buildkit git repository):

NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.17.3
PRETTY_NAME="Alpine Linux v3.17"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"

Host kernel information

Linux DOC-PC3 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19) x86_64 GNU/Linux

@amurzeau
Author

amurzeau commented Apr 8, 2023

I found that the issue is that the cgroup is in threaded mode, and in that case, reading cgroup.procs returns ENOTSUP.

With the following patch applied to runc, the tests pass again and runc no longer fails:

diff --git a/libcontainer/cgroups/utils.go b/libcontainer/cgroups/utils.go
index b32af4ee..70080efd 100644
--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -19,6 +19,7 @@ import (
 
 const (
        CgroupProcesses   = "cgroup.procs"
+       CgroupThreads     = "cgroup.threads"
        unifiedMountpoint = "/sys/fs/cgroup"
        hybridMountpoint  = "/sys/fs/cgroup/unified"
 )
@@ -137,14 +138,16 @@ func GetAllSubsystems() ([]string, error) {
 }
 
 func readProcsFile(dir string) ([]int, error) {
-       f, err := OpenFile(dir, CgroupProcesses, os.O_RDONLY)
+       contents, err := ReadFile(dir, CgroupProcesses)
+       if errors.Is(err, unix.ENOTSUP) {
+               contents, err = ReadFile(dir, CgroupThreads)
+       }
        if err != nil {
                return nil, err
        }
-       defer f.Close()
 
        var (
-               s   = bufio.NewScanner(f)
+               s   = bufio.NewScanner(strings.NewReader(contents))
                out = []int{}
        )
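
As a standalone illustration of the fallback in the patch above, here is a minimal sketch that uses only the Go standard library instead of runc's internal OpenFile/ReadFile helpers. The names parseIDs and readIDsWithFallback are hypothetical and are not part of runc. Keep in mind that cgroup.threads lists thread IDs rather than process IDs, so the two files are not interchangeable in general.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// parseIDs converts the newline-separated contents of a cgroup.procs or
// cgroup.threads file into a slice of integer IDs. (Hypothetical helper.)
func parseIDs(contents string) ([]int, error) {
	out := []int{}
	s := bufio.NewScanner(strings.NewReader(contents))
	for s.Scan() {
		t := strings.TrimSpace(s.Text())
		if t == "" {
			continue // skip blank lines
		}
		id, err := strconv.Atoi(t)
		if err != nil {
			return nil, err
		}
		out = append(out, id)
	}
	return out, s.Err()
}

// readIDsWithFallback reads dir/cgroup.procs and, if the kernel rejects the
// read with ENOTSUP (as it does for a threaded cgroup), retries with
// dir/cgroup.threads. The fallback result contains thread IDs, not PIDs.
func readIDsWithFallback(dir string) ([]int, error) {
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	if errors.Is(err, syscall.ENOTSUP) {
		data, err = os.ReadFile(filepath.Join(dir, "cgroup.threads"))
	}
	if err != nil {
		return nil, err
	}
	return parseIDs(string(data))
}

func main() {
	// The path is only an example; any cgroup v2 directory can be used.
	ids, err := readIDsWithFallback("/sys/fs/cgroup")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(len(ids), "IDs found")
}
```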

Here is the type of the cgroups (these commands were run inside buildkit's dev-env container):

# cat /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.type
threaded
# cat /sys/fs/cgroup/buildkit/cgroup.type
threaded
# cat /sys/fs/cgroup/cgroup.type
domain threaded
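
The same check can be done programmatically before touching cgroup.procs. The following is a minimal sketch, assuming a cgroup v2 hierarchy; isThreadedType and isThreaded are hypothetical names, not runc functions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isThreadedType reports whether the contents of a cgroup.type file
// describe a threaded cgroup. "domain threaded" is a thread root, not a
// threaded cgroup, so it returns false for that value.
func isThreadedType(contents string) bool {
	return strings.TrimSpace(contents) == "threaded"
}

// isThreaded reads dir/cgroup.type and reports whether dir is threaded.
func isThreaded(dir string) (bool, error) {
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.type"))
	if err != nil {
		return false, err
	}
	return isThreadedType(string(data)), nil
}

func main() {
	// Default path is only an example; pass a cgroup directory as the
	// first argument to inspect a specific cgroup.
	dir := "/sys/fs/cgroup"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	threaded, err := isThreaded(dir)
	if err != nil {
		fmt.Println("cannot read cgroup.type:", err)
		return
	}
	fmt.Printf("%s threaded: %v\n", dir, threaded)
}
```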

@amurzeau changed the title from "read cgroup.procs: operation not supported" to "kill command fails with read cgroup.procs: operation not supported" on Apr 8, 2023
@Bacto

Bacto commented Jan 9, 2024

Hi,
I have the same issue (with runc 1.1.10).
Having this patch applied to the next version would be awesome!

@kolyshkin
Contributor

@Bacto we've changed this part of runc a lot in the main branch. Can you try to repro this using runc compiled from the main branch?

@Bacto

Bacto commented Jan 17, 2024

Hi @kolyshkin,

I tried with the main branch and got the same issue:

# runc -v
runc version 1.1.0+dev
commit: 0c5a735
spec: 1.1.0+dev
go: go1.21.6
libseccomp: 2.5.5

@kolyshkin
Contributor

> Here is the type of the cgroups (these commands were run inside buildkit's dev-env container):
>
> # cat /sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft/cgroup.type
> threaded
> # cat /sys/fs/cgroup/buildkit/cgroup.type
> threaded
> # cat /sys/fs/cgroup/cgroup.type
> domain threaded

So the problem here is threaded cgroup type. In this case, processes actually belong to the cgroup parent which has "domain threaded" type (i.e. top cgroup in this case). It would be incorrect to send SIGKILL to specific threads in this group. So, basically, runc kill does the right thing here returning an error.
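
To make the point about the parent concrete, here is a minimal sketch that walks up from a threaded cgroup to the nearest ancestor that is not threaded, which is where the processes belong. It is not runc code: threadRoot, typeReader and mapReader are hypothetical names, and the in-memory map stands in for reading the real cgroup.type files so the example runs anywhere.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// typeReader returns the contents of cgroup.type for a cgroup directory.
type typeReader func(dir string) (string, error)

// mapReader builds a typeReader backed by an in-memory map, so the sketch
// does not need a real cgroup hierarchy.
func mapReader(types map[string]string) typeReader {
	return func(dir string) (string, error) {
		t, ok := types[dir]
		if !ok {
			return "", fmt.Errorf("no cgroup.type for %s", dir)
		}
		return t, nil
	}
}

// threadRoot walks from dir towards the filesystem root and returns the
// first directory whose type is not "threaded".
func threadRoot(dir string, read typeReader) (string, error) {
	for {
		t, err := read(dir)
		if err != nil {
			return "", err
		}
		if strings.TrimSpace(t) != "threaded" {
			return dir, nil
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			return "", errors.New("no non-threaded ancestor found")
		}
		dir = parent
	}
}

func main() {
	// Hierarchy taken from the cgroup.type output quoted in this issue.
	read := mapReader(map[string]string{
		"/sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft": "threaded",
		"/sys/fs/cgroup/buildkit":                           "threaded",
		"/sys/fs/cgroup":                                    "domain threaded",
	})
	root, err := threadRoot("/sys/fs/cgroup/buildkit/mxv4shz9kwdm0p5u49mw971ft", read)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("processes belong to:", root)
}
```

Reading cgroup.procs of the directory returned here would list every process in the threaded subtree, not just those of one container, which is why signalling based on it would be too broad.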

This is some kind of a misconfiguration, possibly caused by buildkit.

@kolyshkin
Contributor

I created a Debian 12 VM, checked out buildkit, and ran its test suite inside a container (make test). I was not able to reproduce the issue.

I think there was something wrong originally when starting a container.

I would still like to get to the bottom of it, so any suggestions on how to reproduce it (ideally a Vagrantfile or something similar) are welcome.

@amurzeau
Author

amurzeau commented Jun 3, 2024

The issue is fixed in the main branch.
I retried version 1.1.5 and reproduced it, but I can no longer reproduce it with the main branch of runc.

I tried to find the first fixed version: I can still reproduce the issue with 1.1.12, but not with 1.2.0-rc.1.

So I'm closing this issue.

For reference, I'm using go test -v -run ^TestIntegration/TestDiffSingleLayer.*$ github.com/moby/buildkit/client -count=1 to run affected tests in buildkit with the tested runc in /usr/bin/runc.

@amurzeau closed this as completed on Jun 3, 2024
@cathaysia

After upgrading runc, Docker still reports this error:

# docker info
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.2.0-rc.1-0-g275e6d85

error:

java.io.IOException: Failed to run top 'a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6'. Error: Error response from daemon: runc did not terminate successfully: exit status 1: unable to get all container pids: read /sys/fs/cgroup/docker/a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6/cgroup.procs: operation not supported
# cat /sys/fs/cgroup/docker/a77dc1be093aebb8a8f18fd634adc2ebbf2d798a7c7e8e7aa5283770b1efd9b6/cgroup.type
threaded

@cathaysia

I tried 8256a93, which fixes this problem.

@kolyshkin
Contributor

> I tried 8256a93, which fixes this problem.

I guess you quoted a wrong commit.

@kolyshkin
Contributor

@amurzeau could you run git bisect to find which runc commit fixes it?

@amurzeau
Author

amurzeau commented Jun 25, 2024

The first commit without the issue is f8ad20f.
The previous one, 9583b3d, still causes the same failure.

The failure occurs with this stack trace:

runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/opencontainers/runc/libcontainer/cgroups.readProcsFile({0xc0000f2d40?, 0xc0000f30c0?})
        /tmp/runc/libcontainer/cgroups/utils.go:166 +0x372
github.com/opencontainers/runc/libcontainer/cgroups.GetAllPids.func1({0xc0000f2d40, 0x31}, {0x399dc0?, 0xc0001695b0?}, {0x0?, 0x0?})
        /tmp/runc/libcontainer/cgroups/getallpids.go:19 +0x79
path/filepath.walkDir({0xc0000f2d40, 0x31}, {0x399dc0, 0xc0001695b0}, 0xc000135180)
        /usr/local/go/src/path/filepath/path.go:445 +0x5c
path/filepath.WalkDir({0xc0000f2d40, 0x31}, 0xc000135180)
        /usr/local/go/src/path/filepath/path.go:535 +0xb0
github.com/opencontainers/runc/libcontainer/cgroups.GetAllPids({0xc0000f2d40?, 0x6?})
        /tmp/runc/libcontainer/cgroups/getallpids.go:12 +0x4e
github.com/opencontainers/runc/libcontainer/cgroups/fs2.(*Manager).GetAllPids(0xc0000feb60?)
        /tmp/runc/libcontainer/cgroups/fs2/fs2.go:92 +0x25
github.com/opencontainers/runc/libcontainer.signalAllProcesses({0x39d1c0, 0xc0000feb60}, 0x0?)
        /tmp/runc/libcontainer/init_linux.go:583 +0xad
github.com/opencontainers/runc/libcontainer.(*Container).Signal(0xc0000bb220, {0x398770?, 0xaaa8a8}, 0x1)
        /tmp/runc/libcontainer/container_linux.go:383 +0x265
main.glob..func7(0xc0000c8580)
        /tmp/runc/kill.go:52 +0x113
github.com/urfave/cli.HandleAction({0x2467a0?, 0x323b50?}, 0x4?)
        /tmp/runc/vendor/github.com/urfave/cli/app.go:524 +0x50
github.com/urfave/cli.Command.Run({{0x2dfe61, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x307052, 0x52}, {0x0, ...}, ...}, ...)
        /tmp/runc/vendor/github.com/urfave/cli/command.go:175 +0x67b
github.com/urfave/cli.(*App).Run(0xc0000ea380, {0xc0000b4000, 0xb, 0xb})
        /tmp/runc/vendor/github.com/urfave/cli/app.go:277 +0xb87
main.main()
        /tmp/runc/main.go:165 +0x1208

The commit that fixes the issue (f8ad20f) removes the call to c.ignoreCgroupError(signalAllProcesses(c.cgroupManager, sig)), which appears in the stack trace above.

I think this can be reproduced with this bundle:
runctest_no_pid_namespace.tar.gz

To test: cd runctest && ./test.sh
The bundle's rootfs contains only a busybox binary at usr/bin/sh and usr/bin/sleep, plus its dynamic linker dependency if needed (/lib/ld-whatever.so).

Running runc kill --all yields this error at commit 9583b3d:

ERRO[0000] read /sys/fs/cgroup/buildkit/runctest/cgroup.procs: operation not supported

buildkit and containerd use a PID namespace, so after f8ad20f signalAllProcesses is no longer called.
Without a PID namespace (as in my runctest bundle), however, it is still called:

if s == unix.SIGKILL && !c.config.Namespaces.IsPrivate(configs.NEWPID) {
	err = signalAllProcesses(c.cgroupManager, unix.SIGKILL)
} else {

And thus it still triggers the error (so I'm not sure the commit really fixes the issue):

ERRO[0000] unable to signal init: read /sys/fs/cgroup/buildkit/runctest/cgroup.procs: operation not supported

Note: I'm running this test in a docker rootful container.
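
The condition quoted above can be restated as a small predicate. This is only a sketch of the decision, not runc's code: shouldSignalAll is a hypothetical name, and the privatePIDNS flag stands in for c.config.Namespaces.IsPrivate(configs.NEWPID).

```go
package main

import (
	"fmt"
	"syscall"
)

// shouldSignalAll reports whether every process in the container's cgroup
// would be signalled individually: only for SIGKILL, and only when the
// container does not have its own PID namespace. (Hypothetical helper.)
func shouldSignalAll(sig syscall.Signal, privatePIDNS bool) bool {
	return sig == syscall.SIGKILL && !privatePIDNS
}

func main() {
	fmt.Println("SIGKILL, private PID namespace:", shouldSignalAll(syscall.SIGKILL, true))
	fmt.Println("SIGKILL, shared PID namespace: ", shouldSignalAll(syscall.SIGKILL, false))
	fmt.Println("SIGTERM, shared PID namespace: ", shouldSignalAll(syscall.SIGTERM, false))
}
```

Under this reading, only a container that shares its PID namespace still walks the cgroup on SIGKILL, and that walk is the code path that reads cgroup.procs.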
