nsexec: cloned binary rework #3987

cyphar · 2023-08-18T14:56:28Z

Split out from #3953 to be reviewed separately.

First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Then, carry the code from #3983 and combine it with the above to
keep all of the runc-dmz logic in the Go code as well. The runc-dmz
feature is optional (controlled by the runc_nodmz build tag at
build time, or RUNC_DMZ=legacy at runtime), with a fallback to the
classic /proc/self/exe cloning. If /proc/self/exe is already a
cloned binary then neither runc-dmz nor /proc/self/exe are
cloned.

This also adds a memfd-bind helper binary for users which really
need to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).

Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

README.md

libcontainer/container_linux.go

AkihiroSuda · 2023-08-20T14:14:24Z

libcontainer/dmz/cloned_binary_linux.go

+ }
+ path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd()))
+ if err := unix.Access(path, unix.X_OK); err == nil {
+ return true


The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?

We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).

I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock lock contention again....

I think the new RUNC_DMZ=legacy option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/runc-dmz.c

libcontainer/dmz/cloned_binary_linux.go

Makefile

switched to LANG=C.UTF-8

libcontainer/dmz/cloned_binary_linux.go

libcontainer/system/linux.go

libcontainer/dmz/cloned_binary_linux.go

.github/workflows/test.yml

Makefile

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/dmz.go

libcontainer/dmz/dmz_linux.go

libcontainer/dmz/Makefile

libcontainer/dmz/dmz_linux.go

libcontainer/system/linux.go

.github/workflows/test.yml

contrib/cmd/memfd-bind/memfd-bind.go

contrib/cmd/memfd-bind/README.md

libcontainer/container_linux.go

tests/integration/helpers.bash

kolyshkin

LGTM overall; left some nits but those can be addressed in a followup.

lifubang · 2023-09-19T06:00:22Z

libcontainer/dmz/cloned_binary_linux.go

+ if err != nil {
+ return nil, err
+ }
+ copied, err := io.Copy(file, src)


@cyphar Do you really want to use io.Copy instead of unix.Sendfile?

I don't think that's necessary. I only used sendfile(2) in C because it is easier than doing buffered copies. io.Copy is more than fast enough in Go (it might even use sendfile(2) as an optimisation, I'm not sure).

sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.

‘io.Copy’ may needs more 32kb memory than sendfile().

But maybe no need for a 14MB file copy.

I assumed Go had an (*os.File).ReadFrom implementation that used sendfile(2) (which would make io.Copy use it without a buffer) but it turns out they only have one which uses splice(2) (which does efficient copies to and from pipes). I might send a patch to fix that...

I'm not sure whether it'll makes a real difference, I can write a quick benchmark to check...

Actually, it seems that on some systems the difference can be pretty significant. I've added sendfile(2) acceleration.

I assumed Go had an (*os.File).ReadFrom implementation that used sendfile(2) (which would make io.Copy use it without a buffer) but it turns out they only have one which uses splice(2) (which does efficient copies to and from pipes). I might send a patch to fix that...

@cyphar hmm, from what I see, Go runtime uses copy_file_range(2) to copy files since Go 1.15 (https://go-review.googlesource.com/c/go/+/229101). I haven't checked but I assume that copy_file_range(2) is as efficient as sendfile(2) for this case.

We need these to match the Makefile detection of the right gcc for runc-dmz, as well as making sure that everything builds properly for our cross-i386 tests. While we're at it, add x86 to the list of build targets for release builds (presumably nobody will use it, but since we do test builds of this anyway it probably won't hurt). In addition, clean up the handling of the native architecture build by treating it the same as any other build (ensuring that building runc from a different platform will work the same way regardless of the native architecture). In practice, the build works the same way as before. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

This allow us to remove the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Previously, if /var/run was mounted noexec, our cloned binary logic would not work if memfd_create(2) was not available because we would try to exec a binary that is on a noexec filesystem. We cannot guarantee there will be an executable filesystem on the system (other than mounting one ourselves, which would cause a bunch of other headaches) but we can at least try the obvious options (/tmp, /bin, and /). If none of these work, we will have to fail. Reported-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The idea is to remove the need for cloning the entire runc binary by replacing the final execve() call of the container process with an execve() call to a clone of a small C binary which just does an execve() of its arguments. This provides similar protection against CVE-2019-5736 but without requiring a >10MB binary copy for each "runc init". When compiled with musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB which is still quite large). It should be noted that there is still a window where the container processes could get access to the host runc binary, but because we set ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which is not enabled by default in Docker) in order to get around the proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the kernel blocks access entirely for user namespaced containers in this scenario. For those cases we cannot use runc-dmz, but most containers won't have this issue. This new runc-dmz binary can be opted out of at compile time by setting the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy environment variable. In both cases, runc will fall back to the classic /proc/self/exe-based cloning trick. If /proc/self/exe is already a sealed memfd (namely if the user is using contrib/cmd/memfd-bind to create a persistent sealed memfd for runc), neither runc-dmz nor /proc/self/exe cloning will be used because they are not necessary. [1]: torvalds/linux@bfedb58 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [cyphar: address various review nits] [cyphar: fix runc-dmz cross-compilation] [cyphar: embed runc-dmz into runc binary and clone in Go code] [cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning] [cyphar: do not use runc-dmz when the container has certain privs] Co-authored-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

This really isn't ideal but it can be used to avoid the largest issues with the memfd-based runc binary protection. There are several caveats with using this tool, see the help page for the new binary for details. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Due to the way capabilities have to be set by runc, capabilities need to be included in the inheritable and ambient sets anyway. Otherwise, the container process would not have the correct privileges. This test only functioned because adding CAP_DAC_OVERRIDE to the inherited, permissible, and bounding sets means that only "runc init" has these capabilities -- everything other than the bounding set is cleared on the first execve(). This breaks with runc-dmz, but the behaviour was broken from the outset. Docker appears to not handle this properly at all (the logic for capability sets changed with the introduction of ambient capabilities, and while Docker was updated it seems the behaviour is still incorrect for non-root users). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

This results in a 5-20% speedup of dmz.CloneBinary(), depending on the machine. io.Copy: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 139 8075074 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.286s unix.Sendfile: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 192 6382121 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.415s Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

.github/workflows/test.yml

cyphar added this to the 1.2.0 milestone Aug 18, 2023

cyphar marked this pull request as ready for review August 20, 2023 13:58

This was referenced Aug 20, 2023

[Proposal] Use runc-dmz to defeat CVE-2019-5736 #3983

Closed

nsexec: spring cleaning #3953

Closed

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

README.md Outdated Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/container_linux.go Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/dmz/dmz_linux.go Outdated Show resolved Hide resolved

AkihiroSuda reviewed Aug 20, 2023

View reviewed changes

libcontainer/dmz/runc-dmz.c Outdated Show resolved Hide resolved

eiffel-fl reviewed Aug 21, 2023

View reviewed changes

libcontainer/dmz/cloned_binary_linux.go Outdated Show resolved Hide resolved

cyphar requested a review from AkihiroSuda August 21, 2023 16:08

lifubang added the status/2-code-review label Aug 22, 2023

lifubang previously requested changes Aug 22, 2023

View reviewed changes

Makefile Outdated Show resolved Hide resolved

Makefile Show resolved Hide resolved

cyphar requested a review from lifubang August 23, 2023 07:10