-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsexec: cloned binary rework #3987
Conversation
} | ||
path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd())) | ||
if err := unix.Access(path, unix.X_OK); err == nil { | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp
program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).
I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock
lock contention again....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new RUNC_DMZ=legacy
option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall; left some nits but those can be addressed in a followup.
if err != nil { | ||
return nil, err | ||
} | ||
copied, err := io.Copy(file, src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar Do you really want to use io.Copy
instead of unix.Sendfile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's necessary. I only used sendfile(2)
in C because it is easier than doing buffered copies. io.Copy
is more than fast enough in Go (it might even use sendfile(2)
as an optimisation, I'm not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
- ‘io.Copy’ may needs more 32kb memory than sendfile().
But maybe no need for a 14MB file copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed Go had an (*os.File).ReadFrom
implementation that used sendfile(2)
(which would make io.Copy
use it without a buffer) but it turns out they only have one which uses splice(2)
(which does efficient copies to and from pipes). I might send a patch to fix that...
I'm not sure whether it'll makes a real difference, I can write a quick benchmark to check...
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it seems that on some systems the difference can be pretty significant. I've added sendfile(2)
acceleration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed Go had an (*os.File).ReadFrom implementation that used sendfile(2) (which would make io.Copy use it without a buffer) but it turns out they only have one which uses splice(2) (which does efficient copies to and from pipes). I might send a patch to fix that...
@cyphar hmm, from what I see, Go runtime uses copy_file_range(2)
to copy files since Go 1.15 (https://go-review.googlesource.com/c/go/+/229101). I haven't checked but I assume that copy_file_range(2)
is as efficient as sendfile(2)
for this case.
We need these to match the Makefile detection of the right gcc for runc-dmz, as well as making sure that everything builds properly for our cross-i386 tests. While we're at it, add x86 to the list of build targets for release builds (presumably nobody will use it, but since we do test builds of this anyway it probably won't hurt). In addition, clean up the handling of the native architecture build by treating it the same as any other build (ensuring that building runc from a different platform will work the same way regardless of the native architecture). In practice, the build works the same way as before. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This allow us to remove the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Previously, if /var/run was mounted noexec, our cloned binary logic would not work if memfd_create(2) was not available because we would try to exec a binary that is on a noexec filesystem. We cannot guarantee there will be an executable filesystem on the system (other than mounting one ourselves, which would cause a bunch of other headaches) but we can at least try the obvious options (/tmp, /bin, and /). If none of these work, we will have to fail. Reported-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The idea is to remove the need for cloning the entire runc binary by replacing the final execve() call of the container process with an execve() call to a clone of a small C binary which just does an execve() of its arguments. This provides similar protection against CVE-2019-5736 but without requiring a >10MB binary copy for each "runc init". When compiled with musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB which is still quite large). It should be noted that there is still a window where the container processes could get access to the host runc binary, but because we set ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which is not enabled by default in Docker) in order to get around the proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the kernel blocks access entirely for user namespaced containers in this scenario. For those cases we cannot use runc-dmz, but most containers won't have this issue. This new runc-dmz binary can be opted out of at compile time by setting the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy environment variable. In both cases, runc will fall back to the classic /proc/self/exe-based cloning trick. If /proc/self/exe is already a sealed memfd (namely if the user is using contrib/cmd/memfd-bind to create a persistent sealed memfd for runc), neither runc-dmz nor /proc/self/exe cloning will be used because they are not necessary. [1]: torvalds/linux@bfedb58 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [cyphar: address various review nits] [cyphar: fix runc-dmz cross-compilation] [cyphar: embed runc-dmz into runc binary and clone in Go code] [cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning] [cyphar: do not use runc-dmz when the container has certain privs] Co-authored-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This really isn't ideal but it can be used to avoid the largest issues with the memfd-based runc binary protection. There are several caveats with using this tool, see the help page for the new binary for details. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Due to the way capabilities have to be set by runc, capabilities need to be included in the inheritable and ambient sets anyway. Otherwise, the container process would not have the correct privileges. This test only functioned because adding CAP_DAC_OVERRIDE to the inherited, permissible, and bounding sets means that only "runc init" has these capabilities -- everything other than the bounding set is cleared on the first execve(). This breaks with runc-dmz, but the behaviour was broken from the outset. Docker appears to not handle this properly at all (the logic for capability sets changed with the introduction of ambient capabilities, and while Docker was updated it seems the behaviour is still incorrect for non-root users). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This results in a 5-20% speedup of dmz.CloneBinary(), depending on the machine. io.Copy: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 139 8075074 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.286s unix.Sendfile: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 192 6382121 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.415s Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Split out from #3953 to be reviewed separately.
First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Then, carry the code from #3983 and combine it with the above to
keep all of the
runc-dmz
logic in the Go code as well. Therunc-dmz
feature is optional (controlled by the
runc_nodmz
build tag atbuild time, or
RUNC_DMZ=legacy
at runtime), with a fallback to theclassic
/proc/self/exe
cloning. If/proc/self/exe
is already acloned binary then neither
runc-dmz
nor/proc/self/exe
arecloned.
This also adds a
memfd-bind
helper binary for users which reallyneed to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).
Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai cyphar@cyphar.com