[CVE-2019-5736]: Runc uses more memory during start up after the fix #1980
Comments
Probably. This is something that has always been quite difficult to do sanely -- because all container limits are applied while runc init is still running inside the container's cgroups, so the copied binary is counted against the container's memory limit. But yes, this would definitely be caused by the copying procedure. One idea I had was to create a temporary overlayfs such that the binary would not be overwritable -- but that has a lot of other issues that made it implausible. |
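For illustration, here is a minimal sketch of the cloning step the fix performs: /proc/self/exe is copied into an anonymous, sealed memfd, and those pages are anonymous memory charged against the container's memory cgroup. Error handling is omitted, and it assumes a glibc new enough to provide the memfd_create(2) wrapper (older glibc needs the raw-syscall wrapper visible in the diff further down):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

#define RUNC_MEMFD_SEALS \
	(F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Copy /proc/self/exe into a sealed memfd and return the memfd. */
static int clone_self(void)
{
	struct stat st;
	int binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	int memfd = memfd_create("runc_cloned:/proc/self/exe",
				 MFD_CLOEXEC | MFD_ALLOW_SEALING);

	/* The whole binary is copied -- this is where the extra memory
	 * charged to the container comes from. */
	fstat(binfd, &st);
	sendfile(memfd, binfd, NULL, st.st_size);
	close(binfd);

	/* Seal the copy so nothing can modify it afterwards. */
	fcntl(memfd, F_ADD_SEALS, RUNC_MEMFD_SEALS);
	return memfd;
}
```

runc then execs the sealed copy (e.g. via /proc/self/fd/N) instead of the host binary, which is what makes the host binary unreachable from the container.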
Also got the memory problem: my runc binary is 11M, and one of our tests sets a 10m memory limit to run with; runc create fails with an error. |
This is a regression. I gathered some data with Docker.

Test Environment
$ docker version
Client:
Version: 18.09.0
API version: 1.39
Go version: go1.11.2
Git commit: 4d60db4
Built: Wed Jan 23 19:35:04 2019
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.09.0
API version: 1.39 (minimum version 1.12)
Go version: go1.11.2
Git commit: 4d60db4
Built: Wed Jan 23 19:34:06 2019
OS/Arch: linux/amd64
Experimental: false
Dynamically linked binaries are built with
$ go version
go version go1.11.2 linux/amd64

Test Result
# Memory limit too low, can't even create the container.
$ docker run -m=4m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/5031f5b2cf99b84da41e24836524fb4ae736d6bc4886ce2e0e75c1f43820803f/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
# Memory limit is high enough to create the container, but the init process gets OOM killed right away.
$ docker run -m=5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.
# Memory limit is enough to run the container.
$ docker run -m=6m busybox ls
# ok
$ docker run -m=15m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 15938355 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/6650dbb7f2a8fc8ae58b3ce3d365be8c716321d253d8d29f140e1d45e0dfa818/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
$ docker run -m=15.5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.
$ docker run -m=16m busybox ls
# ok
$ docker run -m=4m busybox ls
# ok
$ docker run -m=9.2m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 9646899 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/bf1ec70dc29ee0a823ba6aebf2a88633cd875e08715f12651451073e86437fc3/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
$ docker run -m=9.3m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.
$ docker run -m=10m busybox ls
# ok

Conclusion

We need to set a higher memory limit for the container to run, and the minimum workable limit is larger than the runc binary size. |
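As a rough way to estimate that minimum, the copy overhead is essentially the size of the binary itself; a tiny sketch (the path is hypothetical, adjust it to your install):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Rough lower bound on the extra memory the fix charges to a container:
 * the size of the runc binary that gets cloned at startup. */
int main(void)
{
	struct stat st;

	if (stat("/usr/sbin/runc", &st) == 0)
		printf("runc copy overhead: %lld bytes\n",
		       (long long)st.st_size);
	return 0;
}
```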
A good long-term fix is to move all of the runc init code to C. This should be fairly simple, as most of it is system-level syscalls. The cgroups code can remain in Go, since cgroups are set up by the calling process. |
would that be a separate binary? |
@giuseppe I was thinking the same binary; we just never allow it to exec into the Go runtime. Maybe that is not possible and we will have the same issue -- it's something we would have to look into. I know you have a C implementation; it would be interesting to see where we can combine the two, as there are still some areas that are easier to write and maintain in Go, and others where C makes more sense, like the init. |
that would be great. If there is anything I can do to help out, just let me know :-) |
If we restrict the init process in the container so that it can't access /proc/self/exe, can CVE-2019-5736 be blocked? |
@keloyang The problem is that you cannot safely verify (in userspace) whether or not the binary you are about to execute is /proc/self/exe. |
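To make the race concrete, here is a sketch of why any such userspace check is time-of-check/time-of-use racy (a hypothetical helper, not runc code):

```c
#include <sys/stat.h>
#include <unistd.h>

/* Naive attempt: refuse to exec anything that is /proc/self/exe. */
int checked_exec(const char *path, char *const argv[], char *const envp[])
{
	struct stat self, target;

	if (stat("/proc/self/exe", &self) < 0 || stat(path, &target) < 0)
		return -1;
	if (self.st_dev == target.st_dev && self.st_ino == target.st_ino)
		return -1; /* looks like ourselves -- refuse */

	/* Between the stat(2) above and the execve(2) below, a process in
	 * the container can swap `path` (a symlink, a bind mount, ...) so
	 * that it now resolves to /proc/self/exe. The check proves nothing. */
	return execve(path, argv, envp);
}
```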
This seems to be affecting many people. I heard that if we put the runc binary on a read-only filesystem, the overwrite attack is not possible. In this case, is it possible to opt out of the fix? |
Yes, it would be possible to add the ability to opt out of the fix -- but I'm worried about what happens if someone decides to remount the filesystem as read-write. There's no way for us to deal with that (luckily I am currently working on a patch which will expand the |
@cyphar Actually, I tried hard-coding it to completely skip the memfd_create path (see the diff below), but the memory usage is still the same.
Diff:
index c8a42c23..1817bc72 100644
--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -32,23 +32,6 @@
#include <sys/sendfile.h>
#include <sys/syscall.h>
-/* Use our own wrapper for memfd_create. */
-#if !defined(SYS_memfd_create) && defined(__NR_memfd_create)
-# define SYS_memfd_create __NR_memfd_create
-#endif
-#ifdef SYS_memfd_create
-# define HAVE_MEMFD_CREATE
-/* memfd_create(2) flags -- copied from <linux/memfd.h>. */
-# ifndef MFD_CLOEXEC
-# define MFD_CLOEXEC 0x0001U
-# define MFD_ALLOW_SEALING 0x0002U
-# endif
-int memfd_create(const char *name, unsigned int flags)
-{
- return syscall(SYS_memfd_create, name, flags);
-}
-#endif
-
/* This comes directly from <linux/fcntl.h>. */
#ifndef F_LINUX_SPECIFIC_BASE
# define F_LINUX_SPECIFIC_BASE 1024
@@ -65,11 +48,6 @@ int memfd_create(const char *name, unsigned int flags)
#endif
#define RUNC_SENDFILE_MAX 0x7FFFF000 /* sendfile(2) is limited to 2GB. */
-#ifdef HAVE_MEMFD_CREATE
-# define RUNC_MEMFD_COMMENT "runc_cloned:/proc/self/exe"
-# define RUNC_MEMFD_SEALS \
- (F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)
-#endif
static void *must_realloc(void *ptr, size_t size)
{
@@ -93,15 +71,10 @@ static int is_self_cloned(void)
if (fd < 0)
return -ENOTRECOVERABLE;
-#ifdef HAVE_MEMFD_CREATE
- ret = fcntl(fd, F_GET_SEALS);
- is_cloned = (ret == RUNC_MEMFD_SEALS);
-#else
struct stat statbuf = {0};
ret = fstat(fd, &statbuf);
if (ret >= 0)
is_cloned = (statbuf.st_nlink == 0);
-#endif
close(fd);
return is_cloned;
}
@@ -203,11 +176,7 @@ static int clone_binary(void)
int binfd, memfd;
ssize_t sent = 0;
-#ifdef HAVE_MEMFD_CREATE
- memfd = memfd_create(RUNC_MEMFD_COMMENT, MFD_CLOEXEC | MFD_ALLOW_SEALING);
-#else
memfd = open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0711);
-#endif
if (memfd < 0)
return -ENOTRECOVERABLE;
@@ -220,11 +189,6 @@ static int clone_binary(void)
if (sent < 0)
goto error;
-#ifdef HAVE_MEMFD_CREATE
- int err = fcntl(memfd, F_ADD_SEALS, RUNC_MEMFD_SEALS);
- if (err < 0)
- goto error;
-#else
/* Need to re-open "memfd" as read-only to avoid execve(2) giving -ETXTBSY. */
int newfd;
char *fdpath = NULL;
@@ -238,7 +202,6 @@ static int clone_binary(void)
close(memfd);
memfd = newfd;
-#endif
return memfd;
error: |
Yes, you're right -- I incorrectly assumed the O_TMPFILE copy would be charged differently; it is still charged against the container's memory cgroup. There is a way to do it without using extra memory (and I proposed it internally when we were discussing solutions for this vulnerability), but it has the downside that it can't be done with rootless containers, and it's a bit ugly: we create a temporary overlayfs mount over the directory containing the runc binary, so the real binary can never be written to. The main downside is that we would then require overlayfs support, and in the case of rootless containers we'd need to make a copy anyway. Not to mention we'd have to have some pretty ugly code to get it all to work -- since we need to set up the mount namespace before we've started doing any operations with the containers' namespaces (otherwise we're poisoning the host mount namespace). But as @crosbymichael said, if we separate out the init into its own small C binary, the copy becomes much cheaper. |
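Roughly, the overlayfs idea described above looks like this (a sketch with hypothetical paths; the upper/work directories must be created beforehand, and it needs CAP_SYS_ADMIN in the host mount namespace, hence no rootless support):

```c
#include <sys/mount.h>

/* Expose the directory holding the runc binary through an overlay whose
 * upper layer is throwaway: writes via /proc/self/exe trigger a copy-up
 * into the upper layer and never reach the real binary below. */
int protect_runc_dir(void)
{
	return mount("overlay", "/run/runc-protected", "overlay", 0,
		     "lowerdir=/usr/sbin,"
		     "upperdir=/run/runc-ovl/upper,"
		     "workdir=/run/runc-ovl/work");
}
```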
Another idea would be to use |
It would still be better to find a way to eliminate this. :/ If we really can't find a better solution, we should at least broadly advertise it, so that users know they should increase their memory limits. :) In the new GKE release we are going to mention this, because our Ubuntu image is going to carry the runc fix. We may need a better channel to advertise it, e.g. the runc release notes, a tweet? |
@cyphar I have been arguing for a while that we should have a way to make regular files content-immutable. |
Or another approach, maybe: have runc fork off a once-per-uid (lockfile in |
While that would somewhat solve the problem, I really don't like that solution (and I'm worried about what happens if the service is restarted -- how sure are you that the attacking process won't get a chance to overwrite the host binary before it's killed?). Personally, I think #1984 could be changed so that instead of copying the binary we protect it with a read-only bind-mount. |
How about copying the runc binary file out to a temporary file on disk first? I have tested it on my server, based on the v1.0.0-rc5 version; the memory usage is the same as before.
But I'm not sure it works for fixing CVE-2019-5736. |
That is precisely what #1984 does -- and the current "no memfd_create(2)" fallback path works the same way. |
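Pulling the fallback fragments from the diff above into one piece, the non-memfd_create path looks roughly like this (error handling omitted): copy into an unlinked O_TMPFILE, then re-open it read-only through /proc/self/fd so a later execve(2) doesn't fail with -ETXTBSY:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int clone_self_tmpfile(void)
{
	char fdpath[64];
	struct stat st;
	int newfd;
	int binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	/* Unlinked temporary file: st_nlink == 0, which is exactly what
	 * the non-memfd is_self_cloned() branch checks for. */
	int memfd = open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0711);

	fstat(binfd, &st);
	sendfile(memfd, binfd, NULL, st.st_size);
	close(binfd);

	/* Re-open read-only via /proc/self/fd to avoid -ETXTBSY on exec. */
	snprintf(fdpath, sizeof(fdpath), "/proc/self/fd/%d", memfd);
	newfd = open(fdpath, O_RDONLY | O_CLOEXEC);
	close(memfd);
	return newfd;
}
```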
@cyphar SG! And actually we don't see this problem in Ubuntu yet, and maybe just because as you said |
With the latest patch #1984 (dynamic build), most of the time it works fine, but there is about a 1 in 10 to 1 in 30 chance of getting OOM-killed.

Build:

With -m 4m (about a 1 in 10 chance of failure):

With -m 6m (about a 1 in 30 chance of failure): |
@lifubang By default it's still using the in-memory copy (memfd_create(2)). |
Thank you for your work. |
One approach could be to verify that the immutable attribute is set on the runc binary (and skip the copy if so, or warn/error if not?). |
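A sketch of such a check (a hypothetical helper; FS_IOC_GETFLAGS is the ioctl behind lsattr(1)/chattr(1)):

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if chattr +i is set on the runc binary, 0 if not, and a
 * negative value on error (e.g. the filesystem has no attribute support). */
int runc_binary_is_immutable(void)
{
	int attr, ret;
	int fd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	ret = ioctl(fd, FS_IOC_GETFLAGS, &attr);
	close(fd);
	if (ret < 0)
		return -1;
	return !!(attr & FS_IMMUTABLE_FL); /* skip the clone if this is 1 */
}
```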
The easier solution is hopefully going to be to evaluate whether we can delay cgroup application until after the copy -- which should be possible and would be much more fool-proof than that. |
@cyphar Are you working on this? |
I just pushed an update to #1984 -- it uses a read-only bind-mount. It works, and adds nothing to memory usage when it can be used. For rootless containers it currently won't work, but that's a fairly niche use case for now. So this problem has been solved -- with the caveat that rootless containers still fall back to copying the binary. |
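The shape of that trick, sketched (not the literal #1984 code): bind-mount the binary over itself, then remount that bind read-only, so nothing in that mount namespace can re-open it for writing. It needs CAP_SYS_ADMIN, which is why rootless containers can't use it:

```c
#include <sys/mount.h>

int make_binary_ro(const char *path)
{
	/* A bind mount of the file over itself... */
	if (mount(path, path, "", MS_BIND, NULL) < 0)
		return -1;
	/* ...then flip that single mount to read-only. */
	return mount("", path, "", MS_REMOUNT | MS_BIND | MS_RDONLY, NULL);
}
```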
Rootless containers don't need the fix anyway, since a process running as non-root cannot overwrite a file owned by root. I guess if you were running a runc binary from your home directory this could be a problem, but I know of no one who does this. |
Since the whole purpose of (and initial justification for me to work on) rootless containers is precisely to be able to do that, and not require an admin to install binaries for you (if they can install binaries for you, they can install a setuid binary for you too -- defeating the purpose), the fix is needed regardless. The usernetes distribution works like this, and I hope it will eventually gain wider usage, but I definitely want to protect rootless containers. (These days, "rootless containers" has become a misnomer -- the original idea was that no administrator intervention was required. That's still possible today, but it really doesn't help when the concept isn't agreed upon and the default actually uses setuid binaries, contrary to the whole point.) |
For the rootless implementations in buildah and podman, we create the user namespace before we call into the OCI runtime, so the remount to read-only should just work. Usernetes does the same thing. |
Ah right, because they set up the user namespace before exec'ing the runtime. However, the fix is still definitely needed, because the "canonical" usernetes installation has the binaries in a directory writable by the user. |
We observed higher memory usage (likely during container startup) after the fix for CVE-2019-5736 (commit 0a8e411).
We had a test that specifies a 10m container cgroup limit, which never failed before, but now the container gets OOM-killed a lot. For example: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-containerd-node-e2e-1-2/2500
It seems to be caused by the memory spike introduced by the binary copy. Should we always enforce a minimum memory limit for runc containers in the future?
My runc binary is statically linked: