
[CVE-2019-5736]: Runc uses more memory during start up after the fix #1980

Closed
Random-Liu opened this issue Feb 13, 2019 · 35 comments · Fixed by #1984

Comments

@Random-Liu commented Feb 13, 2019

We observed higher memory usage (likely during container startup) after the fix for the CVE (0a8e411).

We have a test that specifies a 10m container cgroup limit. It never failed before, but now the container gets OOM-killed a lot. For example: https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-containerd-node-e2e-1-2/2500.

kernel: runc:[2:INIT] invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=998
kernel: runc:[2:INIT] cpuset=80e651c417ebd71d83e5023ee59b281e585497468bd71ee7c7b3ae6730d9ec8f mems_allowed=0
kernel: CPU: 0 PID: 333 Comm: runc:[2:INIT] Not tainted 4.4.64+ #1
kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
kernel:  0000000000000000 ffff880003e87ca8 ffffffff9f317334 ffff880003e87d88
kernel:  ffff8800bb3e8000 ffff880003e87d18 ffffffff9f1a8fb4 ffff880003e87ce0
kernel:  ffffffff9f13e780 ffff8800bb3eb500 0000000000000206 ffff880003e87cf0
kernel: Call Trace:
kernel:  [<ffffffff9f317334>] dump_stack+0x63/0x8f
kernel:  [<ffffffff9f1a8fb4>] dump_header+0x65/0x1d4
kernel:  [<ffffffff9f13e780>] ? find_lock_task_mm+0x20/0xb0
kernel:  [<ffffffff9f13ef1d>] oom_kill_process+0x28d/0x430
kernel:  [<ffffffff9f1a3e6b>] ? mem_cgroup_iter+0x1db/0x390
kernel:  [<ffffffff9f1a6374>] mem_cgroup_out_of_memory+0x284/0x2d0
kernel:  [<ffffffff9f1a6de9>] mem_cgroup_oom_synchronize+0x2f9/0x310
kernel:  [<ffffffff9f1a1ab0>] ? memory_high_write+0xc0/0xc0
kernel:  [<ffffffff9f13f5f8>] pagefault_out_of_memory+0x38/0xa0
kernel:  [<ffffffff9f045a27>] mm_fault_error+0x77/0x150
kernel:  [<ffffffff9f046264>] __do_page_fault+0x414/0x420
kernel:  [<ffffffff9f046292>] do_page_fault+0x22/0x30
kernel:  [<ffffffff9f5b1f98>] page_fault+0x28/0x30

It seems to be caused by the memory spike introduced by the binary copy. Should we always enforce a minimum memory limit for runc containers in the future?

My runc binary is statically linked:

$ ls -alh usr/local/sbin/runc 
-rwxr-xr-x 1 lantaol primarygroup 7.8M Feb 12 13:46 usr/local/sbin/runc
@cyphar (Member) commented Feb 13, 2019

Should we always enforce a minimum memory limit for runc containers in the future?

Probably. This is something that has always been quite difficult to do sanely -- because all container limits are applied while runc is still executing. While we do get out of the way eventually, the only safe way of constraining the user process is to limit ourselves first -- so knowing what the minimum limit should be is quite difficult (very low pids limits are hard to set). You could try to do a runc update afterwards...

But yes, this would definitely be caused by the copying procedure. One idea I had was to create a temporary overlayfs so that the binary would not be overwritable -- but that has a lot of other issues that make it impractical.

@Ace-Tang (Contributor)

I also hit the memory problem. My runc binary is 11M, and one of our tests sets a 10m memory limit; runc create fails with this error:

\"process_linux.go:424: container init caused \\\"process_linux.go:390: setting cgroup config for procHooks process caused \\\\\\\"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/default/96cd18b0b87ba67c64b6fe535607c29af30119229328add6658404239a479905/memory.limit_in_bytes: device or resource busy\\\\\\\"\\\"\"

@Random-Liu (Author) commented Feb 13, 2019

This is a regression. I collected some data with Docker.

Test Environment

  • Machine type: GCE n1-standard-1 (1 vCPU, 3.75 GB memory)
  • Docker version:
$ docker version
Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.11.2
 Git commit:        4d60db4
 Built:             Wed Jan 23 19:35:04 2019
 OS/Arch:           linux/amd64
 Experimental:      false
Server:
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.2
  Git commit:       4d60db4
  Built:            Wed Jan 23 19:34:06 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • runc binaries:
$ ls -alh
total 37M
drwxr-xr-x 2 lantaol lantaol 4.0K Feb 13 10:12 .
drwxr-xr-x 4 lantaol lantaol 4.0K Feb 13 10:11 ..
-rwxr-xr-x 1 lantaol lantaol  11M Feb 13 10:11 runc-dynamic-12f6a991
-rwxr-xr-x 1 lantaol lantaol  11M Feb 13 10:10 runc-dynamic-6635b4f0
-rwxr-xr-x 1 lantaol lantaol 7.7M Feb 13 10:05 runc-static-12f6a991
-rwxr-xr-x 1 lantaol lantaol 7.8M Feb 13 09:57 runc-static-6635b4f0

Dynamically linked binaries are built with make BUILDTAGS="seccomp apparmor".
Statically linked binaries are built with make static BUILDTAGS="seccomp apparmor".
12f6a99 is the last runc version we use in containerd.
6635b4f is the new runc version with the CVE fix.

  • Golang version:
$ go version
go version go1.11.2 linux/amd64

Test Results

  • runc-dynamic-12f6a991:
# Memory limit too low, can't even create the container.
$ docker run -m=4m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/5031f5b2cf99b84da41e24836524fb4ae736d6bc4886ce2e0e75c1f43820803f/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

# Memory limit is high enough to create the container, but the init process gets OOM killed right away.
$ docker run -m=5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

# Memory limit is enough to run the container.
$ docker run -m=6m busybox ls
# ok
  • runc-dynamic-6635b4f0:
$ docker run -m=15m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 15938355 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/6650dbb7f2a8fc8ae58b3ce3d365be8c716321d253d8d29f140e1d45e0dfa818/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=15.5m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=16m busybox ls
# ok
  • runc-static-12f6a991:
$ docker run -m=4m busybox ls
# ok
  • runc-static-6635b4f0:
$ docker run -m=9.2m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 9646899 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/bf1ec70dc29ee0a823ba6aebf2a88633cd875e08715f12651451073e86437fc3/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=9.3m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=10m busybox ls
# ok

Conclusion

  • We need to set a higher memory limit for the container to run, and the minimum limit is larger than the runc binary size.
  • Before 6635b4f, the minimum limit was not that high, and was much lower than the runc binary size.
  • This is a regression for users: their existing workloads may not run without tweaking memory limits.

@crosbymichael (Member)

A good long-term fix is to move all the runc init code to C. This should be fairly simple, as most of it is system-level syscalls. cgroups can remain implemented in Go, since they are set by the calling process.

@giuseppe (Member)

A good long-term fix is to move all the runc init code to C. This should be fairly simple, as most of it is system-level syscalls. cgroups can remain implemented in Go, since they are set by the calling process.

would that be a separate binary?

@crosbymichael (Member)

@giuseppe I was thinking the same binary; we just never allow it to exec into the Go runtime. Maybe that is not possible and we will have the same issue -- it's something we would have to look into.

I know you have a C implementation; it would be interesting to see where we can combine the two, as there are still some areas that are easier to write and maintain in Go and others where C makes more sense, like the init.

@giuseppe (Member)

I know you have a C implementation; it would be interesting to see where we can combine the two, as there are still some areas that are easier to write and maintain in Go and others where C makes more sense, like the init.

that would be great. If there is anything I can do to help out, just let me know :-)

@keloyang (Contributor)

If we restrict the init in the container and don't allow /proc/self/exe, can CVE-2019-5736 be blocked?

@cyphar (Member) commented Feb 14, 2019

@keloyang The problem is that you cannot safely verify (in userspace) whether or not you are going to execute /proc/self/exe -- and because of #!/proc/self/exe even O_PATH and execveat(AT_EMPTY_PATH) won't save you.
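To make that concrete, here is a minimal C sketch of the failed defence (the entrypoint path and the verification step are illustrative, not runc code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv, char **envp)
{
	(void)argc;
	/* Open the container's entrypoint for inspection only. */
	int fd = open("/rootfs/entrypoint", O_PATH);
	if (fd < 0)
		return 1;
	/* Any fstat(2)/inode check here passes -- the file is not runc.
	 * But if it begins with "#!/proc/self/exe", the kernel resolves
	 * that symlink at exec time, when it still points at the runc
	 * binary, so runc executes itself inside the container anyway. */
	syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
	return 1;
}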

@Random-Liu (Author) commented Feb 14, 2019

This seems to be affecting many people.

I heard that if we put runc on a read-only filesystem (mount -o remount,ro), we won't be affected by the CVE.

In this case, is it possible to opt out of the fix?

@cyphar (Member) commented Feb 14, 2019

Yes, it would be possible to add the ability to opt out of the fix -- but I'm worried about what happens if someone decides to remount the filesystem as read-write. There's no way for us to deal with that (luckily runc's design means there are no long-running processes, but what if the remount happens while runc is doing an operation -- then we have a race where you can attack the binary).

I am currently working on a patch which will expand the O_TMPFILE fallback (which doesn't use memory -- though it does create a file on tmpfs) -- and we can discuss making memfd_create something you can opt out of, so that the O_TMPFILE fallback is used instead. But I'm absolutely against making it easy to entirely disable the fix -- that seems like a really bad idea.
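For context, the memfd_create(2) path being discussed works roughly like this (condensed from the code shown in the diff further down; error handling elided):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

int clone_binary_memfd(void)
{
	/* Create a sealable anonymous file; its pages are anonymous memory,
	 * which is exactly what gets charged to the container's memcg. */
	int memfd = memfd_create("runc_cloned:/proc/self/exe",
				 MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0)
		return -1;
	/* ... sendfile(2) /proc/self/exe into memfd here ... */
	/* Seal the copy so that nothing can modify it afterwards. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SEAL | F_SEAL_SHRINK |
				      F_SEAL_GROW | F_SEAL_WRITE) < 0)
		return -1;
	return memfd;
}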

@Random-Liu (Author) commented Feb 14, 2019

@cyphar tmpfs memory usage is also charged to the memory cgroup, right? Why is it better than memfd_create? Just for my education.

Actually, I tried hard-coding runc to completely skip the memfd_create code path, but got the same result:

$ ls -alh binary/runc-static-patched-6635b4f0 
-rwxr-xr-x 1 lantaol lantaol 7.8M Feb 14 22:58 binary/runc-static-patched-6635b4f0

$ docker run -m=9.2m busybox ls
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 9646899 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/f44f29ca39e754892e73d15c03d3ff12b46388fb13ea4c01100835b78f7afa75/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.

$ docker run -m=9.3m busybox ls
docker: Error response from daemon: cannot start a stopped process: unknown.

$ docker run -m=10m busybox ls
# ok

Diff:

index c8a42c23..1817bc72 100644
--- a/libcontainer/nsenter/cloned_binary.c
+++ b/libcontainer/nsenter/cloned_binary.c
@@ -32,23 +32,6 @@
 #include <sys/sendfile.h>
 #include <sys/syscall.h>
 
-/* Use our own wrapper for memfd_create. */
-#if !defined(SYS_memfd_create) && defined(__NR_memfd_create)
-#  define SYS_memfd_create __NR_memfd_create
-#endif
-#ifdef SYS_memfd_create
-#  define HAVE_MEMFD_CREATE
-/* memfd_create(2) flags -- copied from <linux/memfd.h>. */
-#  ifndef MFD_CLOEXEC
-#    define MFD_CLOEXEC       0x0001U
-#    define MFD_ALLOW_SEALING 0x0002U
-#  endif
-int memfd_create(const char *name, unsigned int flags)
-{
-	return syscall(SYS_memfd_create, name, flags);
-}
-#endif
-
 /* This comes directly from <linux/fcntl.h>. */
 #ifndef F_LINUX_SPECIFIC_BASE
 #  define F_LINUX_SPECIFIC_BASE 1024
@@ -65,11 +48,6 @@ int memfd_create(const char *name, unsigned int flags)
 #endif
 
 #define RUNC_SENDFILE_MAX 0x7FFFF000 /* sendfile(2) is limited to 2GB. */
-#ifdef HAVE_MEMFD_CREATE
-#  define RUNC_MEMFD_COMMENT "runc_cloned:/proc/self/exe"
-#  define RUNC_MEMFD_SEALS \
-	(F_SEAL_SEAL | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)
-#endif
 
 static void *must_realloc(void *ptr, size_t size)
 {
@@ -93,15 +71,10 @@ static int is_self_cloned(void)
 	if (fd < 0)
 		return -ENOTRECOVERABLE;
 
-#ifdef HAVE_MEMFD_CREATE
-	ret = fcntl(fd, F_GET_SEALS);
-	is_cloned = (ret == RUNC_MEMFD_SEALS);
-#else
 	struct stat statbuf = {0};
 	ret = fstat(fd, &statbuf);
 	if (ret >= 0)
 		is_cloned = (statbuf.st_nlink == 0);
-#endif
 	close(fd);
 	return is_cloned;
 }
@@ -203,11 +176,7 @@ static int clone_binary(void)
 	int binfd, memfd;
 	ssize_t sent = 0;
 
-#ifdef HAVE_MEMFD_CREATE
-	memfd = memfd_create(RUNC_MEMFD_COMMENT, MFD_CLOEXEC | MFD_ALLOW_SEALING);
-#else
 	memfd = open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0711);
-#endif
 	if (memfd < 0)
 		return -ENOTRECOVERABLE;
 
@@ -220,11 +189,6 @@ static int clone_binary(void)
 	if (sent < 0)
 		goto error;
 
-#ifdef HAVE_MEMFD_CREATE
-	int err = fcntl(memfd, F_ADD_SEALS, RUNC_MEMFD_SEALS);
-	if (err < 0)
-		goto error;
-#else
 	/* Need to re-open "memfd" as read-only to avoid execve(2) giving -EXTBUSY. */
 	int newfd;
 	char *fdpath = NULL;
@@ -238,7 +202,6 @@ static int clone_binary(void)
 
 	close(memfd);
 	memfd = newfd;
-#endif
 	return memfd;
 
 error:

@cyphar (Member) commented Feb 15, 2019

tmpfs memory usage is also charged to the memory cgroup, right? Why is it better than memfd_create? Just for my education.

Yes, you're right -- I incorrectly assumed it's charged to kmem, not mem. In that case it's not any better -- though it does work on older kernels (which is the primary reason why I'm working on improving the fallbacks).

There is a way to do it without using extra memory (and I proposed it internally when we were discussing solutions for this vulnerability), but it has the downside that it can't be done with rootless containers and is a bit ugly. We create a temporary overlayfs mount for the runc binary (with a tmpfs as the upper layer) and then use the merged version as /proc/self/exe. This allows us to avoid any copying -- and any attempt to overwrite the binary will just modify the upperdir (which would be unique per-container). It also gives us page-cache sharing for the runc binary.
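A rough sketch of that overlayfs scheme (all paths are hypothetical; directory creation and error handling elided):

#define _GNU_SOURCE
#include <sys/mount.h>

int setup_protected_exe(void)
{
	/* Per-container tmpfs holding the upper and work dirs: writes to
	 * the merged "runc" land in the upperdir instead of touching the
	 * host binary, while reads of the lower layer still share the
	 * host's page cache. */
	if (mount("tmpfs", "/run/runc-scratch", "tmpfs", 0, "mode=0700") < 0)
		return -1;
	/* upper/ and work/ must be created inside the tmpfs first. */
	return mount("overlay", "/run/runc-merged", "overlay", 0,
		     "lowerdir=/usr/sbin,"
		     "upperdir=/run/runc-scratch/upper,"
		     "workdir=/run/runc-scratch/work");
}

runc would then re-exec the merged copy (here /run/runc-merged/runc) in place of /proc/self/exe before entering the container's namespaces -- which, per the next paragraph, has to happen in a private mount namespace.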

The main downside is that we now would require overlayfs support, and in the case of rootless containers we'd need to make a copy anyway. Not to mention we'd have to have some pretty ugly code to get it all to work -- since we need to set up the mount namespace before we've started doing any operations with the containers' namespaces (otherwise we're poisoning the host mount namespace).

But as @crosbymichael said, if we separate out runc init and make it all C code then the binary size would be a much smaller problem (no pun intended).

@cyphar (Member) commented Feb 15, 2019

Another idea would be to use O_TMPFILE on the runc state directory (which runc needs to have write access to) and thus we'd be able to avoid charging memcg when using that. The main problem is that we aren't able to get ENOMEM with memfd_create -- because the only way we grow the size of the memfd is by writing to it (and thus we get hit with an OOM).
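A sketch of that variant (assuming /run/runc as the state directory; error handling elided):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <unistd.h>

int clone_binary_tmpfile(void)
{
	/* An anonymous O_TMPFILE in the state directory is backed by that
	 * directory's filesystem, not by memcg-charged anonymous memory
	 * the way a memfd is. */
	int binfd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	int tmpfd = open("/run/runc", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0700);
	if (binfd < 0 || tmpfd < 0)
		return -1;
	/* sendfile(2) is limited to ~2GB per call, so loop until done. */
	while (sendfile(tmpfd, binfd, NULL, 0x7FFFF000) > 0)
		;
	close(binfd);
	/* The caller re-opens the copy read-only before exec to avoid
	 * ETXTBSY, as the existing fallback does. */
	return tmpfd;
}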

@Random-Liu (Author) commented Feb 15, 2019

It would still be better to find a way to eliminate this. :/

If we really can't find a better solution, we should at least broadly advertise this, so that users know they should increase their memory limits. :) In the new GKE release, we are going to mention this, because our Ubuntu image is going to carry the runc fix.

We may need a better channel to advertise that, e.g. runc release note, tweet?

@cgwalters (Contributor)

@cyphar I argued for a while that we should have a way to make regular files content-immutable.

@cgwalters (Contributor)

Or maybe another approach: have runc fork off a once-per-uid (lockfile in /run?) instance of itself that holds the binary open. Or require distributors to ship a systemd unit that does that: add a runc sleep command and ExecStart=/usr/bin/runc sleep?

@cyphar (Member) commented Feb 17, 2019

While that would somewhat solve the problem, I really don't like that solution (and I'm worried about what happens if the service is restarted -- how sure are you that the attacking process won't get a chance to overwrite the host binary before it's killed?).

Personally I think #1984 could be changed so that instead of /tmp we use the runc state directory, and add an environment variable you can use to disable the use of memfd_create. I think that would be a much simpler solution that doesn't require removing the actual protection -- because while /proc/self/exe re-opening can be blocked by just having the process stick around, I really don't trust passing a struct file of a host binary to a container.

@lifubang (Member)

How about copying the runc binary from /usr/bin to /tmp with a new random file name before we run runc init? We would use this new binary to exec runc init.
Because the copy happens outside runc init, its memory usage will not be charged to the container's memory usage.
After the runc init process has started, we delete this temp runc binary in /tmp.
The only problem is:
Though the temp runc binary in /tmp is deleted, the /proc/self/exe link can still be exec'd via syscall.Exec when starting the container's init command.
So CVE-2019-5736 can still change the content of the temp runc binary in /tmp.
So, if an administrator runs these temp runc binaries in /tmp just for fun, they may run a virus. But I think it may not affect Docker and k8s.

I have tested it on my server, based on the v1.0.0-rc5 version. The memory usage is the same as before.

root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/ubuntu# docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
dev
etc
home
proc
root
sys
tmp
usr
var

But I'm not sure it works for fixing CVE-2019-5736.
If it does make sense, I'll send a PR.

@cyphar (Member) commented Feb 18, 2019

@lifubang

How about copying the runc binary from /usr/bin to /tmp with a new random file name before we run runc init?

That is precisely what #1984 does -- and the current "no memfd_create(2)" fallback does as well. It gets around the problem you mention by unlinking it before using it, so that you can't accidentally use the binary (since the only reference to it is the file descriptor).
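A simplified C sketch of that unlink-before-use pattern (hypothetical helper; the real fallback copies with sendfile(2) and re-opens the descriptor read-only before exec):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int exec_anonymous_copy(char *const argv[], char *const envp[])
{
	char tmpl[] = "/tmp/runc.XXXXXX";
	int fd = mkostemp(tmpl, O_CLOEXEC);
	if (fd < 0)
		return -1;
	/* Unlink immediately: the fd is now the only reference, so no
	 * usable binary is left in /tmp for anyone to run by accident. */
	unlink(tmpl);
	/* ... copy the runc binary into fd, then re-open it read-only
	 * via /proc/self/fd to avoid ETXTBSY ... */
	fexecve(fd, argv, envp);
	return -1; /* reached only if the exec failed */
}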

@Random-Liu (Author) commented Feb 22, 2019

I am working on an update to #1984 which will change it so that it uses c.root (the runc state directory) as the O_TMPFILE target -- which means that even on distributions that use /tmp as tmpfs you won't have this issue.

@cyphar SG!

And actually we don't see this problem on Ubuntu yet -- maybe just because, as you said, /tmp is not a tmpfs on our Ubuntu.

@lifubang (Member) commented Feb 22, 2019

With the latest patch in #1984 (dynamic build), it works fine most of the time, but there is roughly a 1 in 10 to 1 in 30 chance of getting OOM-killed.

Build:

root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers# git clone https://github.com/opencontainers/runc
Cloning into 'runc'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 19197 (delta 1), reused 1 (delta 0), pack-reused 19193
Receiving objects: 100% (19197/19197), 7.15 MiB | 1.39 MiB/s, done.
Resolving deltas: 100% (12414/12414), done.
Checking connectivity... done.
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers# cd runc
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# git fetch origin pull/1984/head:pr-1984
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (9/9), done.
Unpacking objects: 100% (22/22), done.
remote: Total 22 (delta 14), reused 13 (delta 10), pack-reused 3
From https://github.com/opencontainers/runc
 * [new ref]          refs/pull/1984/head -> pr-1984
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# git checkout pr-1984
Switched to branch 'pr-1984'
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# make BUILDTAGS="apparmor selinux seccomp"
go build -buildmode=pie  -ldflags "-X main.gitCommit="89c07553177c66cfe07de11fcdc17b59621ead07" -X main.version=1.0.0-rc6+dev " -tags "apparmor selinux seccomp" -o runc .
root@iZ2ze1o61blvco5p5ducnnZ:/opt/mygo/src/github.com/opencontainers/runc# cp ./runc /usr/bin/docker-runc

With -m 4m (about a 1 in 10 chance of failure):

root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
...

real	0m1.026s
user	0m0.080s
sys	0m0.008s
root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 4m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\\"failed to write 4194304 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/docker/a3f6c2cbe4b04f03f51526c8654c9b7ff7b671dc1e0946920ddbada0f2ff19e5/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

real	0m0.741s
user	0m0.064s
sys	0m0.024s

With -m 6m (about a 1 in 30 chance of failure):

root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 6m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
bin
...

real	0m1.083s
user	0m0.076s
sys	0m0.012s
root@iZ2ze1o61blvco5p5ducnnZ:/opt# time docker run --rm -m 6m busybox ls
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
docker: Error response from daemon: cannot start a stopped process: unknown.
ERRO[0001] error waiting for container: context canceled 

real	0m1.050s
user	0m0.080s
sys	0m0.008s

@cyphar (Member) commented Feb 22, 2019

@lifubang By default it's still using memfd_create(2) -- you would need to pass _LIBCONTAINER_DISABLE_MEMFD_CLONE=1 as an environment variable to stop it. However, I am thinking of just fixing when cgroup.Apply() occurs so that this doesn't happen anymore. I don't like having to pass an environment variable to disable something like that.

@lifubang (Member)

Thank you for your work.
_LIBCONTAINER_DISABLE_MEMFD_CLONE=1 didn't work at the time. Please see:
#1984 (comment)

@justinsb

Is the chattr +i /sbin/runc approach equivalent to the copying? (e.g. https://seclists.org/oss-sec/2019/q1/134 ) I would imagine this approach would not have any significant memory overhead.

One approach could be to verify that the immutable attribute was set on the runc binary (and skip the copy if so, or warn/error if not?)
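The check itself would be small -- a sketch of a hypothetical helper (not actual runc code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if chattr +i (FS_IMMUTABLE_FL) is set on /proc/self/exe,
 * 0 if not, and -1 on error (e.g. a filesystem without FS_IOC_GETFLAGS). */
int exe_is_immutable(void)
{
	int flags = 0, ret;
	int fd = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
	close(fd);
	if (ret < 0)
		return -1;
	return !!(flags & FS_IMMUTABLE_FL);
}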

@cyphar (Member) commented Feb 22, 2019

chattr +i is fine so long as you haven't given CAP_LINUX_IMMUTABLE to the container -- I'm not really a fan of having to figure out whether that's true as part of the mitigation -- the code is already complicated enough.

The easier solution is hopefully going to be to evaluate whether we can delay cgroup application until after the copy -- which should be possible and would be much more fool-proof than that.

@Random-Liu (Author)

The easier solution is hopefully going to be to evaluate whether we can delay cgroup application until after the copy -- which should be possible and would be much more fool-proof than that.

@cyphar Are you working on this?

@cyphar (Member) commented Feb 26, 2019

I just pushed an update to #1984 -- it uses a read-only bind-mount. It works, and adds nothing to memory usage. For rootless containers it currently won't work, but that's a fairly niche use case for now.

So this problem has been solved -- with the caveat that runc must be run with sufficient privileges to create a bind-mount (which is the case for Kubernetes and Docker, even if you have user namespaces enabled).
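For reference, the core of a read-only bind-mount is a two-step mount (a minimal sketch; #1984 performs this inside a private mount namespace):

#define _GNU_SOURCE
#include <sys/mount.h>

int make_exe_readonly(const char *path)
{
	/* Bind-mount the binary over itself... */
	if (mount(path, path, NULL, MS_BIND, NULL) < 0)
		return -1;
	/* ...then remount that bind read-only; MS_RDONLY cannot be set in
	 * the same call as the initial MS_BIND, hence the two steps. */
	return mount(NULL, path, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL);
}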

@rhatdan (Contributor) commented Feb 26, 2019

Rootless containers don't need the fix anyway, since a process running as non-root cannot overwrite a file owned by root.

I guess if you were running a runc in your homedir, this could be a problem, but I know of no one who does this.

@cyphar (Member) commented Feb 26, 2019

I guess if you were running a runc in your homedir, this could be a problem, but I know of no one who does this.

The whole purpose of (and my initial justification for working on) rootless containers is precisely to be able to do that, without requiring an admin to install binaries for you (if they can install binaries for you, they can install a setuid binary for you too -- defeating the purpose). So the fix is needed regardless. The Usernetes distribution works like this, and I hope it will eventually gain wider usage, but I definitely want to protect rootless containers.

(These days, "rootless containers" has become a misnomer -- the original idea was that no administrator intervention would be required -- and that's still possible today, but it really doesn't help that the concept isn't agreed upon and the default setup actually uses setuid binaries, contrary to the whole point.)

@giuseppe (Member)

For the rootless implementations in Buildah and Podman, we create the user namespace before we call into the OCI runtime, so the remount to read-only should just work. Usernetes does the same thing.

@cyphar (Member) commented Feb 26, 2019

Ah right, because it uses rootlesskit -- sorry for misstating that. 😉

However, the fix is still definitely needed because the "canonical" usernetes installation has runc installed as the same user as the person who ran it. Obviously with newuidmap you are protected, but we still need protections for "true rootless" (no funny business) mode.

cgwalters added a commit to cgwalters/mantle that referenced this issue Oct 10, 2019
RichardoC added a commit to thought-machine/conntest that referenced this issue Mar 11, 2020