Skip to content
This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

Update 2019 0308 #20

Merged
merged 15 commits into from
Mar 8, 2019
Merged

Update 2019 0308 #20

merged 15 commits into from
Mar 8, 2019

Conversation

Ace-Tang
Copy link
Collaborator

@Ace-Tang Ace-Tang commented Mar 8, 2019

Update to opencontainer/runc, the full diff please refer:
opencontainers/runc@f661e02...2b18fe1

adrianreber and others added 15 commits March 8, 2019 09:42
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):

SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.

which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array

Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Now that CRIU has released Go bindings, this commit vendors those in.

At first it only replaces the copy of RPC interface but the goal is to
use CRIU functions from the Go bindings instead of replicating the
functionality in runc.

Signed-off-by: Adrian Reber <areber@redhat.com>
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
The CRIU test for lazy migration was always skipped in Travis because
the kernel was too old. This switches Travis testing to dist: xenial
which provides a newer kernel which enables CRIU lazy migration testing.

Signed-off-by: Adrian Reber <areber@redhat.com>
The implementation is already there, we only need to add the CLI
option and pass it down.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: lifubang <lifubang@acmcoder.com>
In order to get around the memfd_create(2) requirement, 0a8e411
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

 * It required O_TMPFILE which is relatively new (having been added to
   Linux 3.11).

 * The fallback choice was made at compile-time, not runtime. This
   results in several complications when it comes to running binaries
   on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 7 committers have signed the CLA.

✅ lifubang
❌ adrianreber
❌ filbranden
❌ brauner
❌ giuseppe
❌ cyphar
❌ vbatts
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants