This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

Update 2019 0308 #20

Merged
merged 15 commits into from
Mar 8, 2019

Commits on Mar 8, 2019

  1. factor out bind mount mountpoint creation

    During rootfs setup all mountpoints (directory and files) are created
    before bind mounting the bind mounts. This does not happen during
    container restore via CRIU. If restoring in an identical but newly created
    rootfs, the restore fails right now. This just factors out the code to
    create the bind mount mountpoints so that it also can be used during
    restore.
    
    Signed-off-by: Adrian Reber <areber@redhat.com>
    adrianreber authored and Ace-Tang committed Mar 8, 2019
    Commit c75478a
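    The following C sketch shows what creating the bind mount mountpoints involves. runc's rootfs setup is Go code, so this is not its implementation, and the helper names `mkdir_all` and `create_bind_mountpoint` are hypothetical. It assumes the mountpoint's type must match the source: a directory source needs a directory target, and any other source needs an empty regular file.

    ```c
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical helper: "mkdir -p" with mode 0755. */
    static int mkdir_all(const char *path)
    {
        char buf[PATH_MAX];
        struct stat st;
        size_t len = strlen(path);

        if (len == 0 || len >= sizeof(buf)) {
            errno = len ? ENAMETOOLONG : EINVAL;
            return -1;
        }
        memcpy(buf, path, len + 1);
        /* Create every intermediate component, tolerating existing ones. */
        for (char *p = buf + 1; *p; p++) {
            if (*p != '/')
                continue;
            *p = '\0';
            if (mkdir(buf, 0755) < 0 && errno != EEXIST)
                return -1;
            *p = '/';
        }
        if (mkdir(buf, 0755) < 0 && errno != EEXIST)
            return -1;
        /* EEXIST is only acceptable if the final component is a directory. */
        if (stat(buf, &st) < 0)
            return -1;
        if (!S_ISDIR(st.st_mode)) {
            errno = ENOTDIR;
            return -1;
        }
        return 0;
    }

    /* Hypothetical helper: create the mountpoint at dest for bind-mounting src.
     * Directories get a directory; anything else (regular file, device node,
     * socket) gets an empty regular file. Existing mountpoints are kept. */
    static int create_bind_mountpoint(const char *src, const char *dest)
    {
        char parent[PATH_MAX];
        struct stat st;
        size_t len = strlen(dest);

        if (stat(src, &st) < 0)
            return -1;
        if (S_ISDIR(st.st_mode))
            return mkdir_all(dest);

        if (len >= sizeof(parent)) {
            errno = ENAMETOOLONG;
            return -1;
        }
        memcpy(parent, dest, len + 1);
        char *slash = strrchr(parent, '/');
        if (slash && slash != parent) {
            *slash = '\0';
            if (mkdir_all(parent) < 0)
                return -1;
        }
        /* No O_TRUNC: an existing file at dest must not be clobbered. */
        int fd = open(dest, O_CREAT | O_RDONLY | O_CLOEXEC, 0644);
        if (fd < 0)
            return -1;
        close(fd);
        return 0;
    }
    ```

    Because the helper leaves existing mountpoints untouched, calling it both at container start and at restore time is safe.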
  2. Create mountpoints also on restore

    runc creates all missing mountpoints when it starts a container; this
    commit also creates those mountpoints during restore. Now it is possible
    to restore a container using the same, but newly created, rootfs, just as
    during container start.
    
    Signed-off-by: Adrian Reber <areber@redhat.com>
    adrianreber authored and Ace-Tang committed Mar 8, 2019
    Commit b38e523
  3. Remove detection for scope properties, which have always been broken

    The detection for scope properties (whether scope units support
    DefaultDependencies= or Delegate=) has always been broken, since systemd
    refuses to create scopes unless at least one PID is attached to them (and
    this has been the case since scope units were introduced in systemd v205).
    
    This can be seen in journal logs whenever a container is started with
    libpod:
    
      Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
      Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
    
    Since this logic never worked, just assume both attributes are supported
    (which is what the code does when detection fails for this reason, since
    it's looking for an "unknown attribute" or "read-only attribute" to mark
    them as false) and skip the detection altogether.
    
    Signed-off-by: Filipe Brandenburger <filbranden@google.com>
    filbranden authored and Ace-Tang committed Mar 8, 2019
    Commit 325881f
  4. nsexec (CVE-2019-5736): avoid parsing environ

    My first attempt to simplify this and make it less costly focused on
    the way constructors are called. I was under the impression that the ELF
    specification mandated that argc, argv, and even envp be passed to
    functions located in the .init_array section (aka "constructors").
    Actually, the specification says (cf. [2]):
    
    SHT_INIT_ARRAY
    This section contains an array of pointers to initialization functions,
    as described in ``Initialization and Termination Functions'' in Chapter
    5. Each pointer in the array is taken as a parameterless procedure with
    a void return.
    
    which means that this becomes a libc-specific decision. Glibc passes
    those arguments down; musl doesn't. So this approach can't work. However,
    we can at least remove the environment parsing based on POSIX, since it
    mandates that an environ variable be defined in unistd.h which provides
    access to the environment (see the relevant Open Group specification [1]).
    
    [1]: http://pubs.opengroup.org/onlinepubs/9699919799/
    [2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array
    
    Fixes: CVE-2019-5736
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Christian Brauner authored and Ace-Tang committed Mar 8, 2019
    Commit 821fc41
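    A minimal C sketch of the POSIX approach: the constructor takes no parameters, as the gABI quote above requires, and reads the environment through `environ` instead of parsing it by hand. `env_lookup`, `early_init`, and the two globals are hypothetical; `env_lookup` is a stand-in for getenv(3), written out only to show that no parsing step is needed.

    ```c
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>

    /* POSIX: the environment is reachable through environ. */
    extern char **environ;

    /* Return the value of NAME, or NULL if it is unset. */
    static const char *env_lookup(const char *name)
    {
        size_t len = strlen(name);

        for (char **ep = environ; ep && *ep; ep++) {
            /* Match "NAME=" exactly, not just a prefix of a longer name. */
            if (strncmp(*ep, name, len) == 0 && (*ep)[len] == '=')
                return *ep + len + 1;
        }
        return NULL;
    }

    static int constructor_ran;
    static const char *path_at_startup; /* NULL if PATH is unset */

    /* .init_array entries are parameterless: no argc/argv/envp here.
     * libc has already initialised environ before constructors run. */
    static void __attribute__((constructor)) early_init(void)
    {
        constructor_ran = 1;
        path_at_startup = env_lookup("PATH");
    }
    ```

    This works the same under glibc and musl, because it does not depend on whether libc passes arguments to constructors.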
  5. Vendor in go-criu

    Now that CRIU has released Go bindings, this commit vendors those in.
    
    At first it only replaces the copy of the RPC interface, but the goal is
    to use CRIU functions from the Go bindings instead of replicating the
    functionality in runc.
    
    Signed-off-by: Adrian Reber <areber@redhat.com>
    adrianreber authored and Ace-Tang committed Mar 8, 2019
    Commit 2032cf8
  6. Use vendored in CRIU Go bindings

    This makes use of the vendored-in Go bindings and removes the copy of
    the CRIU RPC interface definition. runc now relies on go-criu for the RPC
    definition, and hopefully more CRIU functions from the Go bindings can be
    used in the future.
    
    Signed-off-by: Adrian Reber <areber@redhat.com>
    adrianreber authored and Ace-Tang committed Mar 8, 2019
    Commit b3e3811
  7. switched travis to xenial

    The CRIU test for lazy migration was always skipped in Travis because
    the kernel was too old. This switches Travis testing to dist: xenial,
    which provides a newer kernel that enables CRIU lazy migration testing.
    
    Signed-off-by: Adrian Reber <areber@redhat.com>
    adrianreber authored and Ace-Tang committed Mar 8, 2019
    Commit 93faef8
  8. exec: expose --preserve-fds

    The implementation is already there, we only need to add the CLI
    option and pass it down.
    
    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
    giuseppe authored and Ace-Tang committed Mar 8, 2019
    Commit a3227bd
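    For illustration, the effect of passing extra descriptors to a process can be sketched in C (runc's CLI and its `--preserve-fds` handling are Go code, and `preserve_fds` is hypothetical). The sketch assumes "preserving" a descriptor means clearing FD_CLOEXEC so it survives execve(2), with the range starting right after stdio (fd 3) in the caller; any further offset runc applies is left out.

    ```c
    #include <fcntl.h>

    /* Hypothetical helper: clear FD_CLOEXEC on fds first .. first+n-1 so they
     * are inherited across execve(2). Returns 0, or -1 with errno set
     * (e.g. EBADF if the caller asked for a descriptor it does not have). */
    static int preserve_fds(int first, int n)
    {
        for (int fd = first; fd < first + n; fd++) {
            int flags = fcntl(fd, F_GETFD);
            if (flags < 0)
                return -1;
            if ((flags & FD_CLOEXEC) && fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0)
                return -1;
        }
        return 0;
    }
    ```

    A typical call would be `preserve_fds(3, n)` just before exec, with n taken from the command-line option.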
  9. nsenter: cloned_binary: detect and handle short copies

    For a variety of reasons, sendfile(2) can end up doing a short copy, so
    we need to loop until we hit the binary size. Since /proc/self/exe
    is tautologically our own binary, there's no chance someone is going to
    modify it underneath us (or change its size).
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>
    cyphar authored and Ace-Tang committed Mar 8, 2019
    Commit 41fda55
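    The loop described above, as a minimal C sketch (not the code from cloned_binary.c; `copy_all_sendfile` is a hypothetical name). Passing a NULL offset makes sendfile(2) advance the source descriptor's own file offset, so each iteration resumes where the previous short copy stopped.

    ```c
    #include <errno.h>
    #include <sys/sendfile.h>
    #include <sys/types.h>

    /* Copy exactly size bytes from in_fd to out_fd, looping over short
     * copies. Returns 0 on success, -1 with errno set otherwise. */
    static int copy_all_sendfile(int out_fd, int in_fd, size_t size)
    {
        size_t sent = 0;

        while (sent < size) {
            ssize_t n = sendfile(out_fd, in_fd, NULL, size - sent);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (n == 0) {
                /* Source ended before size bytes: treat as an I/O error. */
                errno = EIO;
                return -1;
            }
            sent += (size_t)n;
        }
        return 0;
    }
    ```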
  10. fix preserve-fds flag may cause runc hang

    Signed-off-by: lifubang <lifubang@acmcoder.com>
    lifubang authored and Ace-Tang committed Mar 8, 2019
    Commit 1612b17
  11. nsenter: cloned_binary: expand and add pre-3.11 fallbacks

    In order to get around the memfd_create(2) requirement, 0a8e411
    ("nsenter: clone /proc/self/exe to avoid exposing host binary to
    container") added an O_TMPFILE fallback. However, this fallback was
    flawed in two ways:
    
     * It required O_TMPFILE which is relatively new (having been added to
       Linux 3.11).
    
     * The fallback choice was made at compile-time, not runtime. This
       results in several complications when it comes to running binaries
       on machines other than the ones they were built on.
    
    The easiest way to resolve these things is to have fallbacks work in a
    more procedural way (though it does unfortunately make the code more
    complicated) and to add a new fallback that uses mkostemp(3).
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>
    cyphar authored and Ace-Tang committed Mar 8, 2019
    Commit b4e1501
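    A minimal C sketch of a runtime fallback chain, assuming the order memfd_create(2), then O_TMPFILE, then mkostemp(3) plus unlink(2). All names, including the `enum clone_method` parameter that lets a caller skip the early techniques, are hypothetical. Because each technique is tried at runtime, a binary built against new headers still works on an older kernel: the newer calls fail and the next technique is attempted.

    ```c
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MFD_CLOEXEC
    #define MFD_CLOEXEC 0x0001U
    #endif

    /* Hypothetical ordering; callers may start further down the list. */
    enum clone_method { METHOD_MEMFD, METHOD_TMPFILE, METHOD_MKOSTEMP };

    /* Return an unlinked read-write fd to copy a binary into, or -1. */
    static int make_anon_file(const char *dir, enum clone_method first)
    {
        char path[PATH_MAX];
        int fd;

        switch (first) {
        case METHOD_MEMFD:
    #ifdef SYS_memfd_create
            /* memfd_create(2): fails with ENOSYS on kernels before 3.17. */
            fd = (int)syscall(SYS_memfd_create, "cloned-binary", MFD_CLOEXEC);
            if (fd >= 0)
                return fd;
    #endif
            /* fall through */
        case METHOD_TMPFILE:
    #ifdef O_TMPFILE
            /* O_TMPFILE: Linux >= 3.11, and the filesystem must support it. */
            fd = open(dir, O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0700);
            if (fd >= 0)
                return fd;
    #endif
            /* fall through */
        case METHOD_MKOSTEMP:
            break;
        }

        /* Last resort, works on any kernel: create a named file, then unlink it. */
        if (snprintf(path, sizeof(path), "%s/cloned-binary.XXXXXX", dir) >= (int)sizeof(path)) {
            errno = ENAMETOOLONG;
            return -1;
        }
        fd = mkostemp(path, O_CLOEXEC);
        if (fd < 0)
            return -1;
        unlink(path); /* from now on only the fd refers to the file */
        return fd;
    }
    ```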
  12. nsenter: cloned_binary: use the runc statedir for O_TMPFILE

    Writing a file to tmpfs actually incurs a memcg penalty, and thus the
    benefit of being able to disable memfd_create(2) with
    _LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
    be noted that quite a few distributions don't use tmpfs for /tmp (and
    instead have it as a regular directory or subvolume of the host
    filesystem).
    
    Since runc must have write access to the state directory anyway (and the
    state directory is usually not on a tmpfs) we can use that instead of
    /tmp -- avoiding potential memcg costs with no real downside.
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>
    cyphar authored and Ace-Tang committed Mar 8, 2019
    Commit 50734ca
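    Whether a given directory is on tmpfs can be checked with statfs(2). The C sketch below reports this and picks the directory for the copy. It assumes the state directory arrives in the `_LIBCONTAINER_STATEDIR` environment variable and that `/tmp` is the fallback when it is unset; both helper names are hypothetical.

    ```c
    #include <stdlib.h>
    #include <sys/vfs.h>

    #ifndef TMPFS_MAGIC
    #define TMPFS_MAGIC 0x01021994 /* value from <linux/magic.h> */
    #endif

    /* Returns 1 if path is on tmpfs (memcg-charged writes), 0 if not, -1 on error. */
    static int on_tmpfs(const char *path)
    {
        struct statfs sfs;

        if (statfs(path, &sfs) < 0)
            return -1;
        return (unsigned long)sfs.f_type == TMPFS_MAGIC;
    }

    /* Prefer the state directory; fall back to /tmp (assumption) if unset. */
    static const char *clone_dir(void)
    {
        const char *statedir = getenv("_LIBCONTAINER_STATEDIR");

        return (statedir && *statedir) ? statedir : "/tmp";
    }
    ```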
  13. nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying

    The usage of memfd_create(2) and other copying techniques is quite
    wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
    memfd_create(2) added ~10M of memory usage to the cgroup associated with
    the container, which can result in some setups getting OOM'd (or just
    hogging the host's memory when you have lots of created-but-not-started
    containers sticking around).
    
    The easiest way of solving this is by creating a read-only bind-mount of
    the binary, opening that read-only bind-mount, and then unmounting it to
    ensure that the host won't accidentally be re-mounted read-write. This
    avoids all copying and cleans up naturally like the other techniques
    used. Unfortunately, like the O_TMPFILE fallback, this requires being
    able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
    over the most obvious path -- /proc/self/exe -- is a *very bad idea*).
    
    Unfortunately, detecting this isn't foolproof: on a system with a
    read-only root filesystem (that might become read-write during "runc
    init" execution), we cannot tell whether we have already done an ro
    remount. As a partial mitigation, we set a _LIBCONTAINER_CLONED_BINARY
    environment variable, which is checked *alongside* the protection being
    present.
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>
    cyphar authored and Ace-Tang committed Mar 8, 2019
    Commit edfc1d9
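    A minimal C sketch of the read-only bind-mount technique and of the combined check, assuming CAP_SYS_ADMIN (without it `ro_bind_open` simply fails, typically with EPERM) and assuming the marker variable's value is "1". It covers only the bind-mount case, the helper names are hypothetical, and it is not the code from cloned_binary.c. A bind mount ignores MS_RDONLY when it is created, so a separate MS_REMOUNT is needed to make it read-only.

    ```c
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mount.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    /* Hypothetical helper: bind-mount src read-only over the existing file
     * target, open it, then lazily unmount. Returns a read-only fd or -1. */
    static int ro_bind_open(const char *src, const char *target)
    {
        int fd = -1, saved;

        if (mount(src, target, NULL, MS_BIND, NULL) < 0)
            return -1;
        /* MS_RDONLY is ignored at bind time; remount to apply it. */
        if (mount(NULL, target, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == 0)
            fd = open(target, O_RDONLY | O_CLOEXEC);
        saved = errno;
        /* Detach the mount; an open fd keeps the read-only mount alive. */
        umount2(target, MNT_DETACH);
        errno = saved;
        return fd;
    }

    /* Hypothetical check: fd counts as protected only if BOTH the marker is
     * set AND fd sits on a read-only mount, since a read-only mount alone
     * could just be a read-only root filesystem. */
    static int looks_protected(int fd)
    {
        struct statvfs sv;
        const char *marker = getenv("_LIBCONTAINER_CLONED_BINARY");

        if (marker == NULL || strcmp(marker, "1") != 0)
            return 0;
        if (fstatvfs(fd, &sv) < 0)
            return 0;
        return (sv.f_flag & ST_RDONLY) != 0;
    }
    ```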
  14. nsenter: cloned_binary: userspace copy fallback if sendfile fails

    There are some circumstances where sendfile(2) can fail (one example is
    that AppArmor appears to block writing to deleted files with sendfile(2)
    under some circumstances) and so we need to have a userspace fallback.
    It's fairly trivial (and handles short writes).
    
    Signed-off-by: Aleksa Sarai <asarai@suse.de>
    cyphar authored and Ace-Tang committed Mar 8, 2019
    Commit e97c6c5
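    The userspace fallback amounts to a read(2)/write(2) loop. A minimal C sketch (the name `copy_all_rw` is hypothetical; this is not the code from cloned_binary.c) that copies until EOF, retries on EINTR, and completes short writes:

    ```c
    #include <errno.h>
    #include <unistd.h>

    /* Copy everything from in_fd to out_fd. Returns bytes copied or -1. */
    static ssize_t copy_all_rw(int out_fd, int in_fd)
    {
        char buf[4096];
        ssize_t total = 0;

        for (;;) {
            ssize_t nread = read(in_fd, buf, sizeof(buf));
            if (nread < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (nread == 0)
                return total; /* EOF */
            /* write(2) may accept fewer bytes than asked: finish the rest. */
            for (ssize_t off = 0; off < nread;) {
                ssize_t nw = write(out_fd, buf + off, (size_t)(nread - off));
                if (nw < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                off += nw;
            }
            total += nread;
        }
    }
    ```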
  15. README: link to /org/security/

    Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
    vbatts authored and Ace-Tang committed Mar 8, 2019
    Commit 5ac9a74