ccon

A single binary to handle basic container creation. The goal is to produce a lightweight tool in C that can serve as a test-bed for Open Container Initiative Runtime Specification development. Ccon is a thin wrapper around the underlying syscalls and kernel primitives. It makes it easy to apply a given configuration, but does not have an opinion about what a container should look like (it's even less opinionated than LXC).

Lifecycle

When you invoke it from the command line, ccon clones a child process to create any new namespaces declared in the config file. The parent process continues running in the host namespace. When the child process exits, the host process collects its exit status and returns it to the caller. During an initial setup phase, the two processes pass messages on a Unix socket to synchronize the container setup. Here's an outline of the lifecycle:

| Host process | | Container process |
|---|---|---|
| opens host executable | | |
| opens namespace files | | |
| clones child | → | (clone unshares namespaces) |
| sets user-ns mappings | | blocks on user-ns mappings |
| sends mappings-complete | → | |
| blocks on full namespace | | joins namespaces |
| | | mounts filesystems |
| | ← | sends namespaces-complete |
| runs post-create hooks | | blocks on exec-message |
| binds to socket path | | |
| sends connection socket | → | |
| blocks on exec-process message | | listens for process JSON |
| | ← | sends exec-process message |
| removes socket path | | opens the local ptmx |
| | ← | sends pseudoterminal master |
| | | bind mounts /dev/console |
| | ← | sends pseudoterminal slave |
| waits on child death, splicing standard streams onto the pseudoterminal master | | executes user process |
| | | dies |
| collects child exit code | | |
| runs post-stop hooks | | |
| exits with child's code | | |

A number of those steps are optional. For details, see the relevant section in the configuration specification. In general, leaving out a particular value (e.g. namespaces.user.setgroups or namespaces.mount.mounts) will result in that potential action (e.g. writing to /proc/{pid}/setgroups or calling mount) being skipped, while the rest of ccon carries on as usual.

Users who need to join namespaces before unsharing namespaces can use nsenter or a wrapping ccon invocation to join those namespaces before the main ccon invocation creates the new mount namespace.

Socket communication

With --socket=PATH, ccon will bind a SOCK_SEQPACKET Unix socket to PATH. This path is created after namespace-setup completes, so users can use its presence as a trigger for further configuration (e.g. network setup) before starting the user-specified code. The path is removed after a start request is received or after the container process exits, whichever comes first.

The ccon-cli program distributed with this repository is one client for the ccon socket.

Getting the container process's PID

An SO_PEERCRED request will return the container process's PID in the receiving process's PID namespace. The client can use this to look up the container process in their local /proc. This request may be performed as many times as you like.
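
As a rough illustration, a client could issue that SO_PEERCRED request along the following lines. This is a minimal Python sketch, not ccon-cli's actual implementation; the helper names `connect` and `peer_pid` are hypothetical.

```python
import socket
import struct


def connect(path):
    """Connect to a SOCK_SEQPACKET Unix socket at path (hypothetical helper)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    sock.connect(path)
    return sock


def peer_pid(sock):
    """Return the PID of the process on the other end of a connected Unix socket.

    The kernel reports the PID as seen from the caller's PID namespace.
    """
    # struct ucred on Linux is { pid_t pid; uid_t uid; gid_t gid; }.
    ucred_format = "iII"
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(ucred_format)
    )
    pid, _uid, _gid = struct.unpack(ucred_format, raw)
    return pid
```

With a ccon instance listening on `/tmp/ccon-sock`, `peer_pid(connect("/tmp/ccon-sock"))` would be expected to return the container process's PID. Because the request does not send any message, it can be repeated without triggering the user-specified code.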

Start request

The request is a single struct iovec containing either a leading null byte or process JSON. Sending a single null-byte message will trigger the process field present in the original configuration, while non-empty strings will completely override that field.

The response is a single struct iovec containing either a single null-byte message (for success) or an error message encoded in ASCII (RFC 1345). In this context, “success” means “successfully received the start request”, because the container process sends the response before actually executing the user-specified code.

If you set host in your process JSON, ccon-cli will open the referenced path and pass the open file descriptor to the container over the Unix socket.
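
The request and response payloads described above can be sketched as follows. This is a hedged Python illustration rather than the ccon-cli source: the function names are hypothetical, and sending the process JSON as UTF-8 bytes without a trailing null byte is an assumption.

```python
import json


def encode_start_request(process=None):
    """Build the payload for a start request.

    None means "run the process from the original configuration" and is
    encoded as a single null byte; anything else is serialized as process
    JSON that overrides the configured process.
    """
    if process is None:
        return b"\0"
    # Assumption: the JSON is sent as plain UTF-8 bytes.
    return json.dumps(process).encode("utf-8")


def check_start_response(payload):
    """Raise RuntimeError if the response carries an error message."""
    if payload == b"\0":
        return  # the container process received the start request
    raise RuntimeError(payload.decode("ascii", errors="replace"))
```

A client would send the encoded request as one message on the connected SOCK_SEQPACKET socket and pass the single reply message to `check_start_response`. A null-byte reply only confirms receipt, not that the user-specified code ran successfully.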

Example

In one shell, launch ccon and have it listen on a socket at /tmp/ccon-sock:

$ ccon --socket /tmp/ccon-sock

In a second shell, get the container process's PID, but don't trigger the user-specified code:

$ PID=$(ccon-cli --socket /tmp/ccon-sock --pid)
$ echo "${PID}"
2186

You can then perform additional configuration using that PID:

$ ip link set ccon-ex-veth1 netns "${PID}"

And when you're finished setting up the environment, you can trigger the user-specified code:

$ ccon-cli --socket /tmp/ccon-sock --config-string '{"args": ["busybox", "sh"]}'

Configuration

Ccon is similar to an Open Container Initiative Runtime Specification runtime in that it reads a configuration file named config.json from its current working directory. However, the JSON content is a bit different to highlight how the components relate to each other on Linux. For example, setting per-container mounts requires a mount namespace, so ccon's mount listing falls under namespaces.mount.mounts. There's an example in config.json that unprivileged users should be able to use to launch an interactive BusyBox shell in new namespaces (you may need to adjust the hostID entries to match id -u and id -g).

You can load the configuration from a different file by giving its path with the --config option. For example:

$ ccon --config path/to/config.json

or:

$ ccon --config /dev/fd/4 4<path/to/config.json

or (using Bash's process substitution):

$ ccon --config <(echo '{"version": "0.5.0", "process": …}')

You can also specify the config JSON directly on the command line with --config-string, which may be convenient in situations where using pipes or process substitution is too awkward:

$ ccon --config-string '{"version": "0.5.0", "process": …}'

There are additional examples focusing on specific tasks in the examples/ directory.

Version

The ccon version represented in the config file.

Example

"version": "0.5.0"

Namespaces

A set of namespaces to be created or joined by the container process. Keys match the long-form options from unshare and nsenter without their leading hyphens. For each namespace entry, the presence of a path key means the container process will join an existing namespace at the absolute path specified by the path value. The absence of a path key means a new namespace will be created. There may be additional per-namespace configuration in the namespace object. If there is no namespaces entry or its value is an empty object, the container process will inherit all its namespaces from the host process. Similarly, if a particular namespaces entry is missing (e.g. user), the container process will inherit that namespace from the host process.

  • namespaces (optional, object) containing entries for each new or joined namespace.

Example

"namespaces": {
  "uts": {},
  "net": {"path": "/proc/2186/ns/net"},
  "user": {"setgroups": false}
}

Which will create new UTS and user namespaces, join the network namespace at /proc/2186/ns/net, and disable setgroups in the new user namespace.
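
The create-versus-join rule can be expressed in a few lines. The following Python sketch only classifies the entries (it does not call unshare or setns), and the function name `plan_namespaces` is hypothetical.

```python
def plan_namespaces(config):
    """Split configured namespaces into those to create and those to join.

    Returns a sorted list of namespace names to create and a dict mapping
    namespace names to the paths of existing namespaces to join. Namespaces
    missing from the config are inherited and appear in neither result.
    """
    create, join = [], {}
    for name, settings in config.get("namespaces", {}).items():
        if "path" in settings:
            join[name] = settings["path"]
        else:
            create.append(name)
    return sorted(create), join
```

For the example above, this yields `["user", "uts"]` to create and `{"net": "/proc/2186/ns/net"}` to join.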

User namespace

New user namespaces support the /proc/{pid}/{path} files setgroups, uid_map, and gid_map discussed in user_namespaces(7).

  • user (optional, object) which may contain:
    • path (optional, string) the absolute path to a user namespace which the container process should join.
    • setgroups (optional, boolean) whether to enable or disable setgroups. Implemented by writing to /proc/{pid}/setgroups.
    • uidMappings (optional, array of objects) maps user IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/uid_map. Array entries are objects with the following fields:
      • containerID (required, integer) is the start of the mapped UID range in the new namespace.
      • hostID (required, integer) is the start of the mapped UID range in the parent namespace.
      • size (required, integer) is the length of the range of mapped UIDs.
    • gidMappings (optional, array of objects) maps group IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/gid_map. Array entries are objects with the following fields:
      • containerID (required, integer) is the start of the mapped GID range in the new namespace.
      • hostID (required, integer) is the start of the mapped GID range in the parent namespace.
      • size (required, integer) is the length of the range of mapped GIDs.

Debian disables unprivileged user namespaces by default to reduce the risk of exploits based on kernel bugs. If you are comfortable assuming those risks, you can enable it with:

# sysctl kernel.unprivileged_userns_clone=1

Example
"user": {
  "setgroups": false,
  "uidMappings": [
    {
      "containerID": 0,
      "hostID": 1000,
      "size": 1
    }
  ],
  "gidMappings": [
    {
      "containerID": 0,
      "hostID": 1000,
      "size": 1
    }
  ]
},

Which will disable setgroups and map the host user and group 1000 to the container user and group 0.
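
To make the connection to the /proc files concrete, here is a small Python sketch of the text such a configuration would translate to, following the formats described in user_namespaces(7). The function names are hypothetical, and the sketch only builds the strings rather than writing them to /proc/{pid}/.

```python
def format_id_map(mappings):
    """Render uidMappings or gidMappings as uid_map/gid_map file content.

    Each line is "<ID inside namespace> <ID outside namespace> <length>".
    """
    return "".join(
        "{containerID} {hostID} {size}\n".format(**entry) for entry in mappings
    )


def format_setgroups(enabled):
    """Render the setgroups boolean as /proc/{pid}/setgroups content."""
    return "allow" if enabled else "deny"
```

For the example above, both the uid_map and gid_map content would be `0 1000 1` followed by a newline, and the setgroups content would be `deny`.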

Mount namespace

New mount namespaces support the creation of arbitrary mounts, assuming the caller has sufficient privileges for the underlying syscall. The user namespace documentation outlines the mount permissions for processes inside a user namespace.

  • mount (optional, object) which may contain:
    • path (optional, string) the absolute path to a mount namespace which the container process should join.
    • mounts (optional, array) an ordered list of mounts to perform. Array entries are objects with fields based on the mount call:
      • type (string) of mount (see filesystems(5)).
      • source (string) path of mount. This may be optional or required depending on type.
      • target (string, required) path of the mount being created or manipulated.
      • flags (array of strings, optional) MS_* flags to set.
      • data (string, optional) type-specific data for the mount.

If they don't start with a slash, source and target are interpreted as paths relative to ccon's current working directory.

If target does not exist, ccon will attempt to create it by calling mkdir, making multiple calls if necessary. For bind mounts where source is set to a non-directory and target does not exist, ccon will create an empty file at target to mount over.
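
The path resolution and target creation rules above can be sketched in Python as follows. This is an illustration of the described behavior, not a port of ccon's C code: the function names are hypothetical, and creating missing parent directories for the empty-file case is an assumption.

```python
import os


def resolve_mount_path(path, cwd):
    """Interpret path relative to cwd unless it starts with a slash."""
    if path.startswith("/"):
        return path
    return os.path.join(cwd, path)


def prepare_target(mount, cwd):
    """Create the mount target if it is missing and return its resolved path."""
    target = resolve_mount_path(mount["target"], cwd)
    if os.path.lexists(target):
        return target
    source = mount.get("source")
    is_bind = "MS_BIND" in mount.get("flags", [])
    if is_bind and source and not os.path.isdir(resolve_mount_path(source, cwd)):
        # Bind mount of a non-directory: create an empty file to mount over.
        # Assumption: missing parent directories are created first.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "x"):
            pass
    else:
        # Equivalent to repeated mkdir calls for each missing component.
        os.makedirs(target)
    return target
```

The actual mount call is left out because it requires privileges in the relevant mount namespace.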

In addition to the usual types supported by mount, ccon supports a pivot-root type that invokes the pivot_root syscall, shifting the old root to a temporary directory (after which the old root is unmounted and the temporary directory is removed). In that case, the only other field that matters is source, which specifies the new root.

Example
"mount": {
  "mounts": [
    {
      "source": "rootfs",
      "target": "rootfs",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "/etc/resolv.conf",
      "target": "rootfs/etc/resolv.conf",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "root",
      "target": "rootfs/root",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "rootfs",
      "type": "pivot-root"
    }
  ]
}

Which will bind ${PWD}/rootfs to itself (the “trick” mentioned in switch_root(8) which we need for the later pivot), bind the host's resolv.conf onto ${PWD}/rootfs/etc/resolv.conf, bind ${PWD}/root onto ${PWD}/rootfs/root, and pivot to make ${PWD}/rootfs the container root.

PID namespace

There is no special configuration for the PID namespace, although if you are creating both a PID and a mount namespace, you probably want mount entries along the lines of:

{
  "target": "/proc",
  "flags": [
    "MS_PRIVATE",
    "MS_REC"
  ]
},
{
  "target": "/proc",
  "type": "proc",
  "flags": [
    "MS_NOSUID",
    "MS_NOEXEC",
    "MS_NODEV"
  ]
}

For more details, see the “/proc and PID namespaces” section of pid_namespaces(7).

  • pid (optional, object) which may contain:
    • path (optional, string) the absolute path to a PID namespace which the container process should join.

Network namespace

There is no special configuration for the network namespace.

  • net (optional, object) which may contain:
    • path (optional, string) the absolute path to a network namespace which the container process should join.

IPC namespace

There is no special configuration for the IPC namespace.

  • ipc (optional, object) which may contain:
    • path (optional, string) the absolute path to an IPC namespace which the container process should join.

UTS namespace

There is no special configuration for the UTS namespace, although future work might build in support for sethostname.

  • uts (optional, object) which may contain:
    • path (optional, string) the absolute path to a UTS namespace which the container process should join.

Cgroup namespace

There is no special configuration for the cgroup namespace.

  • cgroup (optional, object) which may contain:
    • path (optional, string) the absolute path to a cgroup namespace which the container process should join.

Console

Some applications (including systemd) require a TTY at /dev/console. This setting allows you to provide that console without duping over the container process's standard streams.

For more details on why using the container's /dev/ptmx is important, see the process.terminal documentation.

Process

After the container setup is finished, the container process can optionally adjust its state and execute the configured code. If process isn't specified, the container process will exit (with an exit code of zero) instead of executing a user process (which can be useful for the creation phase of a workflow that separates creation from execution).

  • process (optional, object) configuring the container process after the container is set up.

Example

"process": {
  "args": ["busybox", "sh"]
}

Which will execvpe a BusyBox shell with the host process's user and group (possibly mapped by the user namespace), working directory, and environment.

Terminal

If you launch ccon from a terminal (e.g. tty or test -t 0 return zero), your standard input is already a terminal and you probably don't need to worry about this setting. If you launch ccon from a non-terminal process (e.g. from a webserver that is communicating with the user over a socket), you may want to create a UNIX 98 pseudoterminal to do things like translate the user's control-C into SIGINT for the container.

Containers that do not pivot root or that otherwise keep access to the host ptmx can create such a pseudoterminal by opening the ptmx (e.g. with posix_openpt).

Containers that are pivoting to a new root and mounting their devpts with newinstance will want to ensure that the pseudoterminal is created using a devpts instance that will be accessible after the pivot, and there are a number of issues to consider.

  • terminal (optional, boolean) if true, the process will open its local /dev/ptmx (e.g. with posix_openpt), grant access to the slave with grantpt, dup the pseudoterminal slave over its standard streams, and send the pseudoterminal master back to the host process. The host process will continually copy its standard input to that pseudoterminal master and the pseudoterminal master to its standard output. If console is also true, the same pseudoterminal will be used for both /dev/console and the container process's standard streams.

Before 77356912 (included in version 2.23, released 2016-02-19), glibc's grantpt was more aggressive about changing the pseudoterminal slave's group, which could fail for unprivileged users. Unprivileged users linking older versions of glibc can work around the old behavior by ensuring tty is not defined in the /etc/group visible from the container's mount namespace.

Example
"args": ["sh"],
"terminal": true
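
The hand-off of the pseudoterminal master from the container process to the host process relies on passing a file descriptor over the Unix socket. The Python sketch below (Python 3.9+) shows that general technique under stated assumptions: `pty.openpty` stands in for the posix_openpt and grantpt calls, the function names are hypothetical, and the single null byte accompanying the descriptor is a placeholder rather than ccon's actual message format.

```python
import os
import pty
import socket


def send_master(sock, master_fd):
    """Send a pseudoterminal master over a Unix socket (SCM_RIGHTS)."""
    # Assumption: a one-byte placeholder payload accompanies the descriptor.
    socket.send_fds(sock, [b"\0"], [master_fd])


def receive_master(sock):
    """Receive a pseudoterminal master sent with send_master."""
    _data, fds, _flags, _address = socket.recv_fds(sock, 1, 1)
    return fds[0]


def open_terminal():
    """Open a new pseudoterminal and return (master_fd, slave_fd)."""
    return pty.openpty()
```

After the hand-off, the receiving side can copy its standard input to the received master and the master's output to its standard output, while the sending side dups the slave over its own standard streams.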

User

Adjust the user and group IDs before executing the user-specified code.

Example
"user": {
  "uid": 0,
  "gid": 0,
  "additionalGids": [5, 6]
}

Which will lead to a container process with id output like:

uid=0(root) gid=0(root) groups=0(root),5(tty),6(disk)

Current working directory

Change to a different directory before executing the configured code.

  • cwd (optional, string) to chdir to a different directory. If unset, the current directory will remain the same as the caller's working directory, unless there is a pivot-root entry in namespaces.mount.mounts, in which case the default working directory will be the new root.

Example
"cwd": "/root"

Capabilities

Define the minimum set of capabilities required for the container process. All other capabilities are dropped from all capability sets, including the bounding set, before executing the configured code.

  • capabilities (optional, array of strings) Set of CAP_* flags to set.

If unset, the container process will continue with the caller's capabilities (potentially increased in a child user namespace).

Example
"capabilities": [
  "CAP_NET_BIND_SERVICE",
  "CAP_NET_RAW"
]

Arguments

The command that the container process executes after container setup is complete. The process will inherit any open file descriptors; for example the standard streams (unless terminal is true) or systemd's SD_LISTEN_FDS_START.

  • args (optional, array of strings) holds command-line arguments passed to execvpe. The first argument (args[0]) is also used as the path, unless path is set.

If unset, the container process will exit with status zero instead of executing new code (see Process).

Example
"args": [
  "nginx",
  "-c",
  "/nginx.conf"
]

Which will execute an Nginx server using the configuration in /nginx.conf.

Path

Override args[0] with an alternate path (but the executed code will still see args[0] as its first argument).

  • path (optional, string) sets the path to the executed command. Paths without slashes will be resolved using the PATH environment variable.

Example
"args": ["sh"],
"path": "busybox"

Which will execute the first busybox executable found in your PATH with its argv[0] set to sh.
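
The interaction between args and path can be sketched as below. This is a hedged Python illustration of the lookup that execvpe performs, using `shutil.which` as a stand-in; the function name `resolve_executable` is hypothetical.

```python
import shutil


def resolve_executable(process, search_path):
    """Return (executable, argv) for a process object.

    The command is path if set, otherwise args[0]. Commands without a
    slash are looked up in search_path (a PATH-style string); the
    executable is None when nothing matches.
    """
    command = process.get("path", process["args"][0])
    executable = shutil.which(command, path=search_path)
    # argv is passed through untouched, so argv[0] stays args[0].
    return executable, list(process["args"])
```

With a PATH-style string whose first matching directory is `/bin`, the example above would be expected to resolve to `/bin/busybox` while the executed code still sees `sh` as its first argument.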

Host

Instead of looking up args[0] (or path) in the container mount namespace, look it up in the host mount namespace using the host PATH. This allows you to launch (via execveat, so you need Linux 3.19+) a statically-linked init process that only exists on the host.

  • host (optional, boolean) look up args[0] (or path) in the host mount namespace using the host PATH.

Example
"args": ["sh"],
"path": "busybox",
"host": true

Which will execute the first busybox executable found in the host PATH (looked up in the host mount namespace) with its argv[0] set to sh.

Environment variables

Override the host environment.

  • env (optional, array of strings) holds environment settings for execvpe.

If unset, the container process will use the environ it inherited from the host.

Example
"env": [
  "PATH=/bin:/usr/bin",
  "TERM=xterm"
]

Which will set PATH and TERM.
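
Since the env array is handed to execvpe, it replaces the inherited environment rather than extending it. A minimal Python sketch of that rule follows; the function name is hypothetical, and entries are assumed to have the form NAME=value.

```python
import os


def build_environment(process, inherited=None):
    """Return the environment for the executed code as a dict.

    If the process object has no env key, the inherited environment is
    used unchanged; otherwise only the listed entries are present.
    """
    if "env" not in process:
        return dict(os.environ if inherited is None else inherited)
    # Split on the first "=" only, so values may themselves contain "=".
    return dict(entry.split("=", 1) for entry in process["env"])
```

For the example above, variables such as HOME that were set in the host environment would not be visible to the executed code.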

Hooks

Not all container-related functionality is built into ccon (the only setup handled by the host process is the /proc/{pid}/setgroups, etc., writes for user namespaces). For example, control group manipulation and veth network configuration should be handled with external tools. What ccon provides are hooks so you can call those external tools at the appropriate point in the lifecycle.

  • hooks (optional, object) configuring the hooks run for each hook-triggering event.

Example

"hooks": {
  "post-create": [
    {
      "args": [
        "echo",
        "I'm a post-create"
      ]
    }
  ],
  "post-stop": [
    {
      "args": [
        "echo",
        "I'm a post-stop hook"
      ]
    }
  ]
}

Which will just print messages to the host process's stdout for each hook-triggering event.

Post-create hooks

Hooks run after the container setup is complete but before the configured process is executed. With --socket=PATH these are run just before the socket path is created. This is useful for additional container configuration (e.g. creating cgroups or performing network setup).

  • post-create (optional, array of objects) holds process objects (like process except for stdin handling and the lack of host) to run after the post-create event.

Each hook receives the container process's PID in the host PID namespace on its stdin. Its stdout and stderr are inherited from the host process (unless terminal is true). The hooks are executed in the listed order, the host process waits until each hook exits before executing the next, and a nonzero exit code from any hook will cause the host process to abandon further hook execution and SIGKILL the container process. The host process then resumes the usual lifecycle at “waits on child death”.

Example

"post-create": [
  {
    "args": [
      "mkdir",
      "-p",
      "/sys/fs/cgroup/unified/nginx-0/container"
    ]
  },
  {
    "args": [
      "tee",
      "/sys/fs/cgroup/unified/nginx-0/container/cgroup.procs"
    ]
  }
]

Which will create new nginx-0 and nginx-0/container cgroups in the unified hierarchy (if they don't already exist) and add the container process to that cgroup.
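
The post-create execution rules (sequential execution, PID on stdin, stop at the first failure) can be sketched in Python as follows. The function name is hypothetical, only the args field of each hook is honored, and writing the PID as decimal text followed by a newline is an assumption about the stdin format.

```python
import subprocess


def run_post_create_hooks(hooks, container_pid):
    """Run hooks in order and return True if all of them exited with zero.

    Execution stops at the first nonzero exit code; the caller is then
    expected to SIGKILL the container process and wait for it to die.
    """
    for hook in hooks:
        result = subprocess.run(
            hook["args"],
            # Assumption: the PID is sent as decimal text plus a newline.
            input="{}\n".format(container_pid).encode("ascii"),
        )
        if result.returncode != 0:
            return False
    return True
```

In the cgroup example above, the `tee` hook copies the PID it reads from stdin into cgroup.procs, which is how the container process ends up in the new cgroup.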

Post-stop hooks

Hooks run after the host process has reaped the container process. You could handle this in the shell with:

$ ccon; post_stop_hook_1; post_stop_hook_2

but the most common use will be cleaning up after post-create hooks, and it's nice to configure both in the same place (the ccon config file).

  • post-stop (optional, array of objects) holds process objects (like process except for the lack of host) to run after the post-stop event.

Each hook's standard streams are inherited from the host process (unless terminal is true). The hooks are executed in the listed order, the host process waits until each hook exits before executing the next, and a nonzero exit code from any hook will cause the host process to print a message to stderr, after which it continues as if the hook had exited with zero.

Example

"post-stop": [
  {
    "args": [
      "rmdir",
      "/sys/fs/cgroup/unified/nginx-0/container"
    ]
  },
  {
    "args": [
      "rmdir",
      "/sys/fs/cgroup/unified/nginx-0"
    ]
  }
]

Which will remove the nginx-0/container and nginx-0 cgroups (such as those created by the post-create example). This will only succeed if the cgroups are empty, so if you were using this in production it would be best to:

  • Ensure there were no other processes in those cgroups (e.g. by creating a new PID namespace and adding all additional processes to that namespace before adding them to the nginx-0 cgroup tree)
  • Use a tool like cgdelete to recursively remove nginx-0, which would also remove additional child cgroups beyond nginx-0/container that may have been added by other processes since nginx-0 was created.
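
Unlike post-create hooks, a failing post-stop hook does not stop the remaining hooks from running. A minimal Python sketch of that log-and-continue behavior follows; the function name and the wording of the message are hypothetical, and only the args field of each hook is honored.

```python
import subprocess
import sys


def run_post_stop_hooks(hooks):
    """Run every hook in order and return the number that failed.

    A nonzero exit code is reported on stderr, after which execution
    continues with the next hook.
    """
    failures = 0
    for hook in hooks:
        result = subprocess.run(hook["args"])
        if result.returncode != 0:
            print(
                "post-stop hook {!r} exited with {}".format(
                    hook["args"], result.returncode
                ),
                file=sys.stderr,
            )
            failures += 1
    return failures
```

In the example above, this means the second rmdir still runs (and likely fails as well) if the first one cannot remove nginx-0/container.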

Dependencies

Build dependencies

Ccon is pretty easy to compile, but to use the stock Makefile, you'll need:

Development dependencies

Licensing

Because all the dependencies are GPL-compatible, ccon binaries can be distributed under the GPLv3+.