A single binary to handle basic container creation. The goal is to produce a lightweight tool in C that can serve as a test-bed for Open Container Initiative Runtime Specification development. Ccon is a thin wrapper around the underlying syscalls and kernel primitives. It makes it easy to apply a given configuration, but does not have an opinion about what a container should look like (it's even less opinionated than LXC).
When you invoke it from the command line, ccon clones a child process
to create any new namespaces declared in the config file. The parent
process continues running in the host namespace. When the child
process exits, the host process collects its exit status and returns
it to the caller. During an initial setup phase, the two processes
pass messages on a Unix socket to synchronize the container setup.
Here's an outline of the lifecycle:
| Host process | Container process |
|---|---|
| opens host executable | |
| opens namespace files | |
| clones child → | (clone unshares namespaces) |
| sets user-ns mappings | blocks on user-ns mappings |
| sends mappings-complete → | |
| blocks on full namespace | joins namespaces |
| | mounts filesystems |
| | ← sends namespaces-complete |
| runs post-create hooks | blocks on exec-message |
| binds to socket path | |
| sends connection socket → | |
| blocks on exec-process message | listens for process JSON |
| | ← sends exec-process message |
| removes socket path | opens the local ptmx |
| | ← sends pseudoterminal master |
| | bind mounts /dev/console |
| | ← sends pseudoterminal slave |
| waits on child death | executes user process |
| splicing standard streams | … |
| onto the pseudoterminal master | |
| | dies |
| collects child exit code | |
| runs post-stop hooks | |
| exits with child's code | |
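The handshake in the outline can be illustrated with a minimal Python sketch. This is not ccon's implementation (ccon is written in C and uses clone): it forks a child, exchanges two setup messages over a SOCK_SEQPACKET socket pair, and returns the child's exit code. The function name and the message payloads are hypothetical.

```python
import os
import socket


def run_child_with_handshake(exit_code=7):
    """Fork a child, exchange two setup messages, return (reply, exit code)."""
    host_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    pid = os.fork()
    if pid == 0:
        # Container side: block until the host reports the mappings are written.
        host_sock.close()
        if child_sock.recv(64) != b"mappings-complete":
            os._exit(1)
        # A real runtime would join namespaces and mount filesystems here.
        child_sock.send(b"namespaces-complete")
        child_sock.close()
        os._exit(exit_code)  # stand-in for "executes user process" and "dies"
    # Host side: pretend the user-namespace mappings were written, then notify.
    child_sock.close()
    host_sock.send(b"mappings-complete")
    reply = host_sock.recv(64)
    host_sock.close()
    _, status = os.waitpid(pid, 0)  # "waits on child death"
    return reply, os.WEXITSTATUS(status)  # "exits with child's code"
```

Each side blocks in recv until the other has finished its step, which is the same ordering guarantee the table describes.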
A number of those steps are optional. For details, see the relevant
section in the configuration specification. In general, leaving out a
particular value (e.g. namespaces.user.setgroups or
namespaces.mount.mounts) will result in that potential action
(e.g. writing to /proc/{pid}/setgroups or calling mount) being
skipped, while the rest of ccon carries on as usual.
Users who need to join namespaces before unsharing namespaces can use
nsenter or a wrapping ccon invocation to join those namespaces before
the main ccon invocation creates the new mount namespace.
With --socket=PATH, ccon will bind a SOCK_SEQPACKET Unix socket to
PATH. This path is created after namespace-setup completes, so users
can use its presence as a trigger for further configuration
(e.g. network setup) before starting the user-specified code. The path
is removed after a start request is received or after the container
process exits, whichever comes first.
The ccon-cli program distributed with this repository is one client
for the ccon socket.
An SO_PEERCRED request will return the container process's PID in the
receiving process's PID namespace. The client can use this to look up
the container process in their local /proc. This request may be
performed as many times as you like.
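As a minimal sketch of an SO_PEERCRED lookup (Linux only), the Python below reads the peer credentials from a connected Unix socket. The helper name peer_pid is hypothetical, and a socket pair stands in for a real connection to the ccon socket, so the reported peer is the current process.

```python
import os
import socket
import struct


def peer_pid(sock):
    """Return the PID of the process on the other end of a Unix socket."""
    # struct ucred holds pid_t pid, uid_t uid, gid_t gid.
    fmt = "iII"
    creds = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                            struct.calcsize(fmt))
    pid, _uid, _gid = struct.unpack(fmt, creds)
    return pid


# Stand-in for connecting to the real socket path: both ends of a socket
# pair were created by this process, so the peer PID is our own PID.
left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
print(peer_pid(left) == os.getpid())
```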
The start request is a single struct iovec containing either a leading
null byte or process JSON. Sending a single null-byte message will
trigger the process field present in the original configuration, while
non-empty strings will completely override that field.
The response is a single struct iovec containing either a single
null-byte message (for success) or an error message encoded in ASCII
(RFC 1345). In this context, “success” means “successfully received
the start request”, because the container process sends the response
before actually executing the user-specified code.
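The request and response framing described above can be sketched in Python. The helper names are hypothetical, and details the text does not state (such as whether the JSON needs a terminator, or how file descriptors are passed for host) are left out; a socket pair stands in for the ccon socket.

```python
import json
import socket


def encode_start_request(process=None):
    """A single null byte keeps the configured process; JSON overrides it."""
    if process is None:
        return b"\0"
    return json.dumps(process).encode("utf-8")


def decode_start_response(data):
    """Return None on success, or the ASCII error message otherwise."""
    if data == b"\0":
        return None
    return data.decode("ascii", errors="replace")


# Exercise the framing over a socket pair that stands in for the ccon socket.
client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
client.send(encode_start_request({"args": ["busybox", "sh"]}))
received = server.recv(4096)
server.send(b"\0")  # the fake server acknowledges the request
error = decode_start_response(client.recv(4096))
```

Because the socket is SOCK_SEQPACKET, each send arrives as one message, so no length prefix is needed in this sketch.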
If you set host in your process JSON, ccon-cli will open the
referenced path and pass the open file descriptor to the container
over the Unix socket.
In one shell, launch ccon and have it listen on a socket at /tmp/ccon-sock:
$ ccon --socket /tmp/ccon-sock
In a second shell, get the container process's PID, but don't trigger the user-specified code:
$ PID=$(ccon-cli --socket /tmp/ccon-sock --pid)
$ echo "${PID}"
2186
You can then perform additional configuration using that PID:
$ ip link set ccon-ex-veth1 netns "${PID}"
And when you're finished setting up the environment, you can trigger the user-specified code:
$ ccon-cli --socket /tmp/ccon-sock --config-string '{"args": ["busybox", "sh"]}'
Ccon is similar to an Open Container Initiative Runtime Specification
runtime in that it reads a configuration file named config.json from
its current working directory. However, the JSON content is a bit
different, to highlight how the components relate to each other on
Linux. For example, setting per-container mounts requires a mount
namespace, so ccon's mount listing falls under
namespaces.mount.mounts. There's an example in config.json that
unprivileged users should be able to use to launch an interactive
BusyBox shell in new namespaces (you may need to adjust the hostID
entries to match id -u and id -g).
You can load the configuration from a different file by giving its
path with the --config option. For example:
$ ccon --config path/to/config.json
or:
$ ccon --config /dev/fd/4 4<path/to/config.json
or (using Bash's process substitution):
$ ccon --config <(echo '{"version": "0.5.0", "process": …}')
You can also specify the config JSON directly on the command line with
--config-string, which may be convenient in situations where using
pipes or process substitution is too awkward:
$ ccon --config-string '{"version": "0.5.0", "process": …}'
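When the configuration is generated by another program, serializing it and passing it through --config-string avoids temporary files. A minimal Python sketch, assuming only the version and process fields shown in this document (the config contents are illustrative, and the sketch prints the command rather than running it):

```python
import json
import shlex

# Illustrative configuration using fields documented in this README.
config = {
    "version": "0.5.0",
    "process": {"args": ["busybox", "sh"]},
}

# Build the argument vector; pass it to subprocess.run to actually launch ccon.
argv = ["ccon", "--config-string", json.dumps(config)]
print(shlex.join(argv))
```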
There are additional examples focusing on specific tasks in the
examples/ directory.
The ccon version represented in the config file.
- version (required, SemVer 2.0.0 string)
"version": "0.5.0"
A set of namespaces to be created or joined by the container process.
Keys match the long-form options from unshare and nsenter without
their leading hyphens. For each namespace entry, the presence of a
path key means the container process will join an existing namespace
at the absolute path specified by the path value. The absence of a
path key means a new namespace will be created. There may be
additional per-namespace configuration in the namespace object. If
there is no namespaces entry or its value is an empty object, the
container process will inherit all its namespaces from the host
process. Similarly, if a particular namespaces entry is missing
(e.g. user), the container process will inherit that namespace from
the host process.
- namespaces (optional, object) containing entries for each new or joined namespace.
"namespaces": {
"uts": {},
"net": {"path": "/proc/2186/ns/net"},
"user": {"setgroups": false}
}
Which will create new UTS and user namespaces, join the network
namespace at /proc/2186/ns/net, and disable setgroups in the new user
namespace.
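The create/join/inherit rule can be summarized in a short Python sketch. The function name and the returned strings are hypothetical; the list of namespace names is taken from the entries documented in this file.

```python
# Namespace keys documented in this README.
ALL_NAMESPACES = ["user", "mount", "pid", "net", "ipc", "uts", "cgroup"]


def namespace_plan(config):
    """Map each namespace name to 'inherit', 'create', or 'join <path>'."""
    namespaces = config.get("namespaces", {})
    plan = {}
    for name in ALL_NAMESPACES:
        if name not in namespaces:
            plan[name] = "inherit"  # missing entry: keep the host's namespace
        elif "path" in namespaces[name]:
            plan[name] = "join " + namespaces[name]["path"]
        else:
            plan[name] = "create"  # entry without a path: new namespace
    return plan


example = {
    "namespaces": {
        "uts": {},
        "net": {"path": "/proc/2186/ns/net"},
        "user": {"setgroups": False},
    }
}
print(namespace_plan(example))
```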
New user namespaces support the /proc/{pid}/{path} files setgroups,
uid_map, and gid_map discussed in user_namespaces(7).
- user (optional, object) which may contain:
  - path (optional, string) the absolute path to a user namespace which the container process should join.
  - setgroups (optional, boolean) whether to enable or disable setgroups. Implemented by writing to /proc/{pid}/setgroups.
  - uidMappings (optional, array of objects) maps user IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/uid_map. Array entries are objects with the following fields:
    - containerID (required, integer) is the start of the mapped UID range in the new namespace.
    - hostID (required, integer) is the start of the mapped UID range in the parent namespace.
    - size (required, integer) is the length of the range of mapped UIDs.
  - gidMappings (optional, array of objects) maps group IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/gid_map. Array entries are objects with the following fields:
    - containerID (required, integer) is the start of the mapped GID range in the new namespace.
    - hostID (required, integer) is the start of the mapped GID range in the parent namespace.
    - size (required, integer) is the length of the range of mapped GIDs.
Debian disables unprivileged user namespaces by default to reduce the risk of exploits based on kernel bugs. If you are comfortable assuming those risks, you can enable it with:
# sysctl kernel.unprivileged_userns_clone=1
"user": {
"setgroups": false,
"uidMappings": [
{
"containerID": 0,
"hostID": 1000,
"size": 1
}
],
"gidMappings": [
{
"containerID": 0,
"hostID": 1000,
"size": 1
}
]
},
Which will disable setgroups and map the host user and group 1000 to
the container user and group 0.
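For reference, user_namespaces(7) describes each uid_map or gid_map line as "ID-inside-ns ID-outside-ns length", and the setgroups file as holding "allow" or "deny". The Python sketch below renders the text that would be written for a given configuration; the helper names are hypothetical and the sketch does not write to /proc.

```python
def render_id_map(mappings):
    """Render uidMappings or gidMappings as uid_map/gid_map file content."""
    return "".join(
        f"{m['containerID']} {m['hostID']} {m['size']}\n" for m in mappings
    )


def render_setgroups(enabled):
    """Render the value for /proc/{pid}/setgroups."""
    return "allow" if enabled else "deny"


user = {
    "setgroups": False,
    "uidMappings": [{"containerID": 0, "hostID": 1000, "size": 1}],
    "gidMappings": [{"containerID": 0, "hostID": 1000, "size": 1}],
}
print(render_setgroups(user["setgroups"]))
print(render_id_map(user["uidMappings"]), end="")
```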
New mount namespaces support the creation of arbitrary mounts, assuming the caller has sufficient privileges for the underlying syscall. The user namespace documentation outlines the mount permissions for processes inside a user namespace.
- mount (optional, object) which may contain:
  - path (optional, string) the absolute path to a mount namespace which the container process should join.
  - mounts (optional, array) an ordered list of mounts to perform. Array entries are objects with fields based on the mount call:
    - type (string) of mount (see filesystems(5)).
    - source (string) path of mount. This may be optional or required depending on type.
    - target (string, required) path of the mount being created or manipulated.
    - flags (array of strings, optional) MS_* flags to set.
    - data (string, optional) type-specific data for the mount.
If they don't start with a slash, source and target are interpreted as
paths relative to ccon's current working directory.
If target does not exist, ccon will attempt to create it by calling
mkdir, making multiple calls if necessary. For bind mounts where
source is set to a non-directory and target does not exist, ccon will
create an empty file at target to mount over.
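The path handling described above can be sketched in Python. The helper names are hypothetical, and creating missing parent directories for a file target is an assumption of this sketch rather than documented ccon behavior.

```python
import os
import tempfile


def resolve(path, cwd):
    """Interpret paths without a leading slash relative to the working directory."""
    return path if path.startswith("/") else os.path.join(cwd, path)


def ensure_target(target, source_is_directory=True):
    """Create a missing mount target: a directory tree, or an empty file."""
    if os.path.lexists(target):
        return
    if source_is_directory:
        os.makedirs(target)  # like repeated mkdir calls
    else:
        # Assumption: parent directories are created before the empty file.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "x").close()


with tempfile.TemporaryDirectory() as cwd:
    directory_target = resolve("rootfs/root", cwd)
    file_target = resolve("rootfs/etc/resolv.conf", cwd)
    ensure_target(directory_target)
    ensure_target(file_target, source_is_directory=False)
    created = (os.path.isdir(directory_target), os.path.isfile(file_target))
```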
In addition to the usual types supported by mount, ccon supports a
pivot-root type that invokes the pivot_root syscall, shifting the old
root to a temporary directory (after which it is unmounted and the
temporary directory is removed). In that case, the only other field
that matters is source, which specifies the path to the new root.
"mount": {
"mounts": [
{
"source": "rootfs",
"target": "rootfs",
"flags": [
"MS_BIND"
]
},
{
"source": "/etc/resolv.conf",
"target": "rootfs/etc/resolv.conf",
"flags": [
"MS_BIND"
]
},
{
"source": "root",
"target": "rootfs/root",
"flags": [
"MS_BIND"
]
},
{
"source": "rootfs",
"type": "pivot-root"
}
]
}
Which will bind ${PWD}/rootfs to itself (the “trick” mentioned in
switch_root(8) which we need for the later pivot), bind the host's
resolv.conf onto ${PWD}/rootfs/etc/resolv.conf, bind ${PWD}/root onto
${PWD}/rootfs/root, and pivot to make ${PWD}/rootfs the container
root.
There is no special configuration for the PID namespace, although if you are creating both a PID and a mount namespace, you probably want mount entries along the lines of:
{
"target": "/proc",
"flags": [
"MS_PRIVATE",
"MS_REC"
]
},
{
"target": "/proc",
"type": "proc",
"flags": [
"MS_NOSUID",
"MS_NOEXEC",
"MS_NODEV"
]
}
For more details, see the “/proc and PID namespaces” section of pid_namespaces(7).
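The flags arrays in mount entries name MS_* constants that are combined into the single bitmask passed to mount(2). A minimal Python sketch of that translation follows; the helper name is hypothetical, and the table covers only a subset of flags, with values from the Linux headers.

```python
# Subset of the MS_* values from the Linux <sys/mount.h> header.
MS_FLAGS = {
    "MS_RDONLY": 1,
    "MS_NOSUID": 2,
    "MS_NODEV": 4,
    "MS_NOEXEC": 8,
    "MS_BIND": 4096,
    "MS_REC": 16384,
    "MS_PRIVATE": 1 << 18,
}


def mount_flags(names):
    """Combine a list of MS_* names into one bitmask for mount(2)."""
    mask = 0
    for name in names:
        mask |= MS_FLAGS[name]  # raises KeyError for names not in the table
    return mask


print(hex(mount_flags(["MS_PRIVATE", "MS_REC"])))
print(hex(mount_flags(["MS_NOSUID", "MS_NOEXEC", "MS_NODEV"])))
```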
- pid (optional, object) which may contain:
  - path (optional, string) the absolute path to a PID namespace which the container process should join.
There is no special configuration for the network namespace.
- net (optional, object) which may contain:
  - path (optional, string) the absolute path to a network namespace which the container process should join.
There is no special configuration for the IPC namespace.
- ipc (optional, object) which may contain:
  - path (optional, string) the absolute path to an IPC namespace which the container process should join.
There is no special configuration for the UTS namespace, although
future work might build in support for sethostname.
- uts (optional, object) which may contain:
  - path (optional, string) the absolute path to a UTS namespace which the container process should join.
There is no special configuration for the cgroup namespace.
- cgroup (optional, object) which may contain:
  - path (optional, string) the absolute path to a cgroup namespace which the container process should join.
- console (optional, boolean) if true, the container process will open its local /dev/ptmx (e.g. with posix_openpt), grant access to the slave with grantpt, bind mount the pseudoterminal slave to /dev/console, and send both the pseudoterminal master and slave back to the host process. The host process will continually copy its standard input to that pseudoterminal master and the pseudoterminal master to its standard output. If process.terminal is also true, the same pseudoterminal will be used for both /dev/console and the container process's standard streams.
Some applications (including systemd) require a TTY at /dev/console.
This setting allows you to provide that console without dup-ing over
the container process's standard streams.
For more details on why using the container's /dev/ptmx is important,
see the process.terminal documentation.
After the container setup is finished, the container process can
optionally adjust its state and execute the configured code. If
process isn't specified, the container process will exit (with an exit
code of zero) instead of executing a user process (which can be useful
for the creation phase of a workflow that separates creation from
execution).
- process (optional, object) configuring the container process after the container is set up.
"process": {
"args": ["busybox", "sh"]
}
Which will execvpe a BusyBox shell with the host process's user and
group (possibly mapped by the user namespace), working directory, and
environment.
If you launch ccon from a terminal (e.g. tty or test -t 0 returns
zero), your standard input is already a terminal and you probably
don't need to worry about this setting. If you launch ccon from a
non-terminal process (e.g. from a webserver that is communicating with
the user over a socket), you may want to create a UNIX 98
pseudoterminal to do things like translate the user's control-C into
SIGINT for the container.
Containers that do not pivot root, or that otherwise keep access to
the host ptmx, can create such a pseudoterminal by opening the ptmx
(e.g. with posix_openpt).
Containers that are pivoting to a new root and mounting their devpts with newinstance will want to ensure that the pseudoterminal is created using a devpts instance that will be accessible after the pivot, and there are a number of issues to consider.
- terminal (optional, boolean) if true, the process will open its local /dev/ptmx (e.g. with posix_openpt), grant access to the slave with grantpt, dup the pseudoterminal slave over its standard streams, and send the pseudoterminal master back to the host process. The host process will continually copy its standard input to that pseudoterminal master and the pseudoterminal master to its standard output. If console is also true, the same pseudoterminal will be used for both /dev/console and the container process's standard streams.
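To illustrate the master/slave relationship, here is a minimal Python sketch that uses os.openpty, which wraps the open, grant, and unlock steps in one call. It is an illustration of how bytes written to the master arrive as terminal input on the slave, not ccon's C implementation, and it needs a usable devpts on the machine where it runs.

```python
import os

# Open a pseudoterminal pair; the slave end is what a container process
# would dup over its standard streams.
master_fd, slave_fd = os.openpty()
slave_is_tty = os.isatty(slave_fd)

# The host side copies its standard input to the master. With the default
# canonical mode, a full line becomes readable on the slave.
os.write(master_fd, b"echo hi\n")
line = os.read(slave_fd, 1024)

os.close(slave_fd)
os.close(master_fd)
print(slave_is_tty, line)
```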
Before commit 77356912 (included in version 2.23, released
2016-02-19), glibc's grantpt was more aggressive about changing the
pseudoterminal slave's group, which could fail for unprivileged users.
Unprivileged users linking older versions of glibc can work around the
old behavior by ensuring tty is not defined in the /etc/group visible
from the container's mount namespace.
"args": ["sh"],
"terminal": true
Adjust the user and group IDs before executing the user-specified code.
- uid (optional, integer) to setuid to a different user.
- gid (optional, integer) to setgid to a different group.
- additionalGids (optional, array of integers) for setgroups. See also namespaces.user.setgroups.
"user": {
"uid": 0,
"gid": 0,
"additionalGids": [5, 6]
}
Which will lead to a container process with id output like:
uid=0(root) gid=0(root) groups=0(root),5(tty),6(disk)
Change to a different directory before executing the configured code.
- cwd (optional, string) to chdir to a different directory. If unset, the current directory will remain the same as the caller's working directory, unless there is a pivot-root entry in namespaces.mount.mounts, in which case the default working directory will be the new root.
"cwd": "/root"
Define the minimum set of capabilities required for the container process. All other capabilities are dropped from all capability sets, including the bounding set, before executing the configured code.
- capabilities (optional, array of strings) Set of CAP_* flags to set.
If unset, the container process will continue with the caller's capabilities (potentially increased in a child user namespace).
"capabilities": [
"CAP_NET_BIND_SERVICE",
"CAP_NET_RAW"
]
The command that the container process executes after container setup
is complete. The process will inherit any open file descriptors; for
example the standard streams (unless terminal is true) or systemd's
SD_LISTEN_FDS_START.
- args (optional, array of strings) holds command-line arguments passed to execvpe. The first argument (args[0]) is also used as the path, unless path is set.
If unset, the container process will exit with status zero instead of executing new code (see Process).
"args": [
"nginx",
"-c",
"/nginx.conf"
]
Which will execute an Nginx server using the configuration in /nginx.conf.
Override args[0] with an alternate path (but the executed code will
still see args[0] as its first argument).
- path (optional, string) sets the path to the executed command. Paths without slashes will be resolved using the PATH environment variable.
"args": ["sh"],
"path": "busybox"
Which will execute the first busybox executable found in your PATH
with its argv[0] set to sh.
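The relationship between args and path can be expressed as a small Python sketch. The helper name exec_plan is hypothetical; it only computes the file to look up and the argument vector, which could then be handed to os.execvpe.

```python
def exec_plan(process):
    """Return (file, argv) for an execvpe-style call from a process object."""
    argv = process["args"]
    # path overrides the lookup name, but argv[0] is still args[0].
    file = process.get("path", argv[0])
    return file, argv


print(exec_plan({"args": ["sh"], "path": "busybox"}))
print(exec_plan({"args": ["nginx", "-c", "/nginx.conf"]}))
```

A caller that really wants to replace the current process could follow this with os.execvpe(file, argv, env).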
Instead of looking up args[0] (or path) in the container mount
namespace, look it up in the host mount namespace using the host PATH.
This allows you to launch (via execveat, so you need Linux 3.19+) a
statically-linked init process that only exists on the host.
"args": ["sh"],
"path": "busybox",
"host": true
Which will execute the first busybox executable found in the host's
PATH with its argv[0] set to sh.
Override the host environment.
- env (optional, array of strings) holds environment settings for execvpe.
If unset, the container process will use the environ it inherited from the host.
"env": [
"PATH=/bin:/usr/bin",
"TERM=xterm"
]
Which will set PATH and TERM.
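The override semantics (a configured env replaces the inherited environment rather than merging with it) can be sketched in Python. The helper name and the inherited parameter are hypothetical.

```python
import os


def environment_for(process, inherited=None):
    """Return the environment mapping for the executed command."""
    if inherited is None:
        inherited = dict(os.environ)
    if "env" not in process:
        return dict(inherited)  # unset: keep the inherited environment
    # Each entry is NAME=VALUE; split only on the first '='.
    return dict(entry.split("=", 1) for entry in process["env"])


print(environment_for({"env": ["PATH=/bin:/usr/bin", "TERM=xterm"]}))
```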
Not all container-related functionality is built into ccon (the only
setup handled by the host process is the /proc/{pid}/setgroups, etc.,
writes for user namespaces). For example, control group manipulation
and veth network configuration should be handled with external tools.
What ccon provides are hooks so you can call those external tools at
the appropriate point in the lifecycle.
- hooks (optional, object) configuring the hooks run for each hook-triggering event.
"hooks": {
"post-create": [
{
"args": [
"echo",
"I'm a post-create hook"
]
}
],
"post-stop": [
{
"args": [
"echo",
"I'm a post-stop hook"
]
}
]
}
Which will just print messages to the host process's stdout for each hook-triggering event.
Hooks run after the container setup is complete but before the
configured process is executed. With --socket=PATH these are run just
before the socket path is created. This is useful for additional
container configuration (e.g. creating cgroups or performing network
setup).
- post-create (optional, array of objects) holds process objects (like process except for stdin handling and the lack of host) to run after the post-create event.
Each hook receives the container process's PID in the host PID
namespace on its stdin. Its stdout and stderr are inherited from the
host process (unless terminal is true). The hooks are executed in the
listed order, the host process waits until each hook exits before
executing the next, and a nonzero exit code from any hook will cause
the host process to abandon further hook execution and SIGKILL the
container process. The host process then resumes the usual lifecycle
at “waits on child death”.
"post-create": [
{
"args": [
"mkdir",
"-p",
"/sys/fs/cgroup/unified/nginx-0/container"
]
},
{
"args": [
"tee",
"/sys/fs/cgroup/unified/nginx-0/container/cgroup.procs"
]
}
]
Which will create new nginx-0 and nginx-0/container cgroups in the
unified hierarchy (if they don't already exist) and add the container
process to that cgroup.
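The post-create semantics (PID on stdin, sequential execution, stop at the first failure) can be sketched in Python. The function name is hypothetical, and writing the PID as decimal text followed by a newline is an assumption of this sketch; the exact stdin format is not stated here.

```python
import subprocess
import sys


def run_post_create_hooks(hooks, container_pid):
    """Run hooks in order; return (all_succeeded, number_of_hooks_run)."""
    ran = 0
    for hook in hooks:
        # Assumption: the PID is sent as decimal text plus a newline.
        result = subprocess.run(hook["args"], input=f"{container_pid}\n".encode())
        ran += 1
        if result.returncode != 0:
            # A real host process would now SIGKILL the container process.
            return False, ran
    return True, ran


# Hypothetical hooks built from the running Python interpreter.
check_pid = {"args": [sys.executable, "-c",
                      "import sys; sys.exit(0 if sys.stdin.read().strip() == '2186' else 1)"]}
failing = {"args": [sys.executable, "-c", "import sys; sys.exit(3)"]}
print(run_post_create_hooks([check_pid, failing, check_pid], 2186))
```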
Hooks run after the host process has reaped the container process. You could handle this in the shell with:
$ ccon; post_stop_hook_1; post_stop_hook_2
but the most common use will be cleaning up after post-create hooks, and it's nice to configure both in the same place (the ccon config file).
- post-stop (optional, array of objects) holds process objects (like process except for the lack of host) to run after the post-stop event.
Each hook's standard streams are inherited from the host process
(unless terminal is true). The hooks are executed in the listed order,
the host process waits until each hook exits before executing the
next, and a nonzero exit code from any hook will cause the host
process to print a message to stderr, after which it continues as if
the hook had exited with zero.
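Unlike post-create hooks, a failing post-stop hook does not stop the sequence. A minimal Python sketch of that behavior, with a hypothetical function name and message text:

```python
import subprocess
import sys


def run_post_stop_hooks(hooks):
    """Run every hook in order; return (number_run, number_failed)."""
    ran = failed = 0
    for hook in hooks:
        result = subprocess.run(hook["args"])
        ran += 1
        if result.returncode != 0:
            # Report the failure, then carry on as if the hook had succeeded.
            print(f"post-stop hook {hook['args']!r} exited with "
                  f"{result.returncode}", file=sys.stderr)
            failed += 1
    return ran, failed


# Hypothetical hooks built from the running Python interpreter.
ok_hook = {"args": [sys.executable, "-c", "pass"]}
bad_hook = {"args": [sys.executable, "-c", "import sys; sys.exit(2)"]}
print(run_post_stop_hooks([bad_hook, ok_hook]))
```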
"post-stop": [
{
"args": [
"rmdir",
"/sys/fs/cgroup/unified/nginx-0/container"
]
},
{
"args": [
"rmdir",
"/sys/fs/cgroup/unified/nginx-0"
]
}
]
Which will remove the nginx-0/container and nginx-0 cgroups (such as
those created by the post-create example). This will only succeed if
the cgroups are empty, so if you were using this in production it
would be best to:
- Ensure there were no other processes in those cgroups (e.g. by creating a new PID namespace and adding all additional processes to that namespace before adding them to the nginx-0 cgroup tree)
- Use a tool like cgdelete to recursively remove nginx-0, which would also remove additional child cgroups beyond nginx-0/container that may have been added by other processes since nginx-0 was created.
- Linux headers for 3.19+ for execveat (sys-kernel/linux-headers on Gentoo).
- The GNU C Library (sys-libs/glibc on Gentoo).
- Jansson for JSON parsing (dev-libs/jansson on Gentoo).
- libcap-ng for adjusting capabilities (sys-libs/libcap-ng on Gentoo).
Ccon is pretty easy to compile, but to use the stock Makefile, you'll need:
- A C compiler like GCC (sys-devel/gcc on Gentoo).
- GNU Make (sys-devel/make on Gentoo).
- pkg-config (dev-util/pkgconfig on Gentoo).
- indent (dev-util/indent on Gentoo). Invoke with make fmt.
- Ccon is under the GPLv3+.
- Glibc is under the LGPL-2.1+.
- Jansson is under the MIT license.
- libcap-ng is under the LGPL-2.1+.
Because all the dependencies are GPL-compatible, ccon binaries can be distributed under the GPLv3+.