rkt
is designed to cooperate with init systems, like systemd
. rkt implements a simple CLI that directly executes processes, and does not interpose a long-running daemon, so the lifecycle of rkt pods can be directly managed by systemd. Standard systemd idioms like systemctl start
and systemctl stop
work out of the box.
In the shell excerpts below, a #
prompt indicates commands that require root privileges, while the $
prompt denotes commands issued as an unprivileged user.
The systemd-run
utility is a convenient shortcut for testing a service before making it permanent in a unit file. To start a "daemonized" container that forks the container processes into the background, wrap the invocation of rkt
with systemd-run
:
# systemd-run --slice=machine rkt run coreos.com/etcd:v2.2.5
Running as unit run-29486.service.
The --slice=machine
option to systemd-run
places the service in machine.slice
rather than the host's system.slice
, isolating containers in their own cgroup area.
Invoking a rkt container through systemd-run in this way creates a transient service unit that can be managed with the usual systemd tools:
$ systemctl status run-29486.service
● run-29486.service - /bin/rkt run coreos.com/etcd:v2.2.5
Loaded: loaded (/run/systemd/system/run-29486.service; static; vendor preset: disabled)
Drop-In: /run/systemd/system/run-29486.service.d
└─50-Description.conf, 50-ExecStart.conf, 50-Slice.conf
Active: active (running) since Wed 2016-02-24 12:50:20 CET; 27s ago
Main PID: 29487 (ld-linux-x86-64)
Memory: 36.1M
CPU: 1.467s
CGroup: /machine.slice/run-29486.service
├─29487 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn --boot -Zsystem_u:system_r:svirt_lxc_net_t:s0:c46...
├─29535 /usr/lib/systemd/systemd --default-standard-output=tty --log-target=null --log-level=warning --show-status=0
└─system.slice
├─etcd.service
│ └─29544 /etcd
└─systemd-journald.service
└─29539 /usr/lib/systemd/systemd-journald
Since every pod is registered with machined
with a machine name of the form rkt-$UUID
, the systemd tools can inspect pod logs, or stop and restart pod "machines". Use the machinectl
tool to print the list of rkt pods:
$ machinectl list
MACHINE CLASS SERVICE
rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 container nspawn
1 machines listed.
Given the name of this rkt machine, journalctl
can inspect its logs, or machinectl
can shut it down:
# journalctl -M rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23
...
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.518030 I | raft: ce2a822cea30bfca received vote from ce2a822cea30bfca at term 2
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.518073 I | raft: ce2a822cea30bfca became leader at term 2
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.518086 I | raft: raft.node: ce2a822cea30bfca elected leader ce2a822cea30bfca at te
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.518720 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2379 h
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.518955 I | etcdserver: setting up the initial cluster version to 2.2
Feb 24 12:50:22 rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23 etcd[4]: 2016-02-24 11:50:22.521680 N | etcdserver: set the initial cluster version to 2.2
# machinectl poweroff rkt-2b0b2cec-8f63-4451-9431-9f8e9b265a23
$ machinectl list
MACHINE CLASS SERVICE
0 machines listed.
Note that, for the "coreos" and "kvm" stage1 flavors, journald integration is only supported if systemd is compiled with lz4
compression enabled. To inspect this, use systemctl
:
$ systemctl --version
systemd 235
[...] +LZ4 [...]
If the output contains -LZ4
, journal entries will not be available.
Sometimes, defining dependencies between containers makes sense. An example use case of this is a container running a server, and another container running a client. We want the server to start before the client tries to connect.
This can be accomplished by using systemd services and dependencies. However, for this to work in rkt containers, we need special support.
systemd inside stage1 can notify systemd on the host that it is ready, to make sure that stage1 systemd send the notification at the right time you can use the sd_notify mechanism.
To make use of this feature, you need to set the annotation appc.io/executor/supports-systemd-notify
to true in the image manifest whenever the app supports sd_notify (see example manifest below).
If you build your image with acbuild
you can use the command: acbuild annotation add appc.io/executor/supports-systemd-notify true
.
{
"acKind": "ImageManifest",
"acVersion": "0.8.4",
"name": "coreos.com/etcd",
...
"app": {
"exec": [
"/etcd"
],
...
},
"annotations": [
"name": "appc.io/executor/supports-systemd-notify",
"value": "true"
]
}
This feature is always available when using the "coreos" stage1 flavor.
If you use the "host" stage1 flavor (e.g. Fedora RPM or Debian deb package), you will need systemd >= v231.
To verify how it works, run in a terminal the command: sudo systemd-run --unit=test --service-type=notify rkt run --insecure-options=image /path/to/your/app/image
, then periodically check the status with systemctl status test
.
If the pod uses a stage1 image with systemd v231 (or greater), then the pod will be seen active form the host when systemd inside stage1 will reach default target. Instead, before it was marked as active as soon as it started. In this way it is possible to easily set up dependencies between pods and host services.
Moreover, using SdNotify()
in the application it is possible to make the pod marked as ready when all the apps or a particular one is ready.
For more information check systemd services unit documentation.
This is how the sd_notify signal is propagated to the host system:
Below there is a simple example of an app using the systemd notification mechanism via go-systemd binding library.
package main
import (
"log"
"net"
"net/http"
"github.com/coreos/go-systemd/daemon"
)
func main() {
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
log.Printf("request from %v\n", r.RemoteAddr)
w.Write([]byte("hello\n"))
})
ln, err := net.Listen("tcp", ":5000")
if err != nil {
log.Fatalf("Listen failed: %s", err)
}
sent, err := daemon.SdNotify(true, "READY=1")
if err != nil {
log.Fatalf("Notification failed: %s", err)
}
if !sent {
log.Fatalf("Notification not supported: %s", err)
}
log.Fatal(http.Serve(ln, nil))
}
You can run an app that supports sd\_notify()
with this command:
# systemd-run --slice=machine --service-type=notify rkt run coreos.com/etcd:v2.2.5
Running as unit run-29486.service.
The following is a simple example of a unit file using rkt
to run an etcd
instance under systemd service management:
[Unit]
Description=etcd
[Service]
Slice=machine.slice
ExecStart=/usr/bin/rkt run coreos.com/etcd:v2.2.5
KillMode=mixed
Restart=always
This unit can now be managed using the standard systemctl
commands:
# systemctl start etcd.service
# systemctl stop etcd.service
# systemctl restart etcd.service
# systemctl enable etcd.service
# systemctl disable etcd.service
Note that no ExecStop
clause is required. Setting KillMode=mixed
means that running systemctl stop etcd.service
will send SIGTERM
to stage1
's systemd
, which in turn will initiate orderly shutdown inside the pod. Systemd is additionally able to send the cleanup SIGKILL
to any lingering service processes, after a timeout. This comprises complete pod lifecycle management with familiar, well-known system init tools.
A more advanced unit example takes advantage of a few convenient systemd
features:
- Inheriting environment variables specified in the unit with
--inherit-env
. This feature helps keep units concise, instead of layering on many flags torkt run
. - Using the dependency graph to start our pod after networking has come online. This is helpful if your application requires outside connectivity to fetch remote configuration (for example, from
etcd
). - Set resource limits for this
rkt
pod. This can also be done in the unit file, rather than flagged torkt run
. - Set
ExecStopPost
to invokerkt gc --mark-only
to record the timestamp when the pod exits. (Runrkt gc --help
to see more details about this flag). After runningrkt gc --mark-only
, the timestamp can be retrieved from rkt API service in pod'sgc_marked_at
field. The timestamp can be treated as the finished time of the pod.
Here is what it looks like all together:
[Unit]
# Metadata
Description=MyApp
Documentation=https://myapp.com/docs/1.3.4
# Wait for networking
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
# Resource limits
Delegate=true
CPUShares=512
MemoryLimit=1G
# Env vars
Environment=HTTP_PROXY=192.0.2.3:5000
Environment=STORAGE_PATH=/opt/myapp
Environment=TMPDIR=/var/tmp
# Fetch the app (not strictly required, `rkt run` will fetch the image if there is not one)
ExecStartPre=/usr/bin/rkt fetch myapp.com/myapp-1.3.4
# Start the app
ExecStart=/usr/bin/rkt run --inherit-env --port=http:8888 myapp.com/myapp-1.3.4
ExecStopPost=/usr/bin/rkt gc --mark-only
KillMode=mixed
Restart=always
rkt must be the main process of the service in order to support isolators correctly and to be well-integrated with systemd-machined. To ensure that rkt is the main process of the service, the pattern /bin/sh -c "foo ; rkt run ..."
should be avoided, because in that case the main process is sh
.
In most cases, the parameters Environment=
and ExecStartPre=
can simply be used instead of starting a shell. If shell invocation is unavoidable, use exec
to ensure rkt replaces the preceding shell process:
ExecStart=/bin/sh -c "foo ; exec rkt run ..."
rkt
inherits resource limits configured in the systemd service unit file. The systemd documentation explains various execution environment, and resource control settings to restrict the CPU, IO, and memory resources.
For example to restrict the CPU time quota, configure the corresponding CPUQuota setting:
[Service]
ExecStart=/usr/bin/rkt run s-urbaniak.github.io/images/stress:0.0.1
CPUQuota=30%
$ ps -p <PID> -o %cpu%
CPU
30.0
Moreover to pin the rkt pod to certain CPUs, configure the corresponding CPUAffinity setting:
[Service]
ExecStart=/usr/bin/rkt run s-urbaniak.github.io/images/stress:0.0.1
CPUAffinity=0,3
$ top
Tasks: 235 total, 1 running, 234 sleeping, 0 stopped, 0 zombie
%Cpu0 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||
%Cpu1 : 6.0/0.7 7[|||
%Cpu2 : 0.7/0.0 1[
%Cpu3 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||
GiB Mem : 25.7/19.484 [
GiB Swap: 0.0/8.000 [
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
11684 root 20 0 3.6m 1.1m 200.0 0.0 8:58.63 S stress
rkt
supports socket-activated services. This means systemd will listen on a port on behalf of a container, and start the container when receiving a connection. An application needs to be able to accept sockets from systemd's native socket passing interface in order to handle socket activation.
To make socket activation work, add a socket-activated port to the app container manifest:
...
{
...
"app": {
...
"ports": [
{
"name": "80-tcp",
"protocol": "tcp",
"port": 80,
"count": 1,
"socketActivated": true
}
]
}
}
Then you will need a pair of .service
and .socket
unit files.
In this example, we want to use the port 8080 on the host instead of the app's default 80, so we use rkt's --port
option to override it.
# my-socket-activated-app.socket
[Unit]
Description=My socket-activated app's socket
[Socket]
ListenStream=8080
# my-socket-activated-app.service
[Unit]
Description=My socket-activated app
[Service]
ExecStart=/usr/bin/rkt run --port 80-tcp:8080 myapp.com/my-socket-activated-app:v1.0
KillMode=mixed
Finally, start the socket unit:
# systemctl start my-socket-activated-app.socket
$ systemctl status my-socket-activated-app.socket
● my-socket-activated-app.socket - My socket-activated app's socket
Loaded: loaded (/etc/systemd/system/my-socket-activated-app.socket; static; vendor preset: disabled)
Active: active (listening) since Thu 2015-07-30 12:24:50 CEST; 2s ago
Listen: [::]:8080 (Stream)
Jul 30 12:24:50 locke-work systemd[1]: Listening on My socket-activated app's socket.
Now, a new connection to port 8080 will start your container to handle the request.
rkt
also supports the socket-proxyd service. Much like socket activation, with socket-proxyd systemd provides a listener on a given port on behalf of a container, and starts the container when a connection is received. Socket-proxy listening can be useful in environments that lack native support for socket activation. The LKVM stage1 flavor is an example of such an environment.
To set up socket proxyd, create a network template consisting of three units, like the example below. This example uses the redis app and the PTP network template in /etc/rkt/net.d/ptp0.conf
:
{
"name": "ptp0",
"type": "ptp",
"ipMasq": true,
"ipam": {
"type": "host-local",
"subnet": "172.16.28.0/24",
"routes": [
{ "dst": "0.0.0.0/0" }
]
}
}
# rkt-redis.service
[Unit]
Description=Socket-proxyd redis server
[Service]
ExecStart=/usr/bin/rkt --insecure-options=image run --net="ptp:IP=172.16.28.101" docker://redis
KillMode=process
Note that you have to specify IP manually in systemd unit.
Then you will need a pair of .service
and .socket
unit files.
We want to use the port 6379 on the localhost instead of the remote container IP, so we use next systemd unit to override it.
# proxy-to-rkt-redis.service
[Unit]
Requires=rkt-redis.service
After=rkt-redis.service
[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd 172.16.28.101:6379
Lastly the related socket unit,
# proxy-to-rkt-redis.socket
[Socket]
ListenStream=6371
[Install]
WantedBy=sockets.target
Finally, start the socket unit:
# systemctl enable proxy-to-redis.socket
$ sudo systemctl start proxy-to-redis.socket
● proxy-to-rkt-redis.socket
Loaded: loaded (/etc/systemd/system/proxy-to-rkt-redis.socket; enabled; vendor preset: disabled)
Active: active (listening) since Mon 2016-03-07 11:53:32 CET; 8s ago
Listen: [::]:6371 (Stream)
Mar 07 11:53:32 user-host systemd[1]: Listening on proxy-to-rkt-redis.socket.
Mar 07 11:53:32 user-host systemd[1]: Starting proxy-to-rkt-redis.socket.
Now, a new connection to localhost port 6371 will start your container with redis, to handle the request.
$ curl http://localhost:6371/
Let us assume the service from the simple example unit file, above, is started on the host.
The snippet below taken from output of ps auxf
shows several things:
rkt
exec
s stage1'ssystemd-nspawn
instead of usingfork-exec
technique. That is why rkt itself is not listed byps
.systemd-nspawn
runs a typical boot sequence - it spawnssystemd
inside the container, which in turn spawns our desired service(s).- There can be also other services running, which may be
systemd
-specific, likesystemd-journald
.
$ ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7258 0.2 0.0 19680 2664 ? Ss 12:38 0:02 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn --boot --register=true --link-journal=try-guest --quiet --keep-unit --uuid=6d0d9608-a744-4333-be21-942145a97a5a --machine=rkt-6d0d9608-a744-4333-be21-942145a97a5a --directory=stage1/rootfs -- --default-standard-output=tty --log-target=null --log-level=warning --show-status=0
root 7275 0.0 0.0 27348 4316 ? Ss 12:38 0:00 \_ /usr/lib/systemd/systemd --default-standard-output=tty --log-target=null --log-level=warning --show-status=0
root 7277 0.0 0.0 23832 6100 ? Ss 12:38 0:00 \_ /usr/lib/systemd/systemd-journald
root 7343 0.3 0.0 10652 7332 ? Ssl 12:38 0:04 \_ /etcd
The systemd-cgls
command prints the list of cgroups active on the system. The inner system.slice
shown in the excerpt below is a cgroup in rkt's stage1
, below which an in-container systemd has been started to shepherd pod apps with complete process lifecycle management:
$ systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─machine.slice
│ └─etcd.service
│ ├─1204 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/s...
│ ├─1421 /usr/lib/systemd/systemd --default-standard-output=tty --log-targe...
│ └─system.slice
│ ├─etcd.service
│ │ └─1436 /etcd
│ └─systemd-journald.service
│ └─1428 /usr/lib/systemd/systemd-journald
To display all active cgroups, use the --all
flag. This will show two cgroups for mount
in the host's system.slice
. One mount cgroup is for the stage1
root filesystem, the other for the stage2
root (the pod's filesystem). Inside the pod's system.slice
there are more mount
cgroups -- mostly for bind mounts of standard /dev
-tree device files.
$ systemd-cgls --all
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─machine.slice
│ └─etcd.service
│ ├─1204 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/s...
│ ├─1421 /usr/lib/systemd/systemd --default-standard-output=tty --log-targe...
│ └─system.slice
│ ├─proc-sys-kernel-random-boot_id.mount
│ ├─opt-stage2-etcd-rootfs-proc-kmsg.mount
│ ├─opt-stage2-etcd-rootfs-sys.mount
│ ├─opt-stage2-etcd-rootfs-dev-shm.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-perf_event.mount
│ ├─etcd.service
│ │ └─1436 /etcd
│ ├─opt-stage2-etcd-rootfs-proc-sys-kernel-random-boot_id.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-cpu\x2ccpuacct.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-devices.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-freezer.mount
│ ├─shutdown.service
│ ├─-.mount
│ ├─opt-stage2-etcd-rootfs-data\x2ddir.mount
│ ├─system-prepare\x2dapp.slice
│ ├─tmp.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-cpuset.mount
│ ├─opt-stage2-etcd-rootfs-proc.mount
│ ├─systemd-journald.service
│ │ └─1428 /usr/lib/systemd/systemd-journald
│ ├─opt-stage2-etcd-rootfs.mount
│ ├─opt-stage2-etcd-rootfs-dev-random.mount
│ ├─opt-stage2-etcd-rootfs-dev-pts.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup.mount
│ ├─run-systemd-nspawn-incoming.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-systemd-machine.slice-etcd.service.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-memory-machine.slice-etcd.service-system.slice-etcd.service-cgroup.procs.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-blkio.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-net_cls\x2cnet_prio.mount
│ ├─opt-stage2-etcd-rootfs-dev-net-tun.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-memory-machine.slice-etcd.service-system.slice-etcd.service-memory.limit_in_bytes.mount
│ ├─opt-stage2-etcd-rootfs-dev-tty.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-pids.mount
│ ├─reaper-etcd.service
│ ├─opt-stage2-etcd-rootfs-sys-fs-selinux.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-memory.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-cpu\x2ccpuacct-machine.slice-etcd.service-system.slice-etcd.service-cpu.cfs_quota_us.mount
│ ├─opt-stage2-etcd-rootfs-dev-urandom.mount
│ ├─opt-stage2-etcd-rootfs-dev-zero.mount
│ ├─opt-stage2-etcd-rootfs-dev-null.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-systemd.mount
│ ├─opt-stage2-etcd-rootfs-dev-console.mount
│ ├─opt-stage2-etcd-rootfs-dev-full.mount
│ ├─opt-stage2-etcd-rootfs-sys-fs-cgroup-cpu\x2ccpuacct-machine.slice-etcd.service-system.slice-etcd.service-cgroup.procs.mount
│ ├─opt-stage2-etcd-rootfs-proc-sys.mount
│ └─opt-stage2-etcd-rootfs-sys-fs-cgroup-hugetlb.mount