Container runtime solution trade-off
The purpose of this trade-off is to evaluate the best runtime to execute containers in the Kubernetes cluster.
Several components compose Kubernetes container management.
We will evaluate here container runtime interfaces (CRI) as well as the container runtime solutions.
Note: Kubernetes introduced a stable version of Runtime Class in v1.20 that allows a pod to select a particular container runtime.
To be considered, the container runtime interface solution must meet the following requirements:
- It has to implement the Kubernetes CRI
- It has to support OCI runtime-spec and OCI image-spec
- It has to be an active and alive project
- It must be mature enough to be used in a production environment
Four solutions will be evaluated here as container runtime interfaces: Containerd, CRI-O, Docker, and PouchContainer.
CRI | Community | Support |
---|---|---|
Containerd | 9.3k stars / 1.8k forks / 384 contributors | graduated from CNCF |
CRI-O | 3.6k stars / 677 forks / 194 contributors | CNCF incubating project |
Docker | 61.7k stars / 17.7k forks / 2131 contributors | Part of Moby project |
PouchContainer | 4.5k stars / 960 forks / 110 contributors | Alibaba |
Containerd
Pros:
- very mature: it comes from Docker itself and has graduated from the CNCF
- officially supported by Kubernetes
- default and officially supported by AKS, EKS, GKE, k3s
- present in most Kubernetes cluster installations
- fully supports OCI runtime-spec and OCI image-spec
- supports Windows Kubernetes nodes
- follows a plugin model
CRI-O
Pros:
- lightweight
- its releases follow Kubernetes releases
- dedicated to Kubernetes
- officially supported by Kubernetes
- shipped in OpenShift, supported by Prisma Cloud
- compliant with OCI runtimes and OCI images
- supports per-Pod custom configuration via annotations
- supports user namespaces
- high-performance mode
Cons:
- not widely officially supported yet
Docker
Pros:
- very mature: it existed before Kubernetes and has powered clusters since Kubernetes' first release
- officially supported by Kubernetes
- compliant with OCI runtimes and OCI images
- integrates containerd with all its features
Cons:
- deprecated by Kubernetes since v1.20
- provides lots of features unnecessary in a Kubernetes cluster
- adds an unnecessary layer between the runtime and the kubelet
PouchContainer
Pros:
- P2P image distribution
- compatible with old kernel versions
- compatible with OCI runtimes and OCI images
Cons:
- provides features unnecessary in a Kubernetes cluster
- not officially supported by Kubernetes
- not an active project: the last commit dates from September 2020
Containerd and CRI-O are the two solutions meeting the requirements stated above. They are both very mature and stable solutions for a production Kubernetes cluster. The container runtime interface choice doesn't improve or impact the business strategy related to the project.
About CRI-O supporting user namespaces
This namespace, introduced in Linux kernel 3.8, brings container security to another level. It makes the container believe it runs as a privileged user while remapping that user to a less-privileged one on the host.
Kubernetes has long-standing issues (#127, #2101) related to this feature, but nothing has landed upstream yet.
Currently, running containers with user namespaces brings significant challenges and complexity for stateful applications and for mounting shared filesystems. Several patches to the Linux kernel introduce idmapped mounts for FAT, ext4, and XFS (v5.12) and for btrfs (v5.15), but overlayfs is not supported yet. There is also work in progress in containerd to support idmapped mounts (#5888).
To conclude on this feature: it is a neat security improvement for Pod-to-Pod and Pod-to-node isolation in Kubernetes, but it is still a work in progress across the Linux kernel, Kubernetes, CRI-O, and containerd communities.
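As an illustration, CRI-O exposes this per Pod through its `io.kubernetes.cri-o.userns-mode` annotation. A hedged sketch (the Pod, image, and namespace size are hypothetical, and CRI-O only honors the annotation for runtime handlers that explicitly allow it in its configuration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo                          # hypothetical Pod name
  annotations:
    # Ask CRI-O to allocate a user namespace for this Pod automatically,
    # remapping 65536 host UIDs/GIDs into it
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```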
About the CRI-O high-performance mode
This feature allows the admin to disable CPU load balancing and the CFS quota for latency-sensitive workloads. It does not reflect our needs.
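For completeness, a hedged sketch of how a latency-sensitive Pod would opt in, using CRI-O's documented `cpu-load-balancing.crio.io` and `cpu-quota.crio.io` annotations (the Pod itself is hypothetical, and CRI-O only honors these annotations for runtime handlers that explicitly allow them):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive-demo              # hypothetical Pod name
  annotations:
    cpu-load-balancing.crio.io: "disable"   # disable kernel CPU load balancing for this Pod
    cpu-quota.crio.io: "disable"            # disable the CFS quota for this Pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"                            # equal request/limit: Guaranteed QoS, needed for CPU pinning
```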
Containerd's features are sufficient for this project. Moreover, it is the safest choice given its broad adoption. Thus we decided to go with containerd.
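For reference, a minimal containerd CRI configuration declaring runc as the default runtime looks roughly like this. This is a sketch of an `/etc/containerd/config.toml` excerpt; the section names follow the containerd 1.5.x config format and should be checked against the deployed version:

```toml
# /etc/containerd/config.toml (excerpt, config format version 2)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true   # use the systemd cgroup driver, matching the kubelet's setting
```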
- Who's Running My Pods? A Deep Dive into the K8s Container Runtime Interface by Phil Estes, November 2018
- Kubernetes deprecates dockershim, December 2020
- Improving Kubernetes and container security with user namespaces by Alban Crequy, December 2020
- Introduction and Deep Dive Into Containerd by Kohei Tokunaga & Akihiro Suda, May 2021
- CRI-O: The Runtime Control Room by Sascha Grunert, SUSE, Peter Hunt, Urvashi Mohnani, Mrunal Patel, December 2020
To be considered, the OCI runtime must meet the following requirements:
- It has to be compliant with the OCI runtime-spec to work with the container runtime interface (CRI).
- It has to be open-source.
- It has to be mature enough and have a solid community.
To evaluate each product, we rely on its official pages and on various benchmarks and analyses.
Four products meet the requirements stated above:
OCI Runtime | Performance cost | Security | Community | Support |
---|---|---|---|---|
crun | very lightweight / can run an app as PID 1 / requires < 1 MB of memory / 50% faster than runc at executing containers | default* | 1.2k stars / 127 forks / 53 contributors | Part of the Containers project on GitHub |
gVisor | syscall overhead / slow networking / bandwidth overhead / IO overhead | default* + system call isolation / only 67 of ~350 syscalls are forwarded to the host kernel | 11.7k stars / 966 forks / 148 contributors | |
Kata Containers | big memory footprint / ~100 MB overhead for the virtual machine and guest OS / slow IO (but hardware passthrough is possible) | default* + isolation in a lightweight VM (hardware virtualization) | 1.5k stars / 253 forks / 172 contributors | OpenStack Foundation, 99cloud, AWcloud, Canonical, China Mobile, City Network, CoreOS, Dell/EMC, EasyStack, Fiberhome, Google, Huawei, JD.com, Mirantis, NetApp, Red Hat, SUSE, Tencent, Ucloud, UnitedStack and ZTE. |
runc | standard implemented by most CRI (the one we compare against) | default* | 8.4k stars / 1.6k forks / 275 contributors | Open Container Initiative (OCI) |
*A container default security is based on the following:
- isolation by namespaces
- cgroups to control resources access
- limited system calls with seccomp profiles
- Linux Capabilities for privilege access rights.
- Mandatory Access Control (MAC) to restrict objects access (AppArmor / SELinux)
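These defaults can be tightened per workload in Kubernetes. A hedged sketch, using fields from the core v1 `securityContext` API (the Pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo                         # placeholder Pod name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest    # placeholder image
    securityContext:
      runAsNonRoot: true                      # refuse to start the container as UID 0
      allowPrivilegeEscalation: false         # block setuid-style privilege escalation
      capabilities:
        drop: ["ALL"]                         # drop all Linux capabilities
      seccompProfile:
        type: RuntimeDefault                  # apply the container runtime's default seccomp profile
```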
crun
Pros:
- very lightweight footprint
- faster to execute containers
- binary 50x smaller than runc
Cons:
- seccomp and MAC security is difficult to adjust properly
- written in C, a language more error-prone than Go
gVisor
Pros:
- good security
- raw compute performance as efficient as runc's
Cons:
- performance loss (networking / IO) due to syscall overhead
Kata Containers
Pros:
- strong security
- can exploit VM features (like hardware passthrough)
- good overall performance
- big and active community
Cons:
- heavy memory footprint
runc
Pros:
- the default implemented in most CRI
- officially supported by Containerd and CRI-O
- good community
Cons:
- seccomp and MAC security is difficult to adjust properly
crun offers a scalability boost, spinning up containers faster than runc does. However, in this project the containers will execute Java code in the business workflow, so crun's speed advantage is insignificant compared to the application and JVM speed.
gVisor and Kata Containers push container security further by protecting the host from possible container breakouts when security is vital for the platform. However, they also come with additional complexity and performance drawbacks. For this project, such complexity is not necessary and does not reflect our reality.
Therefore, we decided to deploy runc on the Kubernetes nodes and rely on default container security and Kubernetes policies to ensure cluster security. runc is a good choice in most cases, as it has already proven itself in many production Kubernetes clusters for its efficiency and stability.
- A Comprehensive Container Runtime Comparison by Evan Baker, July 2020
- Performance Evaluation of Container Runtimes by Lennart Espe, Anshul Jindal, Vladimir Podolskiy and Michael Gerndt, May 2020
- The True Cost of Containing: A gVisor Case Study by Ethan G. Young, Pengfei Zhu, Tyler Caraza-Harter, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, July 2019
- Kata Containers and gVisor: a Quantitative Comparison by Xu Wang and Fupan Li, December 2018
- gVisor performance guide
- An introduction to crun, a fast and low-memory footprint container runtime by Dan Walsh, Valentin Rothberg, Giuseppe Scrivano, August 2020
There is growing interest in using different runtimes within a cluster. Sandboxes are the primary motivator for this right now, with Kata containers and gVisor looking to integrate with Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also require support in the future. RuntimeClass provides a way to select between different runtimes configured in the cluster and surface their properties (both to the cluster & the user).
Since v1.20, Kubernetes implements a stable version of RuntimeClass. Users can select the container runtime for a Pod with a field in the Pod or Deployment definition.
v1.16 introduced, in beta, the possibility to set scheduling constraints to ensure that Pods running with a RuntimeClass get scheduled to nodes that support it.
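As a sketch of how this looks in practice (the RuntimeClass name, node label, and Pod are illustrative; the `handler` must match a runtime configured in containerd or CRI-O on the target nodes):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: native                  # illustrative RuntimeClass name
handler: runc                   # must match a runtime handler configured in the CRI
scheduling:
  nodeSelector:
    runtime/runc: "true"        # illustrative label: only schedule on nodes providing this runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: demo                    # placeholder Pod name
spec:
  runtimeClassName: native      # select the RuntimeClass defined above
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```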