This document provides troubleshooting tips for installing and using Sysbox on Kubernetes clusters.
For troubleshooting outside of Kubernetes environments, see here.
- sysbox-deploy-k8s fails to start
- sysbox-deploy-k8s fails to install Sysbox
- The sysbox-deploy-k8s daemonset causes pods to enter "Error" state
- CRI-O Can't find CNI Binaries
- Pod stuck in "Creating" status
- Sysbox health status
- CRI-O health status
- Kubelet health status
- Low-level Debug with crictl
If the sysbox-deploy-k8s daemonset fails to start, it is likely because either:
- The RBAC resource for sysbox-deploy-k8s is not present.
- The K8s worker node is not labeled with "sysbox-install=yes".
Make sure to follow the Sysbox installation instructions to solve this.
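As a quick check for the missing-label case, the sketch below lists nodes that lack a given label. It assumes the tabular output of kubectl get nodes --show-labels --no-headers, where the labels form the last, comma-separated column. The function name nodes_missing_label is hypothetical.

```shell
# Hypothetical helper: read `kubectl get nodes --show-labels --no-headers`
# output on stdin and print the names of nodes that lack the given label.
# Assumes the labels are the last whitespace-separated column.
nodes_missing_label() {
  awk -v want="$1" '
    NF > 0 {
      n = split($NF, labels, ",")
      found = 0
      for (i = 1; i <= n; i++) if (labels[i] == want) found = 1
      if (!found) print $1
    }'
}

# Usage (from a machine with cluster access):
#   kubectl get nodes --show-labels --no-headers | nodes_missing_label sysbox-install=yes
# To label a node that is reported (replace <node-name>):
#   kubectl label nodes <node-name> sysbox-install=yes
```

An empty result means every node carries the label, so the RBAC resource is the next thing to check.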
If the sysbox-deploy-k8s daemonset fails to install Sysbox, take a look at the logs for the sysbox-deploy-k8s pod (there is one such pod for each K8s worker node where Sysbox is installed). The logs should ideally look like this:
Adding K8s label "crio-runtime=installing" to node
node/gke-cluster-3-default-pool-766039d3-68mw labeled
Deploying CRI-O installer agent on the host ...
Running CRI-O installer agent on the host (may take several seconds) ...
Removing CRI-O installer agent from the host ...
Configuring CRI-O ...
Adding K8s label "sysbox-runtime=installing" to node
node/gke-cluster-3-default-pool-766039d3-68mw labeled
Installing Sysbox dependencies on host
Copying shiftfs sources to host
Deploying Sysbox installer helper on the host ...
Running Sysbox installer helper on the host (may take several seconds) ...
Stopping the Sysbox installer helper on the host ...
Removing Sysbox installer helper from the host ...
Installing Sysbox on host
Detected host distro: ubuntu_20.04
Configuring host sysctls
kernel.unprivileged_userns_clone = 1
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 1048576
kernel.keys.maxkeys = 20000
kernel.keys.maxbytes = 400000
Starting Sysbox
Adding Sysbox to CRI-O config
Restarting CRI-O ...
Deploying Kubelet config agent on the host ...
Running Kubelet config agent on the host (will restart Kubelet and temporary bring down all pods on this node for ~1 min) ...
The sysbox-deploy-k8s pod will restart the kubelet on the K8s node. After the restart, the pod's log will continue as follows:
Stopping the Kubelet config agent on the host ...
Removing Kubelet config agent from the host ...
Kubelet reconfig completed.
Adding K8s label "crio-runtime=running" to node
node/gke-cluster-3-default-pool-766039d3-68mw labeled
Adding K8s label "sysbox-runtime=running" to node
node/gke-cluster-3-default-pool-766039d3-68mw labeled
The k8s runtime on this node is now CRI-O.
Sysbox installation completed.
Done.
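To compare a real log against the expected output above, the following sketch reports which milestone messages are absent. The milestone strings are copied from the sample log and may differ between Sysbox versions. The function name missing_milestones is hypothetical.

```shell
# Hypothetical helper: read a sysbox-deploy-k8s pod log on stdin and print
# the milestone messages (taken from the sample log above) that are missing.
# Assumption: these strings are stable; other Sysbox versions may differ.
missing_milestones() {
  log=$(cat)
  for m in \
    "Configuring CRI-O" \
    "Installing Sysbox on host" \
    "Starting Sysbox" \
    "Kubelet reconfig completed." \
    "Sysbox installation completed."
  do
    case "$log" in
      *"$m"*) ;;                          # milestone found, nothing to report
      *) printf 'missing: %s\n' "$m" ;;
    esac
  done
}

# Usage (replace <pod> with the sysbox-deploy-k8s pod name):
#   kubectl -n kube-system logs <pod> | missing_milestones
```

If the pod's container was restarted along with the kubelet, the earlier part of the log may only be available through kubectl logs --previous, so early milestones can show up as missing even on a successful install.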
In addition, sysbox-deploy-k8s creates 3 ephemeral systemd services on the
K8s nodes where Sysbox is installed. These services are short-lived and help
with the installation of CRI-O and Sysbox, and with the reconfiguration of the
kubelet. They are called crio-installer, sysbox-installer-helper, and
kubelet-config-helper. Look at the logs generated by these systemd services on
the K8s worker nodes to make sure they don't report errors:
journalctl -eu crio-installer
journalctl -eu sysbox-installer-helper
journalctl -eu kubelet-config-helper
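When these logs are long, a small filter can help surface suspicious lines. This is only a sketch: the match pattern is a heuristic assumption, not a format the installer services guarantee, and the function name error_lines is hypothetical.

```shell
# Hypothetical helper: print only the lines on stdin that look like errors.
# The pattern is a heuristic; it can miss problems or flag harmless lines.
error_lines() {
  grep -iE 'error|fail|fatal' || true   # "no match" is not treated as a failure
}

# Usage on a K8s worker node:
#   for u in crio-installer sysbox-installer-helper kubelet-config-helper; do
#     echo "== $u"
#     journalctl -u "$u" --no-pager | error_lines
#   done
```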
The sysbox-deploy-k8s daemonset may cause some pods to enter an error state temporarily:
$ kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-74ff55c5b-ff58f 1/1 Running 0 11d
kube-system pod/coredns-74ff55c5b-t6t5b 1/1 Error 0 11d
kube-system pod/etcd-k8s-node2 1/1 Running 1 11d
kube-system pod/kube-apiserver-k8s-node2 1/1 Running 1 11d
kube-system pod/kube-controller-manager-k8s-node2 1/1 Running 1 11d
kube-system pod/kube-flannel-ds-4pqgp 1/1 Error 4 4d22h
kube-system pod/kube-flannel-ds-lkvnp 1/1 Running 0 10d
kube-system pod/kube-proxy-4mbfj 1/1 Error 4 4d22h
kube-system pod/kube-proxy-5lfz4 1/1 Running 0 11d
kube-system pod/kube-scheduler-k8s-node2 1/1 Running 1 11d
kube-system pod/sysbox-deploy-k8s-rbl76 1/1 Error 1 136m
This is expected because sysbox-deploy-k8s restarts the Kubelet, and this causes all pods on the K8s worker node(s) where Sysbox is being installed to be re-created.
This condition should resolve itself within 1 to 2 minutes after running the sysbox-deploy-k8s daemonset, as Kubernetes will automatically restart the affected pods.
If for some reason one of the pods remains in an error state forever, check the state and logs associated with that pod:
kubectl -n kube-system describe pod <pod>
kubectl -n kube-system logs <pod>
Feel free to open an issue in the Sysbox repo so we can investigate. As a work-around, if the pod is part of a deployment or daemonset, try removing the pod; this causes Kubernetes to re-create it, which sometimes fixes the problem.
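To spot pods that have not recovered after the 1 to 2 minute window, the sketch below filters a pod listing down to those whose status is neither Running nor Completed. It assumes the column layout of kubectl get pods --all-namespaces --no-headers (namespace, name, ready, status, ...). The function name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print "<namespace>/<pod> <status>" for every pod whose
# STATUS column (4th field) is neither Running nor Completed.
unhealthy_pods() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
# Work-around for a stuck pod owned by a deployment or daemonset:
#   kubectl -n <namespace> delete pod <pod>
```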
If the CRI-O log (journalctl -u crio) shows an error such as:
Error validating CNI config file /etc/cni/net.d/10-containerd-net.conflist: [failed to find plugin in opt/cni/bin]
it means CRI-O can't find the binaries for the CNI. By default, CRI-O looks for them in /opt/cni/bin.
In addition, the sysbox-deploy-k8s daemonset configures CRI-O to look for CNIs in /home/kubernetes/bin/,
as this is where they are found on GKE nodes.
If you see this error, find the directory where the CNIs are located on the worker node,
and add that directory to the /etc/crio/crio.conf file:
[crio.network]
plugin_dirs = ["/opt/cni/bin/", "/home/kubernetes/bin"]
Then restart CRI-O and it should pick up the CNI binaries correctly.
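To find which candidate directories actually hold plugin binaries on a node, something like the following can be used. It is a rough sketch: it treats any executable regular file as a CNI plugin, which is an assumption, and the function name dirs_with_executables is hypothetical.

```shell
# Hypothetical helper: print each given directory that contains at least one
# executable regular file (a rough stand-in for "contains CNI plugins").
dirs_with_executables() {
  for d in "$@"; do
    [ -d "$d" ] || continue               # skip paths that are not directories
    for f in "$d"/*; do
      if [ -f "$f" ] && [ -x "$f" ]; then
        printf '%s\n' "$d"
        break                             # one hit per directory is enough
      fi
    done
  done
}

# Usage on the worker node (the directories are the two named above):
#   dirs_with_executables /opt/cni/bin /home/kubernetes/bin
# After adding the reported directory to plugin_dirs:
#   sudo systemctl restart crio
```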
If you are deploying a pod with Sysbox and it gets stuck in the "Creating" state for a long time (e.g., > 1 minute), then something is likely wrong.
To debug, start by running kubectl describe pod <pod-name> to see what
Kubernetes reports.
Additionally, check the CRI-O and Kubelet logs on the node where the pod was scheduled:
$ journalctl -eu kubelet
$ journalctl -eu crio
These often have information that helps pinpoint the problem.
For AWS EKS nodes, the kubelet runs in a snap package; check its health via:
$ snap get kubelet-eks
$ journalctl -eu snap.kubelet-eks.daemon.service
If the kubelet log shows an error such as "failed to find runtime handler sysbox-runc from runtime list",
then this means CRI-O has not recognized Sysbox for some reason.
In this case, double check that the CRI-O config has the Sysbox runtime directive in it (the sysbox-deploy-k8s daemonset should have set this config):
$ cat /etc/crio/crio.conf
...
[crio.runtime.runtimes.sysbox-runc]
allowed_annotations = ["io.kubernetes.cri-o.userns-mode"]
runtime_path = "/usr/bin/sysbox-runc"
runtime_type = "oci"
...
If the sysbox runtime config is present, then try restarting CRI-O on the worker node:
systemctl restart crio
Note that restarting CRI-O will cause all pods on the node to be restarted (including the kube-proxy and CNI pods).
If the sysbox runtime config is not present, then uninstall and re-install the sysbox daemonset.
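The decision above can be sketched as a small check on the config file. It only looks for the sysbox-runc table header in the file it is given. CRI-O can also read drop-in files (for example under /etc/crio/crio.conf.d/), which this sketch does not inspect. The function name has_sysbox_runtime is hypothetical.

```shell
# Hypothetical helper: succeed if the given CRI-O config file contains an
# uncommented [crio.runtime.runtimes.sysbox-runc] table header.
has_sysbox_runtime() {
  grep -qE '^[[:space:]]*\[crio\.runtime\.runtimes\.sysbox-runc\][[:space:]]*$' "$1"
}

# Usage on the worker node:
#   if has_sysbox_runtime /etc/crio/crio.conf; then
#     sudo systemctl restart crio      # restarts all pods on the node
#   else
#     echo "sysbox-runc not configured; re-install the sysbox daemonset"
#   fi
```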
To check the health of Sysbox on a worker node:
$ systemctl status sysbox
$ systemctl status sysbox-mgr
$ systemctl status sysbox-fs
$ journalctl -eu sysbox
$ journalctl -eu sysbox-mgr
$ journalctl -eu sysbox-fs
To check the health of CRI-O:
$ systemctl status crio
$ journalctl -eu crio
To check the health of the kubelet:
$ systemctl status kubelet
$ journalctl -eu kubelet
The crictl tool can be used to communicate with CRI implementations directly
(e.g., CRI-O or containerd). crictl is typically present on K8s nodes. If for
some reason it's not, you can install it as shown here:
https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md
You should install crictl on the K8s worker nodes where CRI-O is installed,
and configure it as follows:
$ cat /etc/crictl.yaml
runtime-endpoint: "unix:///var/run/crio/crio.sock"
image-endpoint: "unix:///var/run/crio/crio.sock"
timeout: 0
debug: false
pull-image-on-create: true
disable-pull-on-run: false
The key is to set runtime-endpoint and image-endpoint to CRI-O as shown
above.
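A quick way to confirm both settings is sketched below. It assumes the one-key-per-line YAML layout and the socket path shown above. The function name crictl_points_to_crio is hypothetical.

```shell
# Hypothetical helper: succeed only if both runtime-endpoint and
# image-endpoint in the given crictl config point at the CRI-O socket.
# Assumes each key sits on its own line, as in the example above.
crictl_points_to_crio() {
  sock='unix:///var/run/crio/crio.sock'
  for key in runtime-endpoint image-endpoint; do
    grep -E "^${key}:[[:space:]]*\"?${sock}\"?[[:space:]]*$" "$1" >/dev/null || return 1
  done
}

# Usage on the worker node:
#   crictl_points_to_crio /etc/crictl.yaml && echo "crictl is configured for CRI-O"
```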
After this, you can use crictl to determine the health of the pods on the node.
For example, in a properly initialized K8s worker node you should see the kube-proxy pod running on the node:
$ sudo crictl ps
CONTAINER ID IMAGE CREATED STATE NAME ATTEMPT POD ID
48912b06799d2 43154ddb57a83de3068fe603e9c7393e7d2b77cb18d9e0daf869f74b1b4079c0 2 days ago Running kube-proxy 40 91462637d4e23