Calico-node crashed on Debian 11 #8186

Closed
DANic-git opened this issue Nov 11, 2021 · 8 comments · Fixed by #8206
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@DANic-git
Contributor

DANic-git commented Nov 11, 2021

Environment:

  • Cloud provider or hardware configuration:
    VM based on box generic/debian11

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.10.0-9-amd64 x86_64
    PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
    NAME="Debian GNU/Linux"
    VERSION_ID="11"
    VERSION="11 (bullseye)"
    VERSION_CODENAME=bullseye
    ID=debian
    HOME_URL="https://www.debian.org/"
    SUPPORT_URL="https://www.debian.org/support"
    BUG_REPORT_URL="https://bugs.debian.org/"

  • Version of Ansible (ansible --version):
    ansible 2.10.11

  • Version of Python (python --version):
    Python 3.9.6

Kubespray version (commit) (git rev-parse --short HEAD):
0d0468e

Network plugin used:
calico

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

https://termbin.com/76wp

calico_version: "v3.20.2"
calico_ip_auto_method: "interface=eth.*"
calico_bpf_enabled: true
kube_proxy_remove: true
calico_bpf_service_mode: "DSR"

containerd_version: 1.5.7
container_manager: containerd
etcd_deployment_type: host

Command used to invoke ansible:
ansible-playbook -i inventory/hosts.yaml -b cluster.yml

Output of ansible run:

Anything else we need to know:

kubectl describe po -n kube-system calico-node-q2c7r

Name:                 calico-node-q2c7r
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 node3/192.168.0.203
Start Time:           Thu, 11 Nov 2021 07:31:35 +0000
Labels:               controller-revision-hash=67b95d5764
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   192.168.0.203
IPs:
  IP:           192.168.0.203
Controlled By:  DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  containerd://4f594b922653d2c0093af8d1346f23c134abc5589b25d03cd072a5b0cf78416e
    Image:         quay.io/calico/cni:v3.20.2
    Image ID:      quay.io/calico/cni@sha256:523f10d2da3872198d80cf3571da97e6f36f7839c9fc6424430866fead679274
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Nov 2021 08:02:05 +0000
      Finished:     Thu, 11 Nov 2021 08:02:05 +0000
    Ready:          True
    Restart Count:  5
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q59pv (ro)
  install-cni:
    Container ID:  containerd://836d8b3a0f6f3cf9e7567c392ffdb95e1ea2501bb879158966a20a784edaf15b
    Image:         quay.io/calico/cni:v3.20.2
    Image ID:      quay.io/calico/cni@sha256:523f10d2da3872198d80cf3571da97e6f36f7839c9fc6424430866fead679274
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Nov 2021 08:02:07 +0000
      Finished:     Thu, 11 Nov 2021 08:02:08 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:            10-calico.conflist
      UPDATE_CNI_BINARIES:      true
      CNI_NETWORK_CONFIG_FILE:  /host/etc/cni/net.d/calico.conflist.template
      SLEEP:                    false
      KUBERNETES_NODE_NAME:      (v1:spec.nodeName)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q59pv (ro)
  flexvol-driver:
    Container ID:   containerd://deaf56e873525e90acd27282e1e46042bf0d88dc491bd23b35522ff1e4f7327c
    Image:          quay.io/calico/pod2daemon-flexvol:v3.20.2
    Image ID:       quay.io/calico/pod2daemon-flexvol@sha256:0d963c8e0313ab57c84c1f7de7ff3e9119cd519d8bbbc9698cd7fa92b0d6eca9
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Nov 2021 08:02:10 +0000
      Finished:     Thu, 11 Nov 2021 08:02:10 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q59pv (ro)
Containers:
  calico-node:
    Container ID:   containerd://b717549566b9015eefadb45faa46fa44f3fd9551d40828905ece1aeedb33748d
    Image:          quay.io/calico/node:v3.20.2
    Image ID:       quay.io/calico/node@sha256:7aecbf8eb397b4838fd656a44c30007cfb506ff7e8ee6718ec3660979628d1a3
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 11 Nov 2021 08:03:31 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 11 Nov 2021 07:58:58 +0000
      Finished:     Thu, 11 Nov 2021 08:01:58 +0000
    Ready:          False
    Restart Count:  12
    Limits:
      cpu:     300m
      memory:  500M
    Requests:
      cpu:      150m
      memory:   64M
    Liveness:   exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=10s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=10s period=10s #success=1 #failure=6
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      DATASTORE_TYPE:                         kubernetes
      WAIT_FOR_DATASTORE:                     true
      CALICO_NETWORKING_BACKEND:              <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                           <set to the key 'cluster_type' of config map 'calico-config'>    Optional: false
      CALICO_K8S_NODE_REF:                     (v1:spec.nodeName)
      CALICO_DISABLE_FILE_LOGGING:            true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:      RETURN
      FELIX_HEALTHHOST:                       localhost
      FELIX_IPTABLESBACKEND:                  Legacy
      FELIX_IPTABLESLOCKTIMEOUTSECS:          10
      CALICO_IPV4POOL_IPIP:                   Off
      FELIX_IPV6SUPPORT:                      False
      FELIX_LOGSEVERITYSCREEN:                info
      CALICO_STARTUP_LOGLEVEL:                error
      FELIX_USAGEREPORTINGENABLED:            False
      FELIX_CHAININSERTMODE:                  Insert
      FELIX_PROMETHEUSMETRICSENABLED:         False
      FELIX_PROMETHEUSMETRICSPORT:            9091
      FELIX_PROMETHEUSGOMETRICSENABLED:       True
      FELIX_PROMETHEUSPROCESSMETRICSENABLED:  True
      IP_AUTODETECTION_METHOD:                interface=eth.*
      IP:                                     autodetect
      NODENAME:                                (v1:spec.nodeName)
      FELIX_HEALTHENABLED:                    true
      FELIX_IGNORELOOSERPF:                   False
      CALICO_MANAGE_CNI:                      true
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /sys/fs/ from sysfs (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (ro)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q59pv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  sysfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/
    HostPathType:  DirectoryOrCreate
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  kube-api-access-q59pv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  32m                 default-scheduler  Successfully assigned kube-system/calico-node-q2c7r to node3
  Normal   Pulled     32m                 kubelet            Container image "quay.io/calico/cni:v3.20.2" already present on machine
  Normal   Created    32m                 kubelet            Created container upgrade-ipam
  Normal   Started    32m                 kubelet            Started container upgrade-ipam
  Normal   Pulled     32m                 kubelet            Container image "quay.io/calico/cni:v3.20.2" already present on machine
  Normal   Created    32m                 kubelet            Created container install-cni
  Normal   Started    32m                 kubelet            Started container install-cni
  Normal   Pulling    32m                 kubelet            Pulling image "quay.io/calico/pod2daemon-flexvol:v3.20.2"
  Normal   Pulling    32m                 kubelet            Pulling image "quay.io/calico/pod2daemon-flexvol:v3.20.2"
  Normal   Pulled     31m                 kubelet            Successfully pulled image "quay.io/calico/pod2daemon-flexvol:v3.20.2" in 11.938642187s
  Normal   Created    31m                 kubelet            Created container flexvol-driver
  Normal   Started    31m                 kubelet            Started container flexvol-driver
  Normal   Created    31m                 kubelet            Created container calico-node
  Normal   Started    31m                 kubelet            Started container calico-node
  Warning  Unhealthy  31m                 kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  Unhealthy  31m                 kubelet            Liveness probe failed: calico/node is not ready: Felix is not live: Get "http://localhost:9099/liveness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  31m (x2 over 31m)   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
  Warning  Unhealthy  30m                 kubelet            Liveness probe failed: calico/node is not ready: bird/confd is not live: Service bird is not running. Output << down: /etc/service/enabled/bird: 2s, normally up, want up >>
  Warning  Unhealthy  30m (x2 over 31m)   kubelet            Liveness probe failed: calico/node is not ready: bird/confd is not live: Service bird is not running. Output << down: /etc/service/enabled/bird: 0s, normally up, want up >>
  Warning  Unhealthy  30m                 kubelet            Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
  Normal   Pulled     30m (x2 over 31m)   kubelet            Container image "quay.io/calico/node:v3.20.2" already present on machine
  Normal   Killing    30m                 kubelet            Container calico-node failed liveness probe, will be restarted
  Warning  Unhealthy  30m (x2 over 30m)   kubelet            Liveness probe failed: calico/node is not ready: bird/confd is not live: Service bird is not running. Output << down: /etc/service/enabled/bird: 1s, normally up, want up >>
  Warning  Unhealthy  30m (x2 over 31m)   kubelet            Readiness probe failed:
  Warning  Unhealthy  16m                 kubelet            Liveness probe failed:
  Warning  Unhealthy  12m (x25 over 30m)  kubelet            (combined from similar events): Readiness probe failed: 2021-11-11 07:52:08.530 [INFO][1979] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.0.201,192.168.0.202
  Warning  Unhealthy  7m1s (x58 over 31m)  kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Normal   Started    2m4s (x5 over 26m)   kubelet  Started container upgrade-ipam

On Debian 10 it works fine.

@DANic-git added the kind/bug label Nov 11, 2021
@cristicalin
Contributor

I'm facing something similar while debugging #8175.

@cristicalin
Contributor

This issue seems to be related to container_manager: containerd; with container_manager: docker the deployment on Debian 11 works OK. I'm still trying to get to the bottom of this, but I don't think it is related to calico-node itself; from my testing, all pods get mangled at some point.

Using either etcd_deployment_type: host or etcd_kubeadm_enabled: true yields the same result: the cluster crashes.

Containerd deployment: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/1793953991 but multiple runs yield different failures.
Docker deployment: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/1793953992

@cristicalin
Contributor

cristicalin commented Nov 17, 2021

After some tinkering, it seems Debian 11 switched to cgroup v2 and containerd is not yet ready for this. As a workaround you need to revert to cgroup v1 by adding systemd.unified_cgroup_hierarchy=0 to the kernel command line in GRUB and rebooting the VMs.
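For reference, a minimal sketch of applying that workaround by hand on a Debian 11 node (assuming GRUB is the bootloader; these are standard Debian commands, not something Kubespray does for you):

# Check which cgroup hierarchy is mounted: "cgroup2fs" means cgroup v2, "tmpfs" means v1.
stat -fc %T /sys/fs/cgroup/

# Append the parameter to the kernel command line in /etc/default/grub, so the
# line ends up as e.g. GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0 ...".
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub

# Regenerate the GRUB configuration and reboot so the kernel picks up the change.
sudo update-grub
sudo reboot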

We need to address this in CI; we already have similar mitigations for Fedora 31+ (roles/container-engine/containerd/tasks/main.yml). It is interesting that Docker is happy with this setup.

@cristicalin
Contributor

I just concluded a test, and it seems that using kubelet_cgroup_driver=cgroupfs on Debian 11 gives a working cluster; please try this in your own configuration.
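One way to try this (a sketch only; kubelet_cgroup_driver can equally be set in your inventory group_vars) is to re-run the playbook from this report with the variable passed as an extra var:

# Hypothetical re-run with the cgroupfs driver forced for kubelet.
ansible-playbook -i inventory/hosts.yaml -b cluster.yml \
  -e kubelet_cgroup_driver=cgroupfs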

@champtar
Contributor

On CentOS 8.5 I have containerd 1.5.7 / k8s 1.22.3 / cgroup_driver systemd / cgroup v2 working.
This is without Kubespray; Kubespray doesn't configure containerd properly.
You need something like:

version = 2

[plugins."io.containerd.grpc.v1.cri"]
  [plugins."io.containerd.grpc.v1.cri".cni]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      BinaryName = ""
      CriuImagePath = ""
      CriuPath = ""
      CriuWorkPath = ""
      IoGid = 0
      IoUid = 0
      NoNewKeyring = false
      NoPivotRoot = false
      Root = ""
      ShimCgroup = ""
      SystemdCgroup = true

@cristicalin
Contributor

This is how we configure containerd out of the box:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  level = "info"

[metrics]
  address = ""
  grpc_histogram = false

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "k8s.gcr.io/pause:3.3"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemCgroup = true
    [plugins."io.containerd.grpc.v1.cri".registry]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]

I'm guessing the systemCgroup = true vs SystemdCgroup = true difference is the main one, since the other keys are basically empty/defaults; I'm not 100% sure about the NoNewKeyring = false and NoPivotRoot = false keys.
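A quick way to see which cgroup driver containerd actually ended up with (generic containerd CLI, not taken from this thread) is to dump the merged configuration on an affected node:

# Print the effective configuration containerd loaded and check the runc
# options; a correctly configured systemd driver should show "SystemdCgroup = true".
containerd config dump | grep -i systemdcgroup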

@cristicalin
Contributor

Seems like we introduced a typo in #8123 which broke containerd with the systemd cgroup driver.

/cc @pasqualet

@pasqualet
Contributor

> Seems like we introduced a typo in #8123 which broke containerd with the systemd cgroup driver.
>
> /cc @pasqualet

You are right, #8123 introduced the typo, but I'm not sure it's the reason for this issue. Anyway, I've created #8206 to fix the typo.

I think we need better integration tests, and #6400 seems like it could be the starting point to work on it.
