
Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503 #4605

Closed
pagarwal-tibco opened this issue May 12, 2021 · 27 comments

Comments

@pagarwal-tibco

pagarwal-tibco commented May 12, 2021

Getting the following error for the calico-node pod:

Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

Steps to Reproduce (for bugs)

I am deploying the Calico CNI in a 2-node Kubernetes kind (https://github.com/kubernetes-sigs/kind) cluster. I keep seeing the following liveness probe failures, with these logs:

2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:53.213 [INFO][53] felix/health.go 196: Overall health status changed newStatus=&health.HealthReport{Live:false, Ready:false}
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:00.455 [INFO][56] monitor-addresses/startup.go 768: Using autodetected IPv4 address on interface eth0: 10.245.2.131/25
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:04.557 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 154: Health: not ready

Your Environment

Can someone please help?

@pagarwal-tibco
Author

I am still facing this issue. Can someone please help?

@song-jiang
Member

@pagarwal-tibco Did you install Calico v3.18 on your kind cluster? What is the network backend, VXLAN or BGP?

@neiljerram Could you help?
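
For reference, one quick way to check which backend the manifest configured (a minimal sketch, assuming the standard calico-config ConfigMap created by the calico.yaml install):

    kubectl -n kube-system get configmap calico-config -o jsonpath='{.data.calico_backend}'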

@pagarwal-tibco
Author

@song-jiang Calico backend is "bird".

Here is the YAML file used for deploying Calico. Please note that the CRDs are deployed separately.
calico-all.yaml.zip

@nelljerram
Member

@pagarwal-tibco I think we will need more logs to understand this. Could you try changing

            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"

to

            - name: FELIX_LOGSEVERITYSCREEN
              value: "debug"

and then redeploy, and attach one of the node logs here?
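
For example, a minimal sketch of applying that change without re-editing the whole manifest (assuming the standard calico-node DaemonSet and container name in kube-system):

    # bump Felix screen logging to debug; this triggers a rolling restart of the DaemonSet
    kubectl -n kube-system set env daemonset/calico-node FELIX_LOGSEVERITYSCREEN=debug
    # once the pods have restarted, collect the log from one node
    kubectl -n kube-system logs daemonset/calico-node -c calico-node > calico-node.log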

Also wondering about your KIND version and config. Here's a config sample from our own testing:

    ${KIND} create cluster --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "192.168.128.0/17"
nodes:
# the control plane node
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF

Is yours also like that?

Our testing is using https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64. Could you try with that version - just in case something important has changed since then in KIND master?
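
For example (a minimal sketch of pinning that release on a Linux host):

    curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64
    chmod +x ./kind
    ./kind version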

@pagarwal-tibco
Author

We are using KIND versions 0.9 and 0.10, as we need Kubernetes versions 1.19 and 1.20. We are using the following KIND config:

apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  serviceSubnet: '192.168.240.0/20'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker

Calico node debug logs are here
calico.log

@nelljerram
Member

@pagarwal-tibco Thanks for the log. It indicates that the Felix component does become live after a few seconds. So perhaps the liveness problem is in another component. Can you check what kubectl describe says for a calico-node pod when it is not becoming live? There should be a message that gives a bit more detail about the problem.
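
For example (a minimal sketch, assuming the standard k8s-app=calico-node label from the manifest):

    # find the calico-node pods, then inspect their events
    kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
    kubectl -n kube-system describe pod -l k8s-app=calico-node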

@pagarwal-tibco
Author

@neiljerram
The calico-node pod keeps toggling between ready and not ready.

I see the following events for calico-node:

calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  36s (x17 over 9m24s)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
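
For what it's worth, the probes can also be run by hand to see which check is returning 503 (a minimal sketch, assuming the standard exec probe commands and labels from the calico.yaml manifest):

    POD=$(kubectl -n kube-system get pod -l k8s-app=calico-node -o name | head -n 1)
    kubectl -n kube-system exec "$POD" -c calico-node -- /bin/calico-node -felix-live -bird-live; echo "liveness exit: $?"
    kubectl -n kube-system exec "$POD" -c calico-node -- /bin/calico-node -felix-ready -bird-ready; echo "readiness exit: $?"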

@pagarwal-tibco
Author

@neiljerram
I see the same problem with Calico 3.19.

  Warning  Unhealthy  13m (x14 over 23m)   kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
  Warning  Unhealthy  3m6s (x24 over 18m)  kubelet  (combined from similar events): Readiness probe failed: 2021-05-25 06:36:25.848 [INFO][6210] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503

Please let me know if you need any more information.

@pagarwal-tibco
Author

@neiljerram
I am still facing this issue. Any pointers please?

@pagarwal-tibco
Author

Any updates on this issue?

@lwr20
Member

lwr20 commented Jun 15, 2021

I have seen these symptoms in a system that was starved of CPU. It might be worth trying this on a machine with more CPU?
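
For example, two quick ways to check (a minimal sketch; kubectl top requires metrics-server to be installed):

    # CPU usage of the kind node containers on the host
    docker stats --no-stream
    # per-node usage as seen by Kubernetes
    kubectl top nodes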

@lmm
Contributor

lmm commented Jun 15, 2021

Are the pod and service CIDRs overlapping? Can you try removing the serviceSubnet?
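
(The serviceSubnet 192.168.240.0/20 above does fall inside the podSubnet 192.168.0.0/16.) For example, a minimal sketch of the same config with serviceSubnet dropped, so kind falls back to its default, non-overlapping service CIDR:

apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker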

@pagarwal-tibco
Author

@lmm I tried removing serviceSubnet, and also tried a non-overlapping value, but the issue is still the same.

@lwr20 I have a capable machine, and top shows the CPU is idle as well.

@lmm
Contributor

lmm commented Jul 13, 2021

@pagarwal-tibco are you using a Linux host? I cannot repro what you're seeing and we use kind quite a bit in our automated tests. Perhaps there is something on your host that is interfering with Calico.

If you're using a Mac, there is this kind issue that might be worth looking into: kubernetes-sigs/kind#2308

@pierluigilenoci

@caseydavenport why did you close the ticket?

@nelljerram
Member

@pierluigilenoci I presume because the OP has not responded since 13th July?

@pierluigilenoci

A month is not that long. Maybe he caught COVID or is on vacation. Let's try to nudge him...

@pagarwal-tibco knock knock!

@caseydavenport
Member

A month is plenty long - we usually close tickets without a response in 2-3 weeks. We can always re-open if the OP returns.

@pagarwal-tibco
Author

Sorry for the late reply, I was away. I upgraded Docker for Mac to 3.6.0 and can confirm that it works now. So it seems the issue was caused by Docker for Mac.

Thanks for all the help.

@nelljerram
Member

Thanks @pagarwal-tibco !

@ciiiii

ciiiii commented Oct 9, 2021

The same problem on a k8s node (Ubuntu 18.04.5 LTS / 5.4.0-60-generic):

Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  11m (x8338 over 10d)  kubelet  (combined from similar events): Readiness probe failed: 2021-10-09 06:49:07.655 [INFO][27506] confd/health.go 180: Number of node(s) with BGP peering established = 76
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  5m17s (x6281 over 20d)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

@nelljerram
Member

@ciiiii Please open a new issue, and describe

  • the Calico version
  • your cluster setup process
  • whether you see this liveness issue immediately after cluster setup, or if it occurs after you've made some other change.

@fcolista

Seems that nobody cares about this issue...

@gondaz

gondaz commented Apr 18, 2022

I've been struggling with this issue for the past few days and managed to fix it by editing a ClusterRole resource.
I have an RKE-based cluster (version 1.21.10), and I upgraded the Calico-related images to 3.21.5; after that, the initial health-check issue cropped up.
Make sure you have the proper ClusterRole manifest, as follows (copied from the original Calico website):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  # EndpointSlices are used for Service-based network policy rule
  # enforcement.
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs:
      - watch
      - list
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  # Pod CIDR auto-detection on kubeadm needs access to config maps.
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
      - caliconodestatuses
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch

Hopefully it helps.
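
To apply and spot-check it (a minimal sketch; the file name is illustrative, and the service account below assumes the standard calico-node account in kube-system, which may differ on an RKE install):

    kubectl apply -f calico-node-clusterrole.yaml
    # verify one of the newer permissions is actually granted to the calico-node service account
    kubectl auth can-i watch blockaffinities.crd.projectcalico.org \
      --as=system:serviceaccount:kube-system:calico-node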

@willzhang

Upgrading the Calico version resolved my problem; see kubesphere/kubekey#1282.

@rajaie-sg

We ran into a similar issue and were able to resolve it by setting CPU requests for calico-node Pods. #3420 (comment)
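
For example, a minimal sketch of what that looks like on the calico-node container in the DaemonSet spec (the values are illustrative, not recommendations):

    # under the calico-node container in the DaemonSet pod template
    resources:
      requests:
        cpu: 250m
        memory: 128Mi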

@LiAuTraver

I checked my logs and found I had forgotten to install ipset, which calico-node requires. (I was in a rush, though.)
So I just installed it and the problem disappeared. 😂
Maybe I'm being dumb here, but I'm posting in case anyone else runs into the same situation.
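
For anyone hitting the same thing, a minimal sketch of installing it (assuming a Debian- or RHEL-family host):

    # Debian/Ubuntu
    apt-get update && apt-get install -y ipset
    # RHEL/CentOS
    yum install -y ipset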
