
Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503 #4605

Closed
pagarwal-tibco opened this issue May 12, 2021 · 27 comments

Comments

@pagarwal-tibco

pagarwal-tibco commented May 12, 2021

Getting the following error for the calico-node pod:

Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

Steps to Reproduce (for bugs)

I am deploying the Calico CNI in a 2-node Kubernetes kind (https://github.com/kubernetes-sigs/kind) cluster. I keep seeing the following liveness probe failures, with these logs:

2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:53.213 [INFO][53] felix/health.go 196: Overall health status changed newStatus=&health.HealthReport{Live:false, Ready:false}
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:00.455 [INFO][56] monitor-addresses/startup.go 768: Using autodetected IPv4 address on interface eth0: 10.245.2.131/25
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:04.557 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 154: Health: not ready

Your Environment

Can someone please help?

@pagarwal-tibco
Author

I am still facing this issue. Can someone please help?

@song-jiang
Member

@pagarwal-tibco Did you install Calico v3.18 on your kind cluster? What is the network backend, VXLAN or BGP?

@neiljerram Could you help?
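
For reference, one quick way to check which backend the manifest configured (a minimal sketch, assuming the standard calico-config ConfigMap created by the calico.yaml install):

    kubectl -n kube-system get configmap calico-config -o jsonpath='{.data.calico_backend}'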

@pagarwal-tibco
Author

@song-jiang Calico backend is "bird".

Here is the YAML file used for deploying Calico. Please note that the CRDs are deployed separately.
calico-all.yaml.zip

@nelljerram
Member

@pagarwal-tibco I think we will need more logs to understand this. Could you try changing

            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"

to

            - name: FELIX_LOGSEVERITYSCREEN
              value: "debug"

and then redeploy, and attach one of the node logs here?
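
For example, a minimal sketch of applying that change without re-editing the whole manifest (assuming the standard calico-node DaemonSet and container name in kube-system):

    # bump Felix screen logging to debug; this triggers a rolling restart of the DaemonSet
    kubectl -n kube-system set env daemonset/calico-node FELIX_LOGSEVERITYSCREEN=debug
    # once the pods have restarted, collect the log from one node
    kubectl -n kube-system logs daemonset/calico-node -c calico-node > calico-node.log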

Also wondering about your KIND version and config. Here's a config sample from our own testing:

    ${KIND} create cluster --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "192.168.128.0/17"
nodes:
# the control plane node
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF

Is yours also like that?

Our testing is using https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64. Could you try with that version - just in case something important has changed since then in KIND master?
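
For example (a minimal sketch of pinning that release on a Linux host):

    curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64
    chmod +x ./kind
    ./kind version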

@pagarwal-tibco
Author

We are using KIND versions 0.9 and 0.10, as we need Kubernetes versions 1.19 and 1.20. We are using the following KIND config:

apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  serviceSubnet: '192.168.240.0/20'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker

Calico node debug logs are here
calico.log

@nelljerram
Member

@pagarwal-tibco Thanks for the log. It indicates that the Felix component does become live after a few seconds. So perhaps the liveness problem is in another component. Can you check what kubectl describe says for a calico-node pod when it is not becoming live? There should be a message that gives a bit more detail about the problem.
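
For example (a minimal sketch, assuming the standard k8s-app=calico-node label from the manifest):

    # find the calico-node pods, then inspect their events
    kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
    kubectl -n kube-system describe pod -l k8s-app=calico-node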

@pagarwal-tibco
Author

@neiljerram
The calico-node pod keeps toggling between ready and not ready.

I see the following events for calico-node:

calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  36s (x17 over 9m24s)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
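
For what it's worth, the probes can also be run by hand to see which check is returning 503 (a minimal sketch, assuming the standard exec probe commands and labels from the calico.yaml manifest):

    POD=$(kubectl -n kube-system get pod -l k8s-app=calico-node -o name | head -n 1)
    kubectl -n kube-system exec "$POD" -c calico-node -- /bin/calico-node -felix-live -bird-live; echo "liveness exit: $?"
    kubectl -n kube-system exec "$POD" -c calico-node -- /bin/calico-node -felix-ready -bird-ready; echo "readiness exit: $?"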

@pagarwal-tibco
Author

@neiljerram
I see the same problem with Calico 3.19.

  Warning  Unhealthy  13m (x14 over 23m)   kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
  Warning  Unhealthy  3m6s (x24 over 18m)  kubelet  (combined from similar events): Readiness probe failed: 2021-05-25 06:36:25.848 [INFO][6210] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503

Please let me know if you need any more information.

@pagarwal-tibco
Author

@neiljerram
I am still facing this issue. Any pointers please?

@pagarwal-tibco
Author

Any updates on this issue?

@lwr20
Member

lwr20 commented Jun 15, 2021

I have seen these symptoms in a system that was starved of CPU. It might be worth trying this on a machine with more CPU?
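
For example, two quick ways to check (a minimal sketch; kubectl top requires metrics-server to be installed):

    # CPU usage of the kind node containers on the host
    docker stats --no-stream
    # per-node usage as seen by Kubernetes
    kubectl top nodes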

@lmm
Contributor

lmm commented Jun 15, 2021

Are the pod and service CIDRs overlapping? Can you try removing the serviceSubnet?
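
(The serviceSubnet 192.168.240.0/20 above does fall inside the podSubnet 192.168.0.0/16.) For example, a minimal sketch of the same config with serviceSubnet dropped, so kind falls back to its default, non-overlapping service CIDR:

apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker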

@pagarwal-tibco
Author

@lmm I tried removing serviceSubnet, and also tried a non-overlapping value, but the issue is still the same.

@lwr20 I have a capable machine, and top shows the CPU is idle as well.

@lmm
Contributor

lmm commented Jul 13, 2021

@pagarwal-tibco are you using a Linux host? I cannot repro what you're seeing and we use kind quite a bit in our automated tests. Perhaps there is something on your host that is interfering with Calico.

If you're using a Mac, there is this kind issue that might be worth looking into: kubernetes-sigs/kind#2308

@pierluigilenoci

@caseydavenport why did you close the ticket?

@nelljerram
Member

@pierluigilenoci I presume because the OP has not responded since 13th July?

@pierluigilenoci

A month is not that long. Maybe he caught COVID or is on vacation. Let's try to nudge him...

@pagarwal-tibco knock knock!

@caseydavenport
Member

A month is plenty long - we usually close tickets without a response in 2-3 weeks. We can always re-open if the OP returns.

@pagarwal-tibco
Author

Sorry for the late reply, I was away. I upgraded Docker for Mac to 3.6.0 and can confirm that it works now. So it seems the issue was caused by Docker for Mac.

Thanks for all the help.

@nelljerram
Member

Thanks @pagarwal-tibco !

@ciiiii

ciiiii commented Oct 9, 2021

The same problem on a k8s node (Ubuntu 18.04.5 LTS / 5.4.0-60-generic):

Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  11m (x8338 over 10d)  kubelet  (combined from similar events): Readiness probe failed: 2021-10-09 06:49:07.655 [INFO][27506] confd/health.go 180: Number of node(s) with BGP peering established = 76
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  5m17s (x6281 over 20d)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

@nelljerram
Member

@ciiiii Please open a new issue, and describe

  • the Calico version
  • your cluster setup process
  • whether you see this liveness issue immediately after cluster setup, or if it occurs after you've made some other change.

@fcolista

Seems that nobody cares about this issue...

@gondaz

gondaz commented Apr 18, 2022

I've been struggling with this issue for the past few days and managed to fix it by editing a ClusterRole resource.
I have an RKE-based cluster (version 1.21.10), and I upgraded the Calico-related images to 3.21.5; after that, the initial health-check issue cropped up.
Make sure you have the proper ClusterRole manifest, as follows (copied from the original Calico website):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  # EndpointSlices are used for Service-based network policy rule
  # enforcement.
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs:
      - watch
      - list
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  # Pod CIDR auto-detection on kubeadm needs access to config maps.
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
      - caliconodestatuses
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch

Hopefully it helps.
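
To apply and spot-check it (a minimal sketch; the file name is illustrative, and the service account below assumes the standard calico-node account in kube-system, which may differ on an RKE install):

    kubectl apply -f calico-node-clusterrole.yaml
    # verify one of the newer permissions is actually granted to the calico-node service account
    kubectl auth can-i watch blockaffinities.crd.projectcalico.org \
      --as=system:serviceaccount:kube-system:calico-node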

@willzhang

Upgrading the Calico version resolved my problem; see kubesphere/kubekey#1282.

@rajaie-sg

We ran into a similar issue and were able to resolve it by setting CPU requests for calico-node Pods. #3420 (comment)
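
For example, a minimal sketch of what that looks like on the calico-node container in the DaemonSet spec (the values are illustrative, not recommendations):

    # under the calico-node container in the DaemonSet pod template
    resources:
      requests:
        cpu: 250m
        memory: 128Mi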

@LiAuTraver

I checked my logs and found I had forgotten to install ipset, which calico-node requires. (I was in a rush, though.)
So I just installed it and the problem disappeared. 😂
Maybe I'm being dumb here, but I'm posting in case anyone else runs into the same situation.
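
For anyone hitting the same thing, a minimal sketch of installing it (assuming a Debian- or RHEL-family host):

    # Debian/Ubuntu
    apt-get update && apt-get install -y ipset
    # RHEL/CentOS
    yum install -y ipset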
