calico-kube-controllers pod stuck in not Ready for 13 min #3751

Closed
hakman opened this issue Jul 6, 2020 · 22 comments

Comments

@hakman
Contributor

hakman commented Jul 6, 2020

In a Kubernetes cluster created with Kops, replacing the master node(s) puts the calico-kube-controllers pod into a not Ready state.
It recovers on its own after about 13 min, which is quite slow.
Deleting the pod creates a new one that becomes Ready instantly.
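For reference, the one-off workaround looks roughly like this (assuming the pods carry the standard k8s-app label used by the Calico manifests):

# Delete the stuck pod; the replacement comes up Ready immediately.
kubectl delete pod -n kube-system -l k8s-app=calico-kube-controllers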

Expected Behavior

calico-kube-controllers should recover much faster than 13 min.

Current Behavior

calico-kube-controllers waits 13 min to recover.

Possible Solution

The simplest generic fix would be to add a liveness probe so that the pod is restarted automatically when it loses its datastore connection.
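A rough sketch of what that could look like, using a JSON patch against the deployment (the /usr/bin/check-status -l command is an assumption borrowed from newer Calico manifests, and the thresholds are illustrative only):

# Add an exec liveness probe so the kubelet restarts the container when it
# can no longer verify the datastore connection.
kubectl patch deployment calico-kube-controllers -n kube-system --type=json -p '[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {"exec": {"command": ["/usr/bin/check-status", "-l"]},
             "periodSeconds": 10, "timeoutSeconds": 10, "failureThreshold": 6}}
]'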

Steps to Reproduce (for bugs)

  1. Create a simple Kubernetes cluster using Kops v1.17.1, with --networking=calico.
     The steps are documented here: https://kops.sigs.k8s.io/getting_started/aws/.
  2. Build the cluster:
$ kops update cluster --yes
  3. Validate the cluster:
kops validate cluster --wait 15m
  4. Replace the master node:
kops rolling-update cluster --yes --cloudonly --instance-group master-a --force
  5. Wait for a new master to be created and check the status of the calico-kube-controllers pod:
kubectl logs -f -n kube-system calico-kube-controllers-76bd59c54c-57j6r
2020-07-06 04:38:41.397 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", ReconcilerPeriod:"5m", CompactionPeriod:"10m", EnabledControllers:"node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", HealthEnabled:true, SyncNodeLabels:true, DatastoreType:"kubernetes"}
W0706 04:38:41.398065       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-07-06 04:38:41.398 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 89: Start called
2020-07-06 04:38:41.409 [INFO][1] main.go 183: Starting status report routine
2020-07-06 04:38:41.409 [INFO][1] main.go 368: Starting controller ControllerType="Node"
2020-07-06 04:38:41.409 [INFO][1] node_controller.go 130: Starting Node controller
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2020-07-06 04:38:41.409 [INFO][1] node_syncer.go 39: Node controller syncer status updated: wait-for-ready
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2020-07-06 04:38:41.416 [INFO][1] watchercache.go 291: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2020-07-06 04:38:41.416 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2020-07-06 04:38:41.417 [INFO][1] node_syncer.go 39: Node controller syncer status updated: resync
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2020-07-06 04:38:41.417 [INFO][1] node_syncer.go 39: Node controller syncer status updated: in-sync
2020-07-06 04:38:41.509 [INFO][1] node_controller.go 143: Node controller is now running
2020-07-06 04:38:41.509 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 04:38:41.541 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2020-07-06 04:39:31.537 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 04:39:31.580 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-4-56-126.eu-west-1.compute.internal) with error: nodes "ip-10-4-56-126.eu-west-1.compute.internal" not found
2020-07-06 04:39:31.581 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-4-56-126.eu-west-1.compute.internal" k8sNode=""
2020-07-06 04:39:31.586 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-4-56-126.eu-west-1.compute.internal" k8sNode=""
2020-07-06 04:39:31.586 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.cbee49f2bf9d3ae7c4633561ccab65a8a2390d1e47b39b8b1dc572e47e6261ea'
2020-07-06 04:39:31.603 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.603 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.29600955ba32b51269bf6e9db403a3abb4d854e0f57e8958e038a82f7021a596'
2020-07-06 04:39:31.618 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.618 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.afe9203ddad2fecf4a883fb3a778f8ebd2a9174ba6c0bc291914435ec6c0054d'
2020-07-06 04:39:31.634 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.634 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-4-56-126.eu-west-1.compute.internal'
2020-07-06 04:39:31.649 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.649 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.6330af27335563bada4a42b904340abde31da0fdb8b619339e295e3102ef1ddc'
2020-07-06 04:39:31.665 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.665 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.b7d1145f4e32b7fac254f5a38b332ff1304addb7928240109f9473d4bac7e9e1'
2020-07-06 04:39:31.680 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.736 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2020-07-06 09:28:18.489 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:28:18.489 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:28:50.489 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:10.489 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:10.489 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:42.490 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:30:02.490 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:30:02.490 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:30:34.490 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:30:54.491 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:30:54.491 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:31:26.491 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:31:46.491 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:31:46.492 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:32:18.492 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:32:38.492 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:32:38.492 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:33:10.493 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:33:30.493 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:33:30.493 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:34:02.493 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:34:22.494 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:34:22.494 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:34:54.494 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:35:14.494 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:35:14.494 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:35:46.495 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:36:06.495 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:36:06.495 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:36:38.495 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:36:58.496 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:36:58.496 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:37:30.496 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:37:50.497 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:37:50.497 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:38:22.497 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:38:42.497 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:38:42.497 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:39:14.497 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:39:34.498 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:39:34.498 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:40:06.498 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:40:26.499 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:40:26.499 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:40:58.499 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:41:18.499 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:41:18.499 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:41:50.500 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:42:10.500 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:42:10.500 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:42:42.501 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:43:02.501 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:43:02.501 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:43:34.501 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:43:54.502 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:43:54.502 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:44:26.502 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:44:46.502 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:44:46.502 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:45:18.503 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
W0706 09:45:31.822278       1 reflector.go:299] pkg/mod/k8s.io/client-go@v0.0.0-20191114101535-6c5935290e33/tools/cache/reflector.go:96: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host") has prevented the request from succeeding
2020-07-06 09:45:31.822 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host
2020-07-06 09:45:31.822 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host
2020-07-06 09:45:32.826 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 09:45:32.844 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-4-54-58.eu-west-1.compute.internal) with error: nodes "ip-10-4-54-58.eu-west-1.compute.internal" not found
2020-07-06 09:45:32.844 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-4-54-58.eu-west-1.compute.internal" k8sNode=""
2020-07-06 09:45:32.849 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-4-54-58.eu-west-1.compute.internal" k8sNode=""
2020-07-06 09:45:32.849 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-4-54-58.eu-west-1.compute.internal'
2020-07-06 09:45:32.864 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.864 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.bf8da93299c06bedff227fc91f4ef3e6193776c90f49fdd67ff23c0cbc8b582b'
2020-07-06 09:45:32.879 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.879 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.c6e7449430362a2b07b5a86fcb65302610b647e5f74e8770d50108d60bc2aa33'
2020-07-06 09:45:32.897 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.897 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.dc2c3a527c56bdd6cdd40436d679faa72ed066c3ad2e2f0c1dc4fa712b88d4c9'
2020-07-06 09:45:32.912 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.944 [INFO][1] ipam.go 190: Node and IPAM data is in sync
^C

kubectl describe pod calico-kube-controllers-76bd59c54c-57j6r -n kube-system | grep Events: -A 10
Events:
  Type     Reason     Age                 From                                               Message
  ----     ------     ----                ----                                               -------
  Warning  Unhealthy  37m                 kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error reaching apiserver: taking a long time to check apiserver; Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning  Unhealthy  30m (x24 over 41m)  kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0
  Warning  Unhealthy  25m (x48 over 41m)  kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver

Context

Kops validates the cluster based on the status of the kube-system pods. This issue prevents the cluster from being upgraded without manual intervention and also slows down the rolling update.

Your Environment

@hakman
Contributor Author

hakman commented Jul 6, 2020

CC: @lwr20

@fasaxc
Member

fasaxc commented Jul 6, 2020

@hakman do you happen to use a sticky service for the API server?

@hakman
Contributor Author

hakman commented Jul 6, 2020

This is how the service looks on that cluster @fasaxc:

% kubectl describe service kubernetes

Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                100.64.0.1
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         10.4.39.68:443
Session Affinity:  None
Events:            <none>

@fasaxc
Member

fasaxc commented Jul 6, 2020

10.4.39.68:443

Only one endpoint? Shouldn't it have one per control-plane node?

@hakman
Contributor Author

hakman commented Jul 6, 2020

This is a cluster with a single master.
Happens similarly in a cluster with 3 masters. In that case, all 3 would be in the list.
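For reference, the endpoint list behind the ClusterIP can be checked with plain kubectl:

# There should be one API server endpoint per control-plane node.
kubectl get endpoints kubernetes -n default -o wide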

@fasaxc
Member

fasaxc commented Jul 9, 2020

I think another user has identified the root cause here: https://github.com/projectcalico/libcalico-go/issues/1267

@hakman
Contributor Author

hakman commented Jul 9, 2020

Nice! Thanks for the update.

@caseydavenport
Member

I'm going to close this since we're tracking the root cause in https://github.com/projectcalico/libcalico-go/issues/1267

@alok87

alok87 commented Jan 10, 2022

This link does not work: https://github.com/projectcalico/libcalico-go/issues/1267

How do we go about fixing this, @hakman?
This happens to us whenever our master gets replaced.

@MattLangsenkamp

I am also not able to see the link

@lwr20
Member

lwr20 commented Feb 22, 2022

I'm going to close this since we're tracking the root cause in https://github.com/projectcalico/libcalico-go/issues/1267
@caseydavenport since the move to monorepo, what's the new link to this issue?

@lwr20
Member

lwr20 commented Feb 22, 2022

FWIW, this PR claims to fix https://github.com/projectcalico/libcalico-go/issues/1267:
projectcalico/libcalico-go#1356, so the fix should be in recent versions of Calico.

@caseydavenport
Member

Yep, the underlying issue was fixed in Calico v3.18 and should be fixed in subsequent releases as well.

I can't seem to find the original GH issue link since it was migrated, but that was the fix.

@RsheikhAii3

This is a cluster with a single master. Happens similarly in a cluster with 3 masters. In that case, all 3 would be in the list.
Yep, the underlying issue was fixed in Calico v3.18 and should be fixed in subsequent releases as well.

I can't seem to find the original GH issue link since it was migrated, but that was the fix.

I have searched in vain for that issue since https://github.com/projectcalico/libcalico-go/issues/1267 moved to https://github.com/projectcalico/calico/issues (there is no 1267 there). In desperation, I am reaching out to see if you remember what the fix was ...

@hakman
Contributor Author

hakman commented Jun 29, 2022

kubernetes/client-go#374 was the actual root cause, fixed by kubernetes/kubernetes#95981.
On the Calico side the fix was to update kubernetes/client-go to v1.20.0+.
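For anyone checking later whether their install already carries the fix, inspecting the controller image tag (it should be v3.18 or newer) is usually enough; the jsonpath below assumes the default single-container deployment:

# Print the image (and tag) used by the calico-kube-controllers deployment.
kubectl get deployment calico-kube-controllers -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'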

@RsheikhAii3

kubernetes/client-go#374 was the actual root cause, fixed by kubernetes/kubernetes#95981. On the Calico side the fix was to update kubernetes/client-go to v1.20.0+.

Thank you for your reply. I debated posting this; I am a bit embarrassed since I am new to k8s and unsure whether this would waste your time. I am experiencing the following:

ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2022-06-29 01:11:48.762 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded

1 master and 1 worker, Ubuntu on AWS via kubeadm
Client Version: v1.23.1
Server Version: v1.23.1
GoVersion: go1.17.5 (client & master)
Calico version: 3.23.1
calico-kube-controllers is running on the worker node

I upgraded from 1.22.1-00 to 1.23.1-00 and experienced this right after (I have since stopped the instances twice), and have been researching a probable fix for the last few days without any success.

kube-system calico-kube-controllers-685b65ddf9-pnqwp 0/1 CrashLoopBackOff 13 (4m47s ago) 48m
I have tried editing the timeout values from 1 to 60 on the readiness and liveness probes, to no avail.

I am sure you won't have much time, but is there a forum you could point me to for further research?

Thank you in advance.

@fasaxc
Member

fasaxc commented Jun 29, 2022

@RsheikhAii3 I don't think your problem is related to this issue but I'm sure we'll be able to help you on our slack: https://slack.projectcalico.org/

@RsheikhAii3

@fasaxc Much appreciated; indeed, you are correct, and I will pursue it on the Slack channel. Just for documentation purposes: on further searching, the issue in the calico-node logs was "address already used, could not bind", on both the master and the worker.

Thank you to all of you. I appreciate your time and knowledge when guiding newbies.

@joeybdub

Any update on the issue?

@hakman
Contributor Author

hakman commented Oct 13, 2022

@joeybdub As mentioned before, this IS fixed. If you have a similar issue, it's just something that looks similar, nothing more.
It would be best to create a new issue or to ask via Slack, as you may get help faster. There are some really cool and helpful people there. 😉

@joeybdub

Thanks @hakman, there is already an issue open for what we are experiencing: Azure/AKS#2745

@hakman
Contributor Author

hakman commented Oct 13, 2022

@joeybdub The AKS issue seems unrelated. Your best bet is still Slack, where there may be someone more familiar with AKS who can help. Good luck!
