system-cluster-critical pod "calico-kube-controllers" is not ready for over 10 minutes when performing rolling update #10207

Closed
CheyiLin opened this issue Nov 10, 2020 · 6 comments

@CheyiLin (Contributor)

1. What kops version are you running?
Version 1.18.2 (git-84495481e4)

2. What Kubernetes version are you running?

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T01:06:31Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.13", GitCommit:"30d651da517185653e34e7ab99a792be6a3d9495", GitTreeState:"clean", BuildDate:"2020-10-15T00:59:17Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
kops rolling-update cluster --name my-cluster-name --yes

5. What happened after the commands executed?
The cluster cannot pass validation; kops keeps waiting for the rescheduled system-cluster-critical pod calico-kube-controllers to become ready.
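For reference, the same validation kops runs during the rolling update can also be triggered by hand with the standard validate command (cluster name matches the reproduce command above):

kops validate cluster --name my-cluster-name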

kops logs
...
I1110 13:55:22.615378   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers), system-cluster-critical pod "metrics-server-6dffd46f94-27fv5" is pending.
I1110 13:55:54.935669   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:56:27.642298   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:56:59.836149   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:57:32.576827   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:58:05.432661   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:58:38.093650   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
I1110 13:59:10.570779   87661 instancegroups.go:440] Cluster did not pass validation, will retry in "30s": system-cluster-critical pod "calico-kube-controllers-79c8f4bdc4-6tft2" is not ready (calico-kube-controllers).
...
pod calico-kube-controllers logs
2020-11-10 05:46:17.832 [INFO][1] ipam.go 45: Synchronizing IPAM data                                                                                                                                                                         
2020-11-10 05:46:17.927 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-104-191-29.us-west-2.compute.internal) with error: nodes "ip-10-104-191-29.us-west-2.compute.inte 
2020-11-10 05:46:17.927 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-104-191-29.us-west-2.compute.internal" k8sNode=""                                                                                                              
2020-11-10 05:46:17.932 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-104-191-29.us-west-2.compute.internal" k8sNode=""                                                                                
2020-11-10 05:46:17.932 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-104-191-29.us-west-2.compute.internal'                                                                                                  
2020-11-10 05:46:18.066 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.066 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.36e91144959c16cd8a006ee6d91c7f7d5b771f38bb0cfdf692049cee6f545de1'                                                                              
2020-11-10 05:46:18.117 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.117 [INFO][1] ipam.go 1173: Releasing all IPs with handle 'k8s-pod-network.73961784a28b84cd384d6bda1d76c902a0b317baf33be5bd87fce0adc6d322f6'                                                                              
2020-11-10 05:46:18.169 [INFO][1] ipam.go 1492: Node doesn't exist, no need to release affinity cidr=100.119.230.64/26 host="ip-10-104-191-29.us-west-2.compute.internal"                                                                     
2020-11-10 05:46:18.251 [INFO][1] ipam.go 190: Node and IPAM data is in sync                                                                                                                                                                  
2020-11-10 05:46:51.498 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:46:51.498 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:47:23.501 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:47:43.502 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:47:43.502 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:48:15.503 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:48:35.504 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:48:35.504 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:49:07.504 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:49:27.505 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:49:27.505 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:49:59.505 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:50:19.506 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:50:19.506 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:50:51.507 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:51:11.508 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:51:11.508 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:51:43.509 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:52:03.509 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:52:03.510 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:52:35.512 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:52:55.513 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:52:55.513 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:53:27.513 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:53:47.514 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:53:47.514 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:54:19.514 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:54:39.515 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:54:39.515 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:55:11.516 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:55:31.517 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:55:31.517 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:56:03.517 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:56:23.518 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:56:23.518 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:56:55.519 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:57:15.519 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:57:15.519 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                            
2020-11-10 05:57:47.520 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                             
2020-11-10 05:58:07.520 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceed 
2020-11-10 05:58:07.520 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                               
...

6. What did you expect to happen?
The rescheduled pod calico-kube-controllers should become ready within a few minutes.

7. Please provide your cluster manifest.

kops cluster & ig manifests
---
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-name.k8s.local
spec:
  additionalNetworkCIDRs:
  - my-network-cidr
  api:
    loadBalancer:
      type: Internal
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-k8s-state/my-cluster-name.k8s.local
  etcdClusters:
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: main
  - etcdMembers:
    - encryptedVolume: true
      instanceGroup: master-a1
      name: master-a1
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
    node:
    - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    disableBasicAuth: true
    eventTTL: 3h0m0s
  kubeControllerManager:
    horizontalPodAutoscalerUseRestClients: true
  kubeDNS:
    coreDNSImage: k8s.gcr.io/coredns:1.7.0
    externalCoreFile: |-
      # forward AWS records to Amazon DNS server within VPC
      amazonaws.com:53 {
        errors
        prometheus :9153
        forward . 169.254.169.253
        cache 30
      }
      # forward external records to public DNS servers
      # public DNS providers: Google > Cloudflare > Quad9 (IBM)
      com net org io {
        errors
        prometheus :9153
        forward . 8.8.8.8 1.1.1.1 9.9.9.9 {
          policy sequential
        }
        cache 30
      }
      .:53 {
        log . "* - {type} {class} {name} {proto} {size} {rcode} {rsize} {duration} > {remote}"
        errors
        health
        autopath @kubernetes
        kubernetes cluster.local. in-addr.arpa ip6.arpa {
          pods verified
          endpoint_pod_names
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        loadbalance
        reload
        ready
      }
    nodeLocalDNS:
      enabled: true
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    enforceNodeAllocatable: pods
    kubeReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
    systemReserved:
      cpu: 100m
      ephemeral-storage: 1Gi
      memory: 200Mi
  kubernetesApiAccess:
  - 10.0.0.0/8
  kubernetesVersion: 1.17.13
  networkCIDR: my-network-cidr
  networkID: my-vpc-id
  networking:
    calico:
      crossSubnet: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.0.0.0/8
  subnets:
  - id: subnet-000001
    name: subnet-a1
    type: Private
    zone: us-west-2a
  - id: subnet-000002
    name: subnet-b1
    type: Private
    zone: us-west-2b
  - id: subnet-000003
    name: subnet-c1
    type: Private
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: master-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/disabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: master-a1
  role: Master
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-a1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-a1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-a1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-b1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-b1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-b1
  suspendProcesses:
  - AZRebalance

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my-cluster-name.k8s.local
  name: nodes-c1
spec:
  associatePublicIp: false
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    k8s.io/cluster-autoscaler/my-cluster-name.k8s.local: owned
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20201014
  machineType: t3a.large
  maxPrice: "0.04"
  maxSize: 3
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3a.large
    - t3.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: nodes-c1
  role: Node
  rootVolumeOptimization: true
  subnets:
  - subnet-c1
  suspendProcesses:
  - AZRebalance

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Not available yet.

9. Anything else do we need to know?

  1. The cluster uses the Calico CNI with cross-subnet mode enabled.
  networking:
    calico:
      crossSubnet: true
  2. Manually deleting the not-ready calico-kube-controllers pod resolves the issue: the newly created pod works normally, the cluster passes validation, and the rolling update continues (a rough command sketch follows).
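A sketch of that manual workaround (not part of the original report; it assumes the controller runs in kube-system with the usual k8s-app=calico-kube-controllers label, so adjust to your cluster):

# delete the stuck pod so its Deployment reschedules a fresh one
kubectl -n kube-system delete pod -l k8s-app=calico-kube-controllers
# or target the exact pod named in the validation failure
kubectl -n kube-system delete pod calico-kube-controllers-79c8f4bdc4-6tft2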
newly created pod calico-kube-controllers logs
2020-11-10 06:00:04.995 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kuberne 
W1110 06:00:04.997519       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                                                                                        
2020-11-10 06:00:04.998 [INFO][1] main.go 109: Ensuring Calico datastore is initialized                                                                                                                                                       
2020-11-10 06:00:05.009 [INFO][1] main.go 149: Getting initial config snapshot from datastore                                                                                                                                                 
2020-11-10 06:00:05.029 [INFO][1] main.go 152: Got initial config snapshot                                                                                                                                                                    
2020-11-10 06:00:05.029 [INFO][1] watchersyncer.go 89: Start called                                                                                                                                                                           
2020-11-10 06:00:05.029 [INFO][1] main.go 169: Starting status report routine                                                                                                                                                                 
2020-11-10 06:00:05.030 [INFO][1] main.go 402: Starting controller ControllerType="Node"                                                                                                                                                      
2020-11-10 06:00:05.030 [INFO][1] node_controller.go 138: Starting Node controller                                                                                                                                                            
2020-11-10 06:00:05.030 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready                                                                                                                                           
2020-11-10 06:00:05.030 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready                                                                                                                                    
2020-11-10 06:00:05.030 [INFO][1] watchersyncer.go 147: Starting main event processing loop                                                                                                                                                   
2020-11-10 06:00:05.031 [INFO][1] resources.go 343: Main client watcher loop                                                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] watchercache.go 297: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"                                                                                                          
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 127: Sending status update Status=resync                                                                                                                                                   
2020-11-10 06:00:05.037 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync                                                                                                                                            
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync                                                                                                                           
2020-11-10 06:00:05.037 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync                                                                                                                                                  
2020-11-10 06:00:05.037 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync                                                                                                                                           
2020-11-10 06:00:05.046 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints                                                                                                                                                  
2020-11-10 06:00:05.130 [INFO][1] node_controller.go 151: Node controller is now running                                                                                                                                                      
2020-11-10 06:00:05.130 [INFO][1] ipam.go 45: Synchronizing IPAM data                                                                                                                                                                         
2020-11-10 06:00:05.155 [INFO][1] ipam.go 190: Node and IPAM data is in sync
...
@zetaab (Member) commented Nov 10, 2020

For me it sometimes takes longer than 10 minutes; that is why I use a 20-minute max wait time.
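A sketch of how such a longer wait can be configured on the kops side (an illustration, not part of the original comment; confirm with kops rolling-update cluster --help that your kops version exposes the --validation-timeout flag):

# allow cluster validation to keep retrying for up to 20 minutes before giving up
kops rolling-update cluster --name my-cluster-name --validation-timeout 20m --yes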

@CheyiLin (Contributor, Author)

@zetaab Good to know that we're not alone. 😆
But if I manually delete the not-ready calico-kube-controllers pod, the newly created pod works normally and the cluster passes validation immediately.
In my opinion there should be something we can improve or fix to speed up the rolling-update process.

@hakman (Member) commented Nov 10, 2020

@CheyiLin This is a Calico and client-go issue, unrelated to kOps. Please read more about it in:
projectcalico/calico#3751
projectcalico/libcalico-go#1267
kubernetes/client-go#374

@CheyiLin (Contributor, Author)

@hakman The issue projectcalico/calico#3751 is exactly what I reported here.
Thanks for the context and pointing this out!

@hakman (Member) commented Nov 10, 2020

No problem. Keep an eye on those. I see there's some activity.

@rtacconi commented Jan 3, 2024
