GKE nodes going to NotReady state after installing capsule #597

Closed
MeghanaSrinath opened this issue Jun 23, 2022 · 2 comments
@MeghanaSrinath

We have a GKE cluster on which we have deployed Capsule. We used this values.yaml file during installation via Helm (roughly as sketched below). But once the cluster is scaled down to 0 nodes and then scaled back up, all of the new nodes go to the NotReady state; any node added by the autoscaler goes to NotReady as well.
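
For reference, the install was done roughly like this (the chart repository URL, release name, and namespace are approximations of a default Capsule installation; values.yaml is our own file):

:~$ helm repo add clastix https://clastix.github.io/charts
:~$ helm repo update
:~$ helm install capsule clastix/capsule -n capsule-system --create-namespace -f values.yaml

Describing one of the NotReady nodes shows the errors below.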

:~$ kubectl describe node gke-cluster-z8z9
Name:               gke-cluster-7d487589--z8z9
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-c
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-cluster-7d487589-z8z9
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=n1-standard-4
                    topology.gke.io/zone=us-central1-c
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-c
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/xxxxx/zones/us-central1-c/instances/gke-cluster-8z9"}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 23 Jun 2022 06:35:09 +0000
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  gke-cluster-z8z9
  AcquireTime:     <unset>
  RenewTime:       Thu, 23 Jun 2022 09:43:52 +0000
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  CorruptDockerOverlay2         False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  KernelDeadlock                False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  FrequentKubeletRestart        False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  NetworkUnavailable            True    Mon, 01 Jan 0001 00:00:00 +0000   Thu, 23 Jun 2022 06:35:09 +0000   NoRouteCreated                  Node created without a route
  MemoryPressure                False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletNotReady                 container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Also, a few of the pods in the kube-system namespace were in the Pending state, and the kube-proxy pod was crashing continuously on each node.

:~$ kubectl get pods -n kube-system
NAME                                                             READY   STATUS    RESTARTS   AGE
calico-node-vertical-autoscaler-7cf6d44df5-vswk7                 0/1     Pending   0          4h41m
calico-typha-7fdbb7d849-98592                                    0/1     Pending   0          4h41m
calico-typha-horizontal-autoscaler-hvd8g                         0/1     Pending   0          4h41m
calico-typha-vertical-autoscaler-8gnxv                           0/1     Pending   0          4h41m
event-exporter-gke-5479fd58c8-kfz5g                              0/2     Pending   0          4h41m
fluentbit-gke-fpkn8                                              2/2     Running   0          3h29m
fluentbit-gke-wgg76                                              2/2     Running   0          3h29m
gke-metrics-agent-bngdb                                          1/1     Running   0          3h29m
gke-metrics-agent-qwh9d                                          1/1     Running   0          3h29m
kube-dns-7f4d6f474d-6cpnp                                        0/4     Pending   0          4h41m
kube-dns-7f4d6f474d-mwfms                                        0/4     Pending   0          4h41m
kube-dns-autoscaler-844c9d9448-txxbl                             0/1     Pending   0          4h41m
kube-proxy-gke-cluster-jvnx                                      1/1     Running   31         3h29m
kube-proxy-gke-cluster-z8z9                                      1/1     Running   31         3h29m
l7-default-backend-69fb9fd9f9-pj544                              0/1     Pending   0          4h41m
metrics-server-v0.4.5-bbb794dcc-s7t6z                            0/2     Pending   0          4h41m
pdcsi-node-nk7nj                                                 2/2     Running   0          3h29m
pdcsi-node-xt2mm                                                 2/2     Running   0          3h29m

Describing the Calico pods showed this error:

:~$ kubectl describe pod calico-node-fgcsr  -n kube-system
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  37s               default-scheduler  Successfully assigned kube-system/calico-node-fgcsr to gke-cluster-jvnx
  Normal   Pulling    37s               kubelet            Pulling image "gke.gcr.io/calico/cni:v3.18.6-gke.0"
  Normal   Pulled     35s               kubelet            Successfully pulled image "gke.gcr.io/calico/cni:v3.18.6-gke.0" in 1.715111913s
  Normal   Created    34s               kubelet            Created container install-cni
  Normal   Started    34s               kubelet            Started container install-cni
  Normal   Pulling    32s               kubelet            Pulling image "gke.gcr.io/calico/node:v3.18.6-gke.1"
  Normal   Pulled     28s               kubelet            Successfully pulled image "gke.gcr.io/calico/node:v3.18.6-gke.1" in 3.765461978s
  Normal   Created    27s               kubelet            Created container calico-node
  Normal   Started    27s               kubelet            Started container calico-node
  Warning  Unhealthy  26s               kubelet            Readiness probe failed: Get "http://127.0.0.1:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  8s (x2 over 18s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

We could observe that, on autoscaling, the new VM instances did come up in the GKE console, but they were unable to join the cluster, and existing nodes went to NotReady on cluster restart.

If we delete the ValidatingWebhookConfiguration capsule-validating-webhook-configuration, all of these errors are resolved and the cluster works perfectly fine. We are even able to create tenants, and the restrictions work as expected.
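
For completeness, the workaround we applied was (the webhook object name is the one created by the Capsule installation):

:~$ kubectl delete validatingwebhookconfiguration capsule-validating-webhook-configuration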

Please let us know why this webhook causes this problem and whether it can be fixed.

@MeghanaSrinath added the blocked-needs-validation (Issue needs triage and validation) and bug (Something isn't working) labels on Jun 23, 2022
@prometherion self-assigned this on Jun 23, 2022
@prometherion
Member

Hi @MeghanaSrinath, thanks for reporting this!

This is expected behavior, and the cause is the following webhook:
https://github.com/clastix/capsule/blob/741db523e5edd8aa0fb5dece5005542217317d9c/charts/capsule/templates/validatingwebhookconfiguration.yaml#L259-L286

This webhook controls specific actions that a tenant owner could issue against the nodes of their tenant. It's a requirement for the BYOD scenario, together with Capsule Proxy (see its Nodes section). A rough, abridged sketch of that entry is below.
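
The sketch below only illustrates the shape of that entry; the webhook name, service name, and namespace are approximations of a default chart installation, and the linked template is authoritative:

  - name: nodes.capsule.clastix.io
    # failurePolicy: Fail rejects node updates (including kubelet status updates)
    # whenever the Capsule webhook service is unreachable, e.g. after scaling to zero
    failurePolicy: Fail
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        operations: ["UPDATE"]
    clientConfig:
      service:
        name: capsule-webhook-service
        namespace: capsule-system
        path: /nodes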

For your use case, the /nodes handler policy is too strict; you can change the value of failurePolicy from Fail to Ignore using the Helm --set CLI flag (--set webhooks.nodes.failurePolicy=Ignore), for example as shown below.
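
For example, with a default release name and namespace (adjust these to your installation):

:~$ helm upgrade capsule clastix/capsule -n capsule-system --reuse-values --set webhooks.nodes.failurePolicy=Ignore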

Let me know if I can help you further with any change you'd like to propose, and feel free to close the issue.

@prometherion added the question (Further information is requested) label and removed the bug and blocked-needs-validation labels on Jun 23, 2022
@MeghanaSrinath
Author

This fix worked perfectly. Thank you!
