GKE nodes going to NotReady state after installing capsule #597

Closed
MeghanaSrinath opened this issue Jun 23, 2022 · 2 comments
@MeghanaSrinath

We have a GKE cluster on which we have deployed Capsule. We used this values.yaml file during installation via Helm (roughly as sketched below). But once the cluster is scaled down to 0 nodes and then scaled back up, all of the new nodes go to the NotReady state; any node added by the autoscaler goes to NotReady as well.
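
For reference, the install was done roughly like this (the chart repository URL, release name, and namespace are approximations of a default Capsule installation; values.yaml is our own file):

:~$ helm repo add clastix https://clastix.github.io/charts
:~$ helm repo update
:~$ helm install capsule clastix/capsule -n capsule-system --create-namespace -f values.yaml

Describing one of the NotReady nodes shows the errors below.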

:~$ kubectl describe node gke-cluster-z8z9
Name:               gke-cluster-7d487589--z8z9
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-c
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gke-cluster-7d487589-z8z9
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=n1-standard-4
                    topology.gke.io/zone=us-central1-c
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-c
Annotations:        csi.volume.kubernetes.io/nodeid:
                      {"pd.csi.storage.gke.io":"projects/xxxxx/zones/us-central1-c/instances/gke-cluster-8z9"}
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 23 Jun 2022 06:35:09 +0000
Taints:             node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  gke-cluster-z8z9
  AcquireTime:     <unset>
  RenewTime:       Thu, 23 Jun 2022 09:43:52 +0000
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  CorruptDockerOverlay2         False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentUnregisterNetDevice   node is functioning properly
  KernelDeadlock                False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   FilesystemIsNotReadOnly         Filesystem is not read-only
  FrequentKubeletRestart        False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Thu, 23 Jun 2022 09:40:36 +0000   Thu, 23 Jun 2022 06:35:14 +0000   NoFrequentContainerdRestart     containerd is functioning properly
  NetworkUnavailable            True    Mon, 01 Jan 0001 00:00:00 +0000   Thu, 23 Jun 2022 06:35:09 +0000   NoRouteCreated                  Node created without a route
  MemoryPressure                False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         False   Thu, 23 Jun 2022 09:41:38 +0000   Thu, 23 Jun 2022 06:35:09 +0000   KubeletNotReady                 container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Also, a few of the pods in the kube-system namespace were in the Pending state, and the kube-proxy pod was crashing continuously on each node.

:~$ kubectl get pods -n kube-system
NAME                                                             READY   STATUS    RESTARTS   AGE
calico-node-vertical-autoscaler-7cf6d44df5-vswk7                 0/1     Pending   0          4h41m
calico-typha-7fdbb7d849-98592                                    0/1     Pending   0          4h41m
calico-typha-horizontal-autoscaler-hvd8g                         0/1     Pending   0          4h41m
calico-typha-vertical-autoscaler-8gnxv                           0/1     Pending   0          4h41m
event-exporter-gke-5479fd58c8-kfz5g                              0/2     Pending   0          4h41m
fluentbit-gke-fpkn8                                              2/2     Running   0          3h29m
fluentbit-gke-wgg76                                              2/2     Running   0          3h29m
gke-metrics-agent-bngdb                                          1/1     Running   0          3h29m
gke-metrics-agent-qwh9d                                          1/1     Running   0          3h29m
kube-dns-7f4d6f474d-6cpnp                                        0/4     Pending   0          4h41m
kube-dns-7f4d6f474d-mwfms                                        0/4     Pending   0          4h41m
kube-dns-autoscaler-844c9d9448-txxbl                             0/1     Pending   0          4h41m
kube-proxy-gke-cluster-jvnx                                      1/1     Running   31         3h29m
kube-proxy-gke-cluster-z8z9                                      1/1     Running   31         3h29m
l7-default-backend-69fb9fd9f9-pj544                              0/1     Pending   0          4h41m
metrics-server-v0.4.5-bbb794dcc-s7t6z                            0/2     Pending   0          4h41m
pdcsi-node-nk7nj                                                 2/2     Running   0          3h29m
pdcsi-node-xt2mm                                                 2/2     Running   0          3h29m

Describing the Calico pods showed this error:

:~$ kubectl describe pod calico-node-fgcsr  -n kube-system
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  37s               default-scheduler  Successfully assigned kube-system/calico-node-fgcsr to gke-cluster-jvnx
  Normal   Pulling    37s               kubelet            Pulling image "gke.gcr.io/calico/cni:v3.18.6-gke.0"
  Normal   Pulled     35s               kubelet            Successfully pulled image "gke.gcr.io/calico/cni:v3.18.6-gke.0" in 1.715111913s
  Normal   Created    34s               kubelet            Created container install-cni
  Normal   Started    34s               kubelet            Started container install-cni
  Normal   Pulling    32s               kubelet            Pulling image "gke.gcr.io/calico/node:v3.18.6-gke.1"
  Normal   Pulled     28s               kubelet            Successfully pulled image "gke.gcr.io/calico/node:v3.18.6-gke.1" in 3.765461978s
  Normal   Created    27s               kubelet            Created container calico-node
  Normal   Started    27s               kubelet            Started container calico-node
  Warning  Unhealthy  26s               kubelet            Readiness probe failed: Get "http://127.0.0.1:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  8s (x2 over 18s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

We could observe that, on autoscaling, the new VM instances did come up in the GKE console, but they were unable to join the cluster, and existing nodes went to NotReady on cluster restart.

If we delete the ValidatingWebhookConfiguration capsule-validating-webhook-configuration, all of these errors are resolved and the cluster works perfectly fine. We are even able to create tenants, and the restrictions work as expected.
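
For completeness, the workaround we applied was (the webhook object name is the one created by the Capsule installation):

:~$ kubectl delete validatingwebhookconfiguration capsule-validating-webhook-configuration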

Please let us know why this webhook causes this problem and whether it can be fixed.

@MeghanaSrinath added the blocked-needs-validation (Issue needs triage and validation) and bug (Something isn't working) labels on Jun 23, 2022
@prometherion self-assigned this on Jun 23, 2022
@prometherion
Member

Hi @MeghanaSrinath, thanks for reporting this!

This is expected behavior, and the cause is the following webhook:
https://github.com/clastix/capsule/blob/741db523e5edd8aa0fb5dece5005542217317d9c/charts/capsule/templates/validatingwebhookconfiguration.yaml#L259-L286

This webhook controls specific actions that a tenant owner could issue against the nodes of their tenant. It's a requirement for the BYOD scenario, together with Capsule Proxy (see its Nodes section). A rough, abridged sketch of that entry is below.
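
The sketch below only illustrates the shape of that entry; the webhook name, service name, and namespace are approximations of a default chart installation, and the linked template is authoritative:

  - name: nodes.capsule.clastix.io
    # failurePolicy: Fail rejects node updates (including kubelet status updates)
    # whenever the Capsule webhook service is unreachable, e.g. after scaling to zero
    failurePolicy: Fail
    rules:
      - apiGroups: [""]
        resources: ["nodes"]
        operations: ["UPDATE"]
    clientConfig:
      service:
        name: capsule-webhook-service
        namespace: capsule-system
        path: /nodes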

For your use case, the /nodes handler policy is too strict; you can change the value of failurePolicy from Fail to Ignore using the Helm --set CLI flag (--set webhooks.nodes.failurePolicy=Ignore), for example as shown below.
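
For example, with a default release name and namespace (adjust these to your installation):

:~$ helm upgrade capsule clastix/capsule -n capsule-system --reuse-values --set webhooks.nodes.failurePolicy=Ignore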

Let me know if I can help you further with any change you'd like to propose, and feel free to close the issue.

@prometherion added the question (Further information is requested) label and removed the bug and blocked-needs-validation labels on Jun 23, 2022
@MeghanaSrinath
Author

This fix worked perfectly. Thank you!
