
# Troubleshooting

Table of contents:

 * [Helm Error: no available release name found](#helm-error-no-available-release-name-found)
 * [NFS volume mount failure: wrong fs type](#nfs-volume-mount-failure-wrong-fs-type)
 * [Recover Statefulset when node fails](#recover-statefulset-when-node-fails)
 * [Recover Operator when node fails](#recover-operator-when-node-fails)
 * [External-IP details truncated in older Kubectl Client Versions](#external-ip-details-truncated-in-older-kubectl-client-versions)
 * [Logs missing when Pravega upgrades](#logs-missing-when-pravega-upgrades)

## Helm Error: no available release name found

When installing a cluster for the first time using `kubeadm`, the initialization defaults to setting up RBAC-controlled access, which restricts the permissions Tiller needs to perform installations, scan for installed components, and so on. `helm init` works without issue, but `helm list`, `helm install`, and other commands do not work.

```
$ helm install stable/nfs-server-provisioner
Error: no available release name found
```

The following workaround can be applied to resolve the issue:

1. Create a service account for Tiller:

   ```
   kubectl create serviceaccount --namespace kube-system tiller
   ```

2. Bind that service account to the `cluster-admin` ClusterRole:

   ```
   kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
   ```

3. Add the service account to the Tiller deployment:

   ```
   kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
   ```

The above commands should resolve the errors, and `helm install` should work correctly.
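
To confirm the patch took effect, a quick check along these lines can be used (a sketch, assuming the default Tiller deployment created by `helm init`):

```
# The Tiller deployment should now run under the tiller service account
kubectl get deploy tiller-deploy --namespace kube-system \
  -o jsonpath='{.spec.template.spec.serviceAccountName}'
# should print: tiller

# Installations should now succeed, for example:
helm install stable/nfs-server-provisioner
```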

## NFS volume mount failure: wrong fs type

If you experience `wrong fs type` issues when pods try to mount NFS volumes, as in the `kubectl describe po/pravega-segmentstore-0` snippet below, make sure that all Kubernetes nodes have the `nfs-common` system package installed. You can run the `mount.nfs` command to verify that NFS support is available on each node.

In PKS, make sure to use v1.2.3 or newer. Older versions of PKS do not have NFS support installed on the Kubernetes nodes.

```
Events:
  Type     Reason       Age                        From                                           Message
  ----     ------       ----                       ----                                           -------
  Warning  FailedMount  10m (x222 over 10h)        kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468  Unable to mount volumes for pod "nautilus-segmentstore-0_nautilus-pravega(79167f33-f73b-11e8-936a-005056aeca39)": timeout expired waiting for volumes to attach or mount for pod "nautilus-pravega"/"nautilus-segmentstore-0". list of unmounted volumes=[tier2]. list of unattached volumes=[cache tier2 pravega-segment-store-token-fvxql]
  Warning  FailedMount  <invalid> (x343 over 10h)  kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468  (combined from similar events): MountVolume.SetUp failed for volume "pvc-6fa77d63-f73b-11e8-936a-005056aeca39" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for   /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 --scope -- mount -t nfs -o vers=4.1 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39
Output: Running scope as unit run-rc77b988cdec041f6aa91c8ddd8455587.scope.
mount: wrong fs type, bad option, bad superblock on 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
```
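
As an illustration, on Debian/Ubuntu nodes the missing package can be checked for and installed as follows (package names differ on other distributions; this is a sketch, not a PKS-specific procedure):

```
# Check whether the NFS mount helper is present on the node
which mount.nfs || echo "mount.nfs not found"

# Install the NFS client utilities (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install -y nfs-common
```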

## Recover Statefulset when node fails

When a node fails, the StatefulSet pods on that node, unlike Deployment pods, are not automatically rescheduled to other available nodes. This is because Kubernetes guarantees at-most-once execution of a StatefulSet pod. See the design.

If the failed node is not coming back, the cluster admin can manually recover the lost StatefulSet pods. To do that, the cluster admin can delete the failed node object from the apiserver by running

```
kubectl delete node <node name>
```

After the failed node is deleted from Kubernetes, the StatefulSet pods that were on it will be rescheduled to other available nodes.
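
A sketch of the recovery flow, assuming a hypothetical failed node named `worker-node-3`:

```
# Identify the failed (NotReady) node
kubectl get nodes

# Remove the failed node object so its StatefulSet pods can be rescheduled
kubectl delete node worker-node-3

# Watch the pods come back up on other nodes
kubectl get pods -o wide -w
```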

## Recover Operator when node fails

If the Operator pod is deployed on a node that fails, the pod will be rescheduled to a healthy node. However, the Operator will not function properly there, because it uses a leader-election locking mechanism. See here.

To make it work again, the cluster admin needs to delete the lock by running

```
kubectl delete configmap pravega-operator-lock
```

After that, the new Operator pod will become the leader. If the failed node comes back up later, the extra Operator pod will be deleted by the Deployment controller.
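
A sketch of the recovery steps (the label selector and deployment name below assume a default `pravega-operator` installation and may need adjusting):

```
# Release the stale leader-election lock
kubectl delete configmap pravega-operator-lock

# Confirm the new Operator pod is running and has acquired leadership
kubectl get pods -l name=pravega-operator
kubectl logs deploy/pravega-operator | grep -i leader
```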

## External-IP details truncated in older Kubectl Client Versions

When Pravega is deployed with external access enabled, an External-IP is assigned to its controller and segment store services, which clients use to reach Pravega. The External-IP details can be viewed in the output of `kubectl get svc`. However, with kubectl client version v1.10.x or lower, the External-IP for the controller and segment store services appears truncated in the output.

```
# kubectl get svc
NAME                                    TYPE           CLUSTER-IP       EXTERNAL-IP        PORT(S)                          AGE
kubernetes                              ClusterIP      10.100.200.1     <none>             443/TCP                          6d
pravega-bookie-headless                 ClusterIP      None             <none>             3181/TCP                         6d
pravega-pravega-controller              LoadBalancer   10.100.200.11    10.240.124.15...   10080:31391/TCP,9090:30301/TCP   6d
pravega-pravega-segmentstore-0          LoadBalancer   10.100.200.59    10.240.124.15...   12345:30597/TCP                  6d
pravega-pravega-segmentstore-1          LoadBalancer   10.100.200.42    10.240.124.15...   12345:30840/TCP                  6d
pravega-pravega-segmentstore-2          LoadBalancer   10.100.200.83    10.240.124.15...   12345:31170/TCP                  6d
pravega-pravega-segmentstore-headless   ClusterIP      None             <none>             12345/TCP                        6d
pravega-zk-client                       ClusterIP      10.100.200.120   <none>             2181/TCP                         6d
pravega-zk-headless                     ClusterIP      None             <none>             2888/TCP,3888/TCP                6d
```

This problem has been resolved in kubectl client version v1.11.0 and later.

```
# kubectl get svc
NAME                                    TYPE           CLUSTER-IP       EXTERNAL-IP                     PORT(S)                          AGE
kubernetes                              ClusterIP      10.100.200.1     <none>                          443/TCP                          6d20h
pravega-bookie-headless                 ClusterIP      None             <none>                          3181/TCP                         6d3h
pravega-pravega-controller              LoadBalancer   10.100.200.11    10.240.124.155,100.64.112.185   10080:31391/TCP,9090:30301/TCP   6d3h
pravega-pravega-segmentstore-0          LoadBalancer   10.100.200.59    10.240.124.156,100.64.112.185   12345:30597/TCP                  6d3h
pravega-pravega-segmentstore-1          LoadBalancer   10.100.200.42    10.240.124.157,100.64.112.185   12345:30840/TCP                  6d3h
pravega-pravega-segmentstore-2          LoadBalancer   10.100.200.83    10.240.124.158,100.64.112.185   12345:31170/TCP                  6d3h
pravega-pravega-segmentstore-headless   ClusterIP      None             <none>                          12345/TCP                        6d3h
pravega-zk-client                       ClusterIP      10.100.200.120   <none>                          2181/TCP                         6d3h
pravega-zk-headless                     ClusterIP      None             <none>                          2888/TCP,3888/TCP                6d3h
```

Also, when using kubectl client version v1.10.x or lower, the complete External-IP can still be viewed by running `kubectl describe svc` on the service in question.

```
# kubectl describe svc pravega-pravega-controller
Name:                     pravega-pravega-controller
Namespace:                default
Labels:                   app=pravega-cluster
                          component=pravega-controller
                          pravega_cluster=pravega
Annotations:              ncp/internal_ip_for_policy=100.64.161.119
Selector:                 app=pravega-cluster,component=pravega-controller,pravega_cluster=pravega
Type:                     LoadBalancer
IP:                       10.100.200.34
LoadBalancer Ingress:     10.247.114.149, 100.64.161.119
Port:                     rest  10080/TCP
TargetPort:               10080/TCP
NodePort:                 rest  32097/TCP
Endpoints:                
Port:                     grpc  9090/TCP
TargetPort:               9090/TCP
NodePort:                 grpc  32705/TCP
Endpoints:                
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
```
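
Alternatively, regardless of client version, the full list of ingress IPs can be read directly from the service object with a JSONPath query (service name taken from the example above):

```
kubectl get svc pravega-pravega-controller \
  -o jsonpath='{.status.loadBalancer.ingress[*].ip}'
```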

## Logs missing when Pravega upgrades

Users may find that Pravega logs for old pods are missing after an upgrade. This is because the operator uses the Kubernetes rolling-update strategy to upgrade pods one at a time. This strategy creates a new ReplicaSet for the update: it kills one pod in the old ReplicaSet and starts a replacement pod in the new ReplicaSet. After the upgrade, users are therefore running pods from a new ReplicaSet, so the logs of the old pods can no longer be obtained with `kubectl logs`.
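
If the old logs are needed, one workaround is to capture them before triggering an upgrade, since the pods in the old ReplicaSet are deleted afterwards. A minimal sketch, assuming the pods carry the `app=pravega-cluster` label shown in the service description above:

```
# Save logs from all Pravega pods before upgrading
for pod in $(kubectl get pods -l app=pravega-cluster -o name); do
  kubectl logs "$pod" --all-containers > "$(basename "$pod")-pre-upgrade.log"
done
```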