- Helm Error: no available release name found
- NFS volume mount failure: wrong fs type
- Recover Statefulset when node fails
- Recover Operator when node fails
- External-IP details truncated in older Kubectl Client Versions
- Logs missing when Pravega upgrades
When installing a cluster for the first time using kubeadm, the initialization defaults to setting up RBAC-controlled access, which interferes with the permissions Tiller needs to perform installations, scan for installed components, and so on. helm init works without issue, but helm list, helm install, and other commands do not work.
$ helm install stable/nfs-server-provisioner
Error: no available release name found
The following workaround can be applied to resolve the issue:
- Create a service account for Tiller.
kubectl create serviceaccount --namespace kube-system tiller
- Bind that service account to the cluster-admin ClusterRole.
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
- Add the service account to the Tiller deployment.
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
The above commands should resolve the errors, and helm install should work correctly.
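To confirm the workaround took effect, you can check the objects created above; the commands below assume the default Tiller deployment name tiller-deploy in the kube-system namespace:
# Check that the service account and role binding exist
kubectl -n kube-system get serviceaccount tiller
kubectl get clusterrolebinding tiller-cluster-rule
# Confirm the Tiller deployment references the tiller service account
kubectl -n kube-system get deploy tiller-deploy -o yaml | grep serviceAccount
# Helm commands should now work
helm list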
If you experience wrong fs type issues when pods try to mount NFS volumes, as in the kubectl describe po/pravega-segmentstore-0 snippet below, make sure that all Kubernetes nodes have the nfs-common system package installed. You can run the mount.nfs command to verify that NFS support is available on the node.
In PKS, make sure to use v1.2.3 or newer. Older versions of PKS do not have NFS support installed on the Kubernetes nodes.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 10m (x222 over 10h) kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468 Unable to mount volumes for pod "nautilus-segmentstore-0_nautilus-pravega(79167f33-f73b-11e8-936a-005056aeca39)": timeout expired waiting for volumes to attach or mount for pod "nautilus-pravega"/"nautilus-segmentstore-0". list of unmounted volumes=[tier2]. list of unattached volumes=[cache tier2 pravega-segment-store-token-fvxql]
Warning FailedMount <invalid> (x343 over 10h) kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468 (combined from similar events): MountVolume.SetUp failed for volume "pvc-6fa77d63-f73b-11e8-936a-005056aeca39" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 --scope -- mount -t nfs -o vers=4.1 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39
Output: Running scope as unit run-rc77b988cdec041f6aa91c8ddd8455587.scope.
mount: wrong fs type, bad option, bad superblock on 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
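If the mount helper is missing, installing the NFS client package on every node resolves the error above. A minimal check and install sketch, assuming Debian/Ubuntu based nodes (package names differ on other distributions, e.g. nfs-utils on RHEL/CentOS):
# Check whether the NFS mount helper is present on the node
which mount.nfs || echo "NFS client tools missing"
# Install the NFS client package (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install -y nfs-common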
When a node fails, StatefulSet pods on that node, unlike Deployment pods, will not be rescheduled to other available nodes automatically. This is because Kubernetes guarantees at-most-once execution of a StatefulSet pod. See the design.
If the failed node is not coming back, the cluster admin can manually recover the lost StatefulSet pods. To do so, delete the failed node object from the API server by running
kubectl delete node <node name>
After the failed node is deleted from Kubernetes, the StatefulSet pods that were on it will be rescheduled to other available nodes.
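For example, assuming the failed node appears as NotReady in kubectl get nodes (the node name below is illustrative):
# Identify the failed node
kubectl get nodes
# List the pods that are stuck on it (example node name)
kubectl get pods -o wide --field-selector spec.nodeName=worker-2
# Remove the node object so the StatefulSet pods can be rescheduled
kubectl delete node worker-2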
If the Operator pod is deployed on a node that fails, the pod will be rescheduled to a healthy node. However, the Operator will not function properly because of its leader-election locking mechanism. See here.
To make it work again, the cluster admin needs to delete the lock by running
kubectl delete configmap pravega-operator-lock
After that, the new Operator pod will become the leader. If the failed node comes back later, the extra Operator pod will be deleted by the Deployment controller.
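As a sketch, assuming the operator runs in the current namespace under a Deployment named pravega-operator and the lock ConfigMap is named as above:
# Inspect the lock to see which (now lost) pod still holds leadership
kubectl get configmap pravega-operator-lock -o yaml
# Delete the lock so the rescheduled operator pod can become the leader
kubectl delete configmap pravega-operator-lock
# Verify the new pod has taken over (deployment name is an assumption)
kubectl logs deploy/pravega-operator --tail=20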
When Pravega is deployed with external-access enabled, an External-IP is assigned to its controller and segment store services, which clients use to access them. The External-IP details can be viewed in the output of kubectl get svc.
However, when using kubectl client version v1.10.x or lower, the External-IP for the controller and segment store services appears truncated in the output.
# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.200.1 <none> 443/TCP 6d
pravega-bookie-headless ClusterIP None <none> 3181/TCP 6d
pravega-pravega-controller LoadBalancer 10.100.200.11 10.240.124.15... 10080:31391/TCP,9090:30301/TCP 6d
pravega-pravega-segmentstore-0 LoadBalancer 10.100.200.59 10.240.124.15... 12345:30597/TCP 6d
pravega-pravega-segmentstore-1 LoadBalancer 10.100.200.42 10.240.124.15... 12345:30840/TCP 6d
pravega-pravega-segmentstore-2 LoadBalancer 10.100.200.83 10.240.124.15... 12345:31170/TCP 6d
pravega-pravega-segmentstore-headless ClusterIP None <none> 12345/TCP 6d
pravega-zk-client ClusterIP 10.100.200.120 <none> 2181/TCP 6d
pravega-zk-headless ClusterIP None <none> 2888/TCP,3888/TCP 6d
This problem has been resolved in kubectl client version v1.11.0 and later.
# kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.200.1 <none> 443/TCP 6d20h
pravega-bookie-headless ClusterIP None <none> 3181/TCP 6d3h
pravega-pravega-controller LoadBalancer 10.100.200.11 10.240.124.155,100.64.112.185 10080:31391/TCP,9090:30301/TCP 6d3h
pravega-pravega-segmentstore-0 LoadBalancer 10.100.200.59 10.240.124.156,100.64.112.185 12345:30597/TCP 6d3h
pravega-pravega-segmentstore-1 LoadBalancer 10.100.200.42 10.240.124.157,100.64.112.185 12345:30840/TCP 6d3h
pravega-pravega-segmentstore-2 LoadBalancer 10.100.200.83 10.240.124.158,100.64.112.185 12345:31170/TCP 6d3h
pravega-pravega-segmentstore-headless ClusterIP None <none> 12345/TCP 6d3h
pravega-zk-client ClusterIP 10.100.200.120 <none> 2181/TCP 6d3h
pravega-zk-headless ClusterIP None <none> 2888/TCP,3888/TCP 6d3h
Also, when using kubectl client version v1.10.x or lower, the complete External-IP can still be viewed by running kubectl describe svc for the service in question.
# kubectl describe svc pravega-pravega-controller
Name: pravega-pravega-controller
Namespace: default
Labels: app=pravega-cluster
component=pravega-controller
pravega_cluster=pravega
Annotations: ncp/internal_ip_for_policy=100.64.161.119
Selector: app=pravega-cluster,component=pravega-controller,pravega_cluster=pravega
Type: LoadBalancer
IP: 10.100.200.34
LoadBalancer Ingress: 10.247.114.149, 100.64.161.119
Port: rest 10080/TCP
TargetPort: 10080/TCP
NodePort: rest 32097/TCP
Endpoints:
Port: grpc 9090/TCP
TargetPort: 9090/TCP
NodePort: grpc 32705/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
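Alternatively, the full list of ingress IPs can be read directly from the service status with a jsonpath output, which avoids the column truncation altogether, for example:
# Print every ingress IP assigned to the controller service
kubectl get svc pravega-pravega-controller -o jsonpath='{.status.loadBalancer.ingress[*].ip}'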
Users may find that Pravega logs for old pods are missing after an upgrade. This is because the operator uses the Kubernetes rolling update strategy to upgrade pods one at a time. This strategy creates a new ReplicaSet for the update: it kills a pod in the old ReplicaSet and starts a pod in the new ReplicaSet to replace it. After the upgrade, the pods therefore belong to a new ReplicaSet, so the logs of the old pods can no longer be obtained using kubectl logs.
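To see the ReplicaSet switch described above, you can list the ReplicaSets and pods behind the controller Deployment; the label selector below is taken from the service selector shown earlier and may differ in your deployment:
# Old and new ReplicaSets created by the rolling update
kubectl get rs -l component=pravega-controller
# Only pods from the new ReplicaSet remain, so only their logs are available
kubectl get pods -l component=pravega-controller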