Issue 81: Support rolling upgrades #124

Merged: 32 commits, Apr 5, 2019
Changes from 26 commits
Commits
38b8ab5
Mark checkbox in README file
adrianmo Jan 30, 2019
920a271
Update API spec to unify version to use
adrianmo Jan 31, 2019
6cb1be6
WiP
adrianmo Jan 31, 2019
6711ced
Merge branch 'master' into issue-81-upgrade-feature
adrianmo Feb 6, 2019
64f76fc
Create upgrade framework
adrianmo Feb 7, 2019
9c7cf6b
upgrade framework updates
adrianmo Feb 13, 2019
ec02d90
Implement upgrade process for Controller and SegmentStore
adrianmo Feb 27, 2019
dddda57
BK upgrade logic
adrianmo Mar 4, 2019
ea006bd
Merge branch 'master' into issue-81-upgrade-feature
adrianmo Mar 7, 2019
bd8a2a7
BK upgrade
adrianmo Mar 12, 2019
f375f29
Merge branch 'master' into issue-81-upgrade-feature
adrianmo Mar 26, 2019
4870b6b
Fix merge issues
adrianmo Mar 26, 2019
c358063
Wip
adrianmo Mar 28, 2019
9160c46
Merge branch 'master' into issue-81-upgrade-feature
adrianmo Mar 28, 2019
31df3ae
WiP
adrianmo Mar 29, 2019
ded3d9a
Normlize cluster version
adrianmo Apr 1, 2019
6e3b930
Merge branch 'master' into issue-81-upgrade-feature
adrianmo Apr 1, 2019
b4772cd
Re-enable other e2e tests
adrianmo Apr 1, 2019
9648da5
Set default version if not set
adrianmo Apr 1, 2019
0cba0cb
dont upgrade if pods are not ready
adrianmo Apr 1, 2019
cadcf45
Add unit tests
adrianmo Apr 1, 2019
3febbf6
Update README with upgrade and scale information
adrianmo Apr 1, 2019
567dd40
Add labels to SS and CC
adrianmo Apr 1, 2019
66c596f
Remove commented lines
adrianmo Apr 2, 2019
991188b
Draft of upgrade guide
adrianmo Apr 2, 2019
e153f2b
Update documentation
adrianmo Apr 3, 2019
9770f91
Add upgrade scope section
adrianmo Apr 3, 2019
0d875c3
Make API change backwards compatible
adrianmo Apr 4, 2019
c098ad8
Update API clients
adrianmo Apr 4, 2019
f98a80a
Increase upgrade timeout
adrianmo Apr 4, 2019
67bc4ec
Merge create and recreate e2e test cases
adrianmo Apr 4, 2019
1593fa1
Changes on image tag
adrianmo Apr 5, 2019
17 changes: 17 additions & 0 deletions Gopkg.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions Gopkg.toml
@@ -63,3 +63,7 @@ required = [
[[prune.project]]
name = "k8s.io/code-generator"
non-go = false

[[constraint]]
name = "github.com/hashicorp/go-version"
version = "1.1.0"
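The new `github.com/hashicorp/go-version` constraint suggests the operator compares Pravega versions when deciding whether an upgrade is needed. This excerpt does not show how the library is called, so the following is only a minimal standard-library sketch of the underlying idea. The helpers `parseVersion` and `compareVersions` are hypothetical names and are not taken from the PR or from go-version.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a hypothetical helper that turns "MAJOR.MINOR.PATCH" into
// three integers. Missing components default to 0, so "0.4" equals "0.4.0".
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	parts := strings.Split(v, ".")
	if len(parts) > 3 {
		return out, fmt.Errorf("malformed version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("malformed version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to, or
// newer than b. Comparing numerically avoids the string-ordering trap
// where "0.10.0" would sort before "0.9.0".
func compareVersions(a, b string) (int, error) {
	va, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	vb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := range va {
		switch {
		case va[i] < vb[i]:
			return -1, nil
		case va[i] > vb[i]:
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	for _, target := range []string{"0.4.0", "0.5.0", "0.3.2"} {
		c, err := compareVersions("0.4.0", target)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("0.4.0 vs %s: %d\n", target, c)
	}
}
```

A real implementation would more likely delegate to the pinned library, which also handles pre-release and metadata suffixes that this sketch rejects.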
212 changes: 84 additions & 128 deletions README.md
@@ -13,22 +13,23 @@ The project is currently alpha. While no breaking API changes are currently plan
* [Usage](#usage)
* [Installation of the Operator](#install-the-operator)
* [Deploy a sample Pravega Cluster](#deploy-a-sample-pravega-cluster)
* [Scale a Pravega Cluster](#scale-a-pravega-cluster)
* [Upgrade a Pravega Cluster](#upgrade-a-pravega-cluster)
* [Uninstall the Pravega Cluster](#uninstall-the-pravega-cluster)
* [Uninstall the Operator](#uninstall-the-operator)
* [Configuration](#configuration)
* [Use non-default service accounts](#use-non-default-service-accounts)
* [Installing on a Custom Namespace with RBAC enabled](#installing-on-a-custom-namespace-with-rbac-enabled)
* [Tier 2: Google Filestore Storage](#use-google-filestore-storage-as-tier-2)
* [Tune Pravega Configurations](#tune-pravega-configuration)
* [Tune Pravega Configuration](#tune-pravega-configuration)
* [Enable external access](#enable-external-access)
* [Development](#development)
* [Build the Operator Image](#build-the-operator-image)
* [Installation on GKE](#installation-on-google-kubernetes-engine)
* [Direct Access to Cluster](#direct-access-to-the-cluster)
* [Run the Operator Locally](#run-the-operator-locally)
* [Run the Operator locally](#run-the-operator-locally)
* [Releases](#releases)
* [Troubleshooting](#troubleshooting)
* [Helm Error: no available release name found](#helm-error-no-available-release-name-found)
* [NFS volume mount failure: wrong fs type](#nfs-volume-mount-failure-wrong-fs-type)

## Overview

[Pravega](http://pravega.io) is an open source distributed storage service implementing Streams. It offers Stream as the main primitive for the foundation of reliable storage systems: *a high-performance, durable, elastic, and unlimited append-only byte stream with strict ordering and consistency*.
@@ -37,7 +38,7 @@ The Pravega Operator manages Pravega clusters deployed to Kubernetes and automat

- [x] Create and destroy a Pravega cluster
- [x] Resize cluster
- [ ] Rolling upgrades
- [x] Rolling upgrades (experimental)

> Note that unchecked features are in the roadmap but not available yet.

@@ -117,51 +118,20 @@ Use the following YAML template to install a small development Pravega Cluster (
apiVersion: "pravega.pravega.io/v1alpha1"
kind: "PravegaCluster"
metadata:
  name: "pravega"
  name: "example"
spec:
  version: 0.4.0
  zookeeperUri: [ZOOKEEPER_HOST]:2181

  bookkeeper:
    image:
      repository: pravega/bookkeeper
      tag: 0.4.0
      pullPolicy: IfNotPresent

    replicas: 3

    storage:
      ledgerVolumeClaimTemplate:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "standard"
        resources:
          requests:
            storage: 10Gi

      journalVolumeClaimTemplate:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "standard"
        resources:
          requests:
            storage: 10Gi

    imageRepository: pravega/bookkeeper
    autoRecovery: true

  pravega:
    controllerReplicas: 1
    segmentStoreReplicas: 3

    cacheVolumeClaimTemplate:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "standard"
      resources:
        requests:
          storage: 20Gi

    image:
      repository: pravega/pravega
      tag: 0.4.0
      pullPolicy: IfNotPresent

    imageRepository: pravega/pravega
    tier2:
      filesystem:
        persistentVolumeClaim:
@@ -178,60 +148,73 @@ Deploy the Pravega cluster.
$ kubectl create -f pravega.yaml
```

Verify that the cluster instances and its components are running.
Verify that the cluster instance and its components are being created.

```
$ kubectl get PravegaCluster
NAME AGE
pravega 27s
NAME      VERSION   DESIRED MEMBERS   READY MEMBERS   AGE
example   0.4.0     7                 0               25s
```

After a couple of minutes, all cluster members should become ready.

```
$ kubectl get PravegaCluster
NAME      VERSION   DESIRED MEMBERS   READY MEMBERS   AGE
example   0.4.0     7                 7               2m
```
$ kubectl get all -l pravega_cluster=pravega
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/pravega-pravega-controller 1 1 1 1 1m

NAME DESIRED CURRENT READY AGE
rs/pravega-pravega-controller-7489c9776d 1 1 1 1m
```
$ kubectl get all -l pravega_cluster=example
NAME READY STATUS RESTARTS AGE
pod/example-bookie-0 1/1 Running 0 2m
pod/example-bookie-1 1/1 Running 0 2m
pod/example-bookie-2 1/1 Running 0 2m
pod/example-pravega-controller-64ff87fc49-kqp9k 1/1 Running 0 2m
pod/example-pravega-segmentstore-0 1/1 Running 0 2m
pod/example-pravega-segmentstore-1 1/1 Running 0 1m
pod/example-pravega-segmentstore-2 1/1 Running 0 30s

NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/pravega-pravega-controller 1 1 1 1 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/example-bookie-headless ClusterIP None <none> 3181/TCP 2m
service/example-pravega-controller ClusterIP 10.23.244.3 <none> 10080/TCP,9090/TCP 2m
service/example-pravega-segmentstore-headless ClusterIP None <none> 12345/TCP 2m

NAME DESIRED CURRENT READY AGE
rs/pravega-pravega-controller-7489c9776d 1 1 1 1m
NAME DESIRED CURRENT READY AGE
replicaset.apps/example-pravega-controller-64ff87fc49 1 1 1 2m

NAME DESIRED CURRENT AGE
statefulsets/pravega-bookie 3 3 1m
statefulsets/pravega-segmentstore 3 3 1m
NAME DESIRED CURRENT AGE
statefulset.apps/example-bookie 3 3 2m
statefulset.apps/example-pravega-segmentstore 3 3 2m
```

NAME READY STATUS RESTARTS AGE
po/pravega-bookie-0 1/1 Running 0 1m
po/pravega-bookie-1 1/1 Running 0 1m
po/pravega-bookie-2 1/1 Running 0 1m
po/pravega-pravega-controller-7489c9776d-lcw9x 1/1 Running 0 1m
po/pravega-segmentstore-0 1/1 Running 0 1m
po/pravega-segmentstore-1 1/1 Running 0 1m
po/pravega-segmentstore-2 1/1 Running 0 1m
By default, a `PravegaCluster` instance is only accessible within the cluster through the Controller `ClusterIP` service. From within the Kubernetes cluster, a client can connect to Pravega at:

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/pravega-bookie-headless ClusterIP None <none> 3181/TCP 1m
svc/pravega-pravega-controller ClusterIP 10.3.255.239 <none> 10080/TCP,9090/TCP 1m
```
tcp://<pravega-name>-pravega-controller.<namespace>:9090
```

A `PravegaCluster` instance is only accessible WITHIN the cluster (i.e. no outside access is allowed) using the following endpoint in
the PravegaClient.
And the `REST` management interface is available at:

```
tcp://<cluster-name>-pravega-controller.<namespace>:9090
http://<pravega-name>-pravega-controller.<namespace>:10080/
```

The `REST` management interface is available at:
See [Enable external access](#enable-external-access) to expose a Pravega cluster to clients outside Kubernetes.
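
The two in-cluster endpoints follow the same host pattern, so a client can derive both from the cluster name and namespace. The sketch below illustrates this; `controllerEndpoints` is a hypothetical helper written for this example, not part of the operator or the Pravega client.

```go
package main

import "fmt"

// controllerEndpoints builds the in-cluster URIs described in the README:
// <pravega-name>-pravega-controller.<namespace> on port 9090 for clients
// and on port 10080 for the REST management interface.
func controllerEndpoints(name, namespace string) (clientURI, restURI string) {
	host := fmt.Sprintf("%s-pravega-controller.%s", name, namespace)
	return fmt.Sprintf("tcp://%s:9090", host), fmt.Sprintf("http://%s:10080/", host)
}

func main() {
	// "example" and "default" match the sample cluster used in this README.
	clientURI, restURI := controllerEndpoints("example", "default")
	fmt.Println(clientURI)
	fmt.Println(restURI)
}
```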

### Scale a Pravega cluster

You can scale Pravega components independently by modifying their corresponding field in the Pravega resource spec. You can either `kubectl edit` the cluster or `kubectl patch` it. If you edit it, update the number of replicas for BookKeeper, Controller, and/or Segment Store and save the updated spec.

Example of patching the Pravega resource to scale the Segment Store instances to 4.

```
http://<cluster-name>-pravega-controller.<namespace>:10080/
kubectl patch PravegaCluster example --type='json' -p='[{"op": "replace", "path": "/spec/pravega/segmentStoreReplicas", "value": 4}]'
```
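
The `-p` argument in the patch command is a JSON Patch document (RFC 6902). If you generate it programmatically instead of typing it by hand, a minimal Go sketch could look like the following. The `patchOp` type and `segmentStoreScalePatch` function are hypothetical names for illustration, and rejecting negative replica counts is an assumption of this sketch rather than documented operator behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp models a single JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// segmentStoreScalePatch returns a patch body that replaces
// /spec/pravega/segmentStoreReplicas with the requested count.
func segmentStoreScalePatch(replicas int) ([]byte, error) {
	if replicas < 0 {
		return nil, fmt.Errorf("replicas must not be negative, got %d", replicas)
	}
	ops := []patchOp{{
		Op:    "replace",
		Path:  "/spec/pravega/segmentStoreReplicas",
		Value: replicas,
	}}
	return json.Marshal(ops)
}

func main() {
	body, err := segmentStoreScalePatch(4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The printed string can be passed to: kubectl patch PravegaCluster example --type='json' -p='...'
	fmt.Println(string(body))
}
```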

[Check this](#direct-access-to-the-cluster) to enable direct access to the cluster for development purposes.
### Upgrade a Pravega cluster

Check out the [upgrade guide](doc/pravega-upgrade-guide.md).

### Uninstall the Pravega cluster

@@ -425,6 +408,32 @@ spec:
...
```

### Enable external access

By default, a Pravega cluster uses `ClusterIP` services which are only accessible from within Kubernetes. However, when creating the Pravega cluster resource, you can opt to enable external access.

In Pravega, clients initiate the communication with the Pravega Controller, which is a stateless component fronted by a Kubernetes service that load-balances requests across the backend pods. Clients then discover the individual Segment Store instances, which they read from and write to directly. Clients therefore need to be able to reach every Segment Store pod in the Pravega cluster.

If your Pravega cluster needs to be consumed by clients from outside Kubernetes (or from another Kubernetes deployment), you can enable external access in two ways, depending on your environment constraints and requirements. Both ways will create one service for all Controllers, and one service for each Segment Store pod.

1. Via [`LoadBalancer`](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) service type.
2. Via [`NodePort`](https://kubernetes.io/docs/concepts/services-networking/service/#nodeport) service type.

You can read more about them in the [Kubernetes documentation](https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types) to understand which one fits your use case.

Example of configuration for using `LoadBalancer` service types:

```yaml
...
spec:
  externalAccess:
    enabled: true
    type: LoadBalancer
...
```

Clients will need to connect to the external Controller address and will automatically discover the external address of all Segment Store pods.

## Development

### Build the operator image
@@ -485,13 +494,6 @@ On GKE, the following command must be run before installing the Operator, replac
$ kubectl create clusterrolebinding your-user-cluster-admin-binding --clusterrole=cluster-admin --user=your.google.cloud.email@example.org
```

### Direct access to the cluster

For debugging and development you might want to access the Pravega cluster directly. For example, if you created the cluster with name `pravega` in the `default` namespace you can forward ports of the Pravega controller pod with name `pravega-pravega-controller-68657d67cd-w5x8b` as follows:

```
$ kubectl port-forward -n default pravega-pravega-controller-68657d67cd-w5x8b 9090:9090 10080:10080
```
## Run the Operator locally

You can run the Operator locally to help with development, testing, and debugging tasks.
@@ -501,57 +503,11 @@ The following command will run the Operator locally with the default Kubernetes
```
$ operator-sdk up local
```

## Releases

The latest Pravega releases can be found on the [GitHub Releases](https://github.com/pravega/pravega-operator/releases) project page.

## Troubleshooting

### Helm Error: no available release name found

When installing a cluster for the first time using `kubeadm`, the initialization defaults to setting up RBAC controlled access, which messes with permissions needed by Tiller to do installations, scan for installed components, and so on. `helm init` works without issue, but `helm list`, `helm install` and other commands do not work.

```
$ helm install stable/nfs-server-provisioner
Error: no available release name found
```
The following workaround can be applied to resolve the issue:

1. Create a service account for the Tiller.
```
kubectl create serviceaccount --namespace kube-system tiller
```
2. Bind that service account to the `cluster-admin` ClusterRole.
```
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
```
3. Add the service account to the Tiller deployment.

```
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
```
The above commands should resolve the errors and `helm install` should work correctly.

### NFS volume mount failure: wrong fs type

If you experience `wrong fs type` issues when pods are trying to mount NFS volumes like in the `kubectl describe po/pravega-segmentstore-0` snippet below, make sure that all Kubernetes node have the `nfs-common` system package installed. You can just try to run the `mount.nfs` command to make sure NFS support is installed in your system.

In PKS, make sure to use [`v1.2.3`](https://docs.pivotal.io/runtimes/pks/1-2/release-notes.html#v1.2.3) or newer. Older versions of PKS won't have NFS support installed in Kubernetes nodes.

```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 10m (x222 over 10h) kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468 Unable to mount volumes for pod "nautilus-segmentstore-0_nautilus-pravega(79167f33-f73b-11e8-936a-005056aeca39)": timeout expired waiting for volumes to attach or mount for pod "nautilus-pravega"/"nautilus-segmentstore-0". list of unmounted volumes=[tier2]. list of unattached volumes=[cache tier2 pravega-segment-store-token-fvxql]
Warning FailedMount <invalid> (x343 over 10h) kubelet, 53931b0d-18f4-49fd-a105-49b1fea3f468 (combined from similar events): MountVolume.SetUp failed for volume "pvc-6fa77d63-f73b-11e8-936a-005056aeca39" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 --scope -- mount -t nfs -o vers=4.1 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39 /var/lib/kubelet/pods/79167f33-f73b-11e8-936a-005056aeca39/volumes/kubernetes.io~nfs/pvc-6fa77d63-f73b-11e8-936a-005056aeca39
Output: Running scope as unit run-rc77b988cdec041f6aa91c8ddd8455587.scope.
mount: wrong fs type, bad option, bad superblock on 10.100.200.247:/export/pvc-6fa77d63-f73b-11e8-936a-005056aeca39,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)

In some cases useful info is found in syslog - try
dmesg | tail or so.
```
Check out the [troubleshooting document](doc/troubleshooting.md).
8 changes: 8 additions & 0 deletions deploy/crd.yaml
@@ -10,6 +10,14 @@ spec:
    plural: pravegaclusters
    singular: pravegacluster
  additionalPrinterColumns:
  - name: Version
    type: string
    description: The current pravega version
    JSONPath: .status.currentVersion
  - name: Desired Version
    type: string
    description: The desired pravega version
    JSONPath: .status.TargetVersion
  - name: Desired Members
    type: integer
    description: The number of desired pravega members