Merge pull request #961 from gnufied/update-vsphere-driver-enhancement

Update KEP for vsphere CSI driver operator

openshift-merge-robot authored Dec 15, 2021
2 parents 90dafe7 + abb2ecd, commit 2913529

Showing 1 changed file with 57 additions and 18 deletions: enhancements/storage/csi-vsphere-operator.md

## Proposal

We propose that OpenShift ship with the vmware-vsphere-csi-driver-operator by default, managed by [cluster-storage-operator](https://github.com/openshift/cluster-storage-operator/).

## Design Details

The vSphere CSI driver has a few dependencies on the installer:

### Installer dependency

To make the datastore selected at install time usable through the CSI driver, the vSphere CSI operator is going to create a storagePolicy by tagging the selected datastore. This will require OCP to have expanded permissions to create storagePolicies. After creating the storagePolicy, the vSphere CSI operator will also create a corresponding StorageClass.
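For illustration, the generated StorageClass could look like the sketch below. The `csi.vsphere.vmware.com` provisioner and the `storagepolicyname` parameter come from the upstream vSphere CSI driver; the StorageClass name and policy name are hypothetical:

```yaml
# Hedged sketch of a StorageClass referencing the storage policy that the
# operator would create from the tagged, installer-selected datastore.
# "thin-csi" and the policy name are illustrative placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin-csi
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "openshift-storage-policy-example"
reclaimPolicy: Delete
```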


##### Hardware and vCenter version handling

When the vSphere CSI operator starts, using credentials provided by cluster-storage-operator, it will first verify the vCenter version and the HW version of the VMs. If the vCenter version is not >= 6.7u3, or the HW version is not 15 or greater on all VMs, and this is a fresh install (i.e. there is no `CSIDriver` object already installed by this operator), it will stop installation of the vSphere CSI driver and periodically retry with exponential backoff.

Since vsphere-problem-detector also runs similar checks in 4.10, our plan is to move those checks into the CSI driver operator and, if they fail, mark the cluster as un-upgradeable. In 4.11 we are considering removing vsphere-problem-detector altogether and bringing all checks into the CSI driver operator.

However, if additional VMs are added to the cluster later and they do not have HW version 15 or greater, the operator will mark itself as `degraded`. The vSphere CSI driver operator will use the presence of a `CSIDriver` object with an openshift annotation as an indication that the driver was installed previously.
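For example, the marker object could look like the following sketch. The `CSIDriver` name and the `attachRequired`/`podInfoOnMount` settings match the upstream driver, but the exact annotation key is an assumption here, since the text only calls for "an openshift annotation":

```yaml
# Sketch of the CSIDriver object whose presence (with an OpenShift annotation)
# tells the operator that it, rather than an externally installed driver,
# performed the installation. The annotation key below is hypothetical.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.vsphere.vmware.com
  annotations:
    csi.openshift.io/managed: "true"  # illustrative key, not the final name
spec:
  attachRequired: true
  podInfoOnMount: false
```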

###### Error handling during creation of storage policy and storage class

If the CSI operator cannot create the storagePolicy or StorageClass for some reason, the cluster will be marked degraded or un-upgradeable, as described below.

The cluster will be degraded when:
- We can't talk to the Kube API server and, for some reason, can't read a secret, can't create the StorageClass, etc.
- Any error that would otherwise only mark the cluster as "un-upgradeable" occurs after we previously installed the CSI driver (detected via the presence of the `CSIDriver` object); such errors are escalated to *degrade* the cluster.

The cluster will be marked as un-upgradeable when:
- There are any vCenter-related errors (connection refused, 503 errors, or permission denied) or any of the checks fail (HW version, ESXi version checks).

In case the operator is marked as un-upgradeable for some reason, detailed information will be added to the `ClusterCSIDriver` object and an appropriate metric will be emitted. It should be noted that even though the cluster is marked as un-upgradeable, it is still fully available: `Available=True` will be set on all cluster objects, and the cluster can still be upgraded between z-stream versions, but upgrades to minor versions (such as `4.11`) will be blocked.
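As a rough sketch of the reporting described above, the `ClusterCSIDriver` status might carry conditions along these lines when a check fails; the condition types, reason, and message here are assumptions, not the final API:

```yaml
# Hypothetical ClusterCSIDriver status fragment: the cluster stays available,
# z-stream upgrades still work, but minor-version upgrades are blocked.
status:
  conditions:
  - type: Available
    status: "True"
  - type: Upgradeable           # illustrative condition type
    status: "False"
    reason: VSphereChecksFailed # illustrative reason
    message: "vCenter version does not meet the required minimum of 6.7u3"
  - type: Degraded
    status: "False"
```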

##### Presence of existing drivers in the cluster being upgraded

A customer may have installed the vSphere driver from external sources. In 4.10 we will install the CSI driver operator but will not proceed with the driver install if an existing install of the CSI driver is detected. The mechanism for detecting an existing upstream driver will be:

1. Check if there is a `CSIDriver` object of the vSphere CSI driver type.
2. Check whether there are one or more `CSINode` objects that list the vSphere CSI driver (see the sketch below).
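As an illustration of what step 2 inspects, a `CSINode` object that lists the vSphere CSI driver is sketched below; the node name and `nodeID` are made up, while the driver name `csi.vsphere.vmware.com` is the driver's real registered name:

```yaml
# Sketch: a CSINode whose spec.drivers list contains the vSphere CSI driver,
# indicating the driver's node plugin is (or was) registered on this node.
# "worker-0" is an illustrative node name.
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: worker-0
spec:
  drivers:
  - name: csi.vsphere.vmware.com
    nodeID: worker-0
    topologyKeys: []
```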

If we detect an existing driver present in the cluster, the cluster will be marked as un-upgradeable. The following steps must be performed to migrate to the Red Hat version of the driver:

1. The cluster admin deletes the CSI driver Deployment object.
2. The cluster admin deletes the CSI driver DaemonSet objects.
3. As a last step, the cluster admin deletes the `CSIDriver` object and any existing ConfigMap and Secret objects that were previously installed for the functioning of the CSI driver.

We will ensure that these steps are documented in 4.10 docs.

#### Disabling the operator

Currently, disabling the operator is an unsupported feature.

### API Extensions

The operator will use https://github.com/openshift/api/blob/master/operator/v1/types_csi_cluster_driver.go for operator configuration and management.
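A minimal configuration sketch using that type is shown below, assuming the standard OperatorSpec fields the type embeds (`managementState`, `logLevel`); the object is named after the driver it manages:

```yaml
# Sketch of the ClusterCSIDriver object used to configure and manage the
# vSphere CSI driver operator.
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  managementState: Managed
  logLevel: Normal
```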

### Operational Aspects of API Extensions

The `ClusterCSIDriver` type is used for operator installation and reporting any errors. Please see [CSI driver installation](csi-driver-install.md) for
more details.

### Test Plan

The operator will be tested via the CSI driver e2e tests. We don't expect the operator itself to have any e2e tests, but it will have unit tests that validate specific behavior.

### User Stories

#### Story 1

* We don't yet know the state of the vSphere CSI driver. We need to start running e2e tests with the vSphere driver as early as possible, so that we can determine how stable it is.
* We have some concerns about the datastore no longer being supported in the StorageClass. This means that in the future, when the in-tree driver is removed, clusters without a storagePolicy will become un-upgradeable.

#### Failure Modes

If the operator is unable to install the CSI driver or is failing for some reason, an appropriate condition will be added to the `ClusterCSIDriver` object.

#### Support Procedures

Issues will be diagnosed via must-gather logs and the metrics emitted by the operator.

### Graduation Criteria

#### Dev Preview -> Tech Preview

There is no dev-preview phase.

#### Tech Preview
