Add AWS EFS CSI driver operator #687
Conversation
And how to update from the community operator.
This is cool. It is a bit divergent from what we've discussed previously -- but in a direction I like :)
* It will add common features of CSI driver operators which are missing in the current community operator, such as proxy injection and restarting the drivers when cloud credentials change.

2. Update (rewrite) current aws-efs-operator to watch `ClusterCSIDriver` CR named `efs.csi.aws.com`.
Will aws-efs-operator ever write (status, spec, labels, annotations, ...) to the CR? (Concerns about one CRD having multiple "owners".)
Yes, it will write status there. The CRD is watched by several operators, but each operator watches its own instance with explicitly defined name and does not touch other CRs.
https://github.com/openshift/api/blob/4b79815405ec40f1d72c3a74bae0ae7da543e435/operator/v1/0000_90_cluster_csi_driver_01_config.crd.yaml#L39
Added a note about this.
5. Ship [efs-utils](https://github.com/aws/efs-utils) as a base image for the EFS CSI driver image.

6. Update installer to delete Access Points and EFS volumes that have `owner` tag when destroying a cluster.
This feels wrong. We've talked about the barriers (networking, permissions, etc.) that would make this difficult for the operator to manage. IMO for those reasons -- and for symmetry -- all provisioning/deprovisioning on the AWS side should be outside of the purview of the operator.
(To be clear, I agree CI must have the ability to provision and deprovision -- more on this below -- but I don't think this tagging approach is the way to go.)
The problem is that it's not possible to destroy a cluster if it has an EFS volume. The installer will get stuck destroying networks, because AWS does not allow deleting networks that are used by EFS volumes. Since it's the cluster admin who must create the EFS share, it's up to them either to add the `owner` tag (for automatic deletion on cluster teardown) or to preserve it and deal with the consequences.
Added a more detailed note about installer behavior and the opt-in nature of the tag.
I think there's still a disconnect here.
I'm not on board with the operator trying to talk to AWS. This will lead to confusion and complexity for negligible benefit.
It still requires the consumer (human or CI) to properly configure AWS access credentials. If they don't, the volume removal will fail, and they won't necessarily find out about it except in the case of a blocked cluster uninstall -- and even then, the cause may not be obvious.
It increases your test matrix substantially.
And it's asymmetrical. If the operator can handle deletion, it can easily handle provisioning as well. And that would make a lot more sense to the user.
I don't understand the objection to the CI test case/harness cleaning up explicitly.
After installation of the new operators, either as a fresh install or upgrading the existing one, users must create the CR when the operator is installed on a new cluster. This is something that was not necessary in the previous releases of
Need to see proposed changes to the CRD, if any, and an example of what a CR will look like to (de)activate the EFS CSI driver. (Not sure if that level of detail is appropriate for this document, but you included dockerfile snippets so.... :) )
Here's the CR the cluster-storage-operator creates for EBS: https://github.com/openshift/cluster-storage-operator/blob/master/assets/csidriveroperators/aws-ebs/10_cr.yaml
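For illustration, a sketch of what the EFS counterpart might look like, assuming it follows the same shape as the EBS CR linked above; only the name `efs.csi.aws.com` is fixed by the proposal, the rest is an assumption:

```yaml
# Hypothetical ClusterCSIDriver CR for EFS, modeled on the linked EBS example.
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com   # must match the instance name the operator watches
spec:
  managementState: Managed   # same management-state knob the EBS CR uses
  logLevel: Normal
  operatorLogLevel: Normal
```

Per the discussion below, deleting this CR would remove the operand (the driver itself).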
Added link to EBS CR.
Library-go does not support un-installation of the CSI driver (and static resources created by `StaticResourcesController`); this must be implemented there.
What happens to existing PV, PVC, and SharedVolume resources when the driver is uninstalled? Do they have finalizers that block the deletion until they are removed manually?
(Also it is worth noting that "uninstall" could happen one of two ways: CR edit as assumed here, or uninstall of the operator via OLM. For the latter case, operator-framework/enhancements#46 will be of interest.)
Added a note about what happens when the operator is removed via OLM (driver works, but is not managed; PVs are usable) and when the operand is removed via CR deletion (driver is removed; PVs are not usable).
### `SharedVolume` CRD

Since `SharedVolume` is trivial to handle (it just creates a PV + PVC), we leave handling of the CRD as an implementation detail. We may try to copy code from the community operator, but if it's hard to combine library-go and
> copy code from the community operator
I'm confused. The rest of this document makes it sound like this feature is accomplished by modifying the community operator rather than starting over somewhere else. However, I got the opposite impression in our previous talks. Which is it?
But to the technical point: I don't know anything about library-go, but aws-efs-operator's handling of SharedVolume requires nothing more complex than k8s primitives (CRUD and Watch). IOW if it's possible to port anything from controller-runtime to library-go, it should be possible to port SharedVolume handling.
Elsewhere in the document it is written that the operator will be re-written. Therefore, the SharedVolume code may be carried over. SharedVolume is super simple, so re-implementation may actually be easier than reusing the existing code.
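To illustrate just how little the CRD carries, here is a sketch of a `SharedVolume` CR; the API group matches the one quoted elsewhere in this document, but the exact spec field names are assumptions based on the "FS and AP ID" description later in this thread:

```yaml
# Hypothetical SharedVolume CR: two IDs in, one PV/PVC pair out.
apiVersion: aws-efs.managed.openshift.io/v1alpha1
kind: SharedVolume
metadata:
  name: my-shared-volume
  namespace: my-app
spec:
  fileSystemID: fs-0123cdef               # placeholder EFS file system ID
  accessPointID: fsap-0123456789abcdef0   # placeholder EFS access point ID
```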
there is no CR and does nothing. This means the old CSI driver is still running (installed by the "old" operator) and any apps that use it are still running.
5. User reads release notes, documentation or an alert and creates the CR.
6. The "new" operator "adopts" the CSI driver objects. Namely, it updates the DaemonSet of the CSI driver, which then does
Doesn't this entail moving several of these objects (the namespace-scoped ones) to a different namespace? If so, not sure there's really a requirement to preserve the same names.
I considered moving the driver to the `openshift-cluster-csi-drivers` namespace, where the AWS EBS CSI driver runs, especially if the operator was installed during cluster installation (having a single predictable namespace helps a lot). We moved to OLM, so using the OLM-native way of managing namespaces is better here. And it helps with the migration.
@bertinatto, what do you think? Are we painting ourselves into a corner again?
Added a note that we won't change the namespace.
A question about collecting logs with must-gather: does it make any difference whether the EFS CSI driver and operator are in different namespaces? If we keep the operator and driver in the old namespace, do we need to create a special image to collect logs for the EFS CSI driver and operator?
* Move the controller-runtime based operator to a separate branch (`release-4.7`, even though it contains all previous releases?)
Or `release-pre-4.8`, or `release-community`, or...
+1 for something other than `release-4.7`, since the operator wasn't part of OCP 4.7.
Using `release-community`.
### Open Questions

* How to graduate the community operator into a supported one?
Meaning, importantly, how to prevent someone from installing the community operator into 4.8? Yeah, that's a tough one. Perhaps the supported operator checks for that, refuses to function, and throws an alert?
Wouldn't it be enough if we just remove it from OperatorHub?
Checking OLM, it supports `minKubeVersion` in CSV files, so we may hide the new operator in 4.7 and earlier. Not sure if there is an easy way to hide the old operator in 4.8 and newer. Added as an extra open item.
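For example, a minimal sketch of such gating in the CSV; `minKubeVersion` is a real CSV field, but the CSV name and version values here are assumptions:

```yaml
# Hypothetical ClusterServiceVersion fragment gating the supported operator.
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: aws-efs-csi-driver-operator.v4.8.0   # assumed CSV name
spec:
  displayName: AWS EFS CSI Driver Operator
  minKubeVersion: 1.21.0   # assumed OCP 4.8 Kubernetes level; OLM refuses to install on older clusters
```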
> Wouldn't it be enough if we just remove it from OperatorHub?
That would prevent pre-4.8 clusters from installing it too, right?
3. Create an AWS EFS share (this may require extra privileges in CI, and we must figure out the networking!)
4. Set up dynamic provisioning in the driver (by creating a StorageClass). This assumes dynamic provisioning is implemented (released) upstream before we start testing.
I recommend not tying the test plan to dynamic provisioning -- or in fact any direct AWS interaction -- in the actual driver or operator. The test environment itself can be configured with appropriate access to the AWS side; and the test cases can do the provisioning/deprovisioning as separate steps from the operator installation and configuration.
We had to do plenty of this while developing the community operator. We never quite got to automating tests end to end, but I wrote a utility that does the AWS networking, security, and provisioning setup/teardown. You may be able to reuse pieces of it here.
As for permissions, I'm almost certain it is already possible to specify AWS creds in CI today. That's all you should need.
#### Removing a deprecated feature

TODO: shall we deprecate `SharedVolume`?
Probably. You convinced me that it carries a security risk that we can't count on users understanding.
...and if we're going to do this, doing it on the boundary of community=>supported would probably make the most sense. Happy to discuss strategy here.
I'd love to have the functionality provided by `SharedVolume` implemented in the CSI driver itself. Maybe we can focus on that rather than supporting this CR?
> I'd love to have the functionality provided by `SharedVolume` implemented in the CSI driver itself.

Can you explain this a bit further, please? Today all `SharedVolume` does is accept a FS and AP ID and create a PV/PVC pair for it. It's basically just a shorthand way to create dozens of lines worth of resource definition with two fields. And it's also an RBAC back door -- which was deliberate, but isn't worth the security risk in my (new and improved) opinion.
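For context, a sketch of the kind of statically provisioned PV such a pair boils down to, based on the upstream EFS CSI driver's static provisioning; the names and IDs are placeholders:

```yaml
# Sketch of the PV side of the PV/PVC pair a SharedVolume expands to.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-shared-volume-pv
spec:
  capacity:
    storage: 1Gi                # EFS is elastic; the value is nominal but required
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123cdef::fsap-0123456789abcdef0   # "<fs-id>::<access-point-id>"
```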
I meant the CSI driver should support dynamic provisioning, which has been implemented recently (as you pointed out above).
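For reference, upstream dynamic provisioning is driven by a StorageClass along these lines; the parameter values are illustrative and `fileSystemId` is a placeholder:

```yaml
# Sketch of a StorageClass for EFS dynamic provisioning via access points.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap    # one access point per dynamically provisioned volume
  fileSystemId: fs-0123cdef   # pre-created EFS file system
  directoryPerms: "700"
```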
### Code organization

* Reuse existing `github.com/openshift/aws-efs-operator` for the operator.
nit: add "repository" to not confuse it with the code
IMO we should rename it to `aws-efs-csi-driver-operator` to be consistent with the other repositories.
Consistent naming is the only requirement for creating new CSI driver/operator repositories (we'd get away with that because this repository already exists, but it'd be nice to make it consistent).
If it were a new repo, then `aws-efs-csi-driver-operator` would be great; now that it already exists, I think it's better to leave it as it is - who knows where it is referenced.
Actually, a fork from aws-efs-operator to aws-efs-csi-driver-operator could make a lot of sense. That way the old community operator doesn't need to undergo what's essentially a full rewrite, or deal with the branching issue; and the new operator can have its preferred name but still have the benefit of preserving whatever code may be useful.
Having looked at the library-go port WIP, which has nothing in common with the original, I don't see a reason to try to reuse that repository.
## Proposal

High level only, see design details below.

1. Rewrite current aws-efs-operator to use [CSI driver installation functions of library-go](https://github.com/openshift/library-go/tree/master/pkg/operator/csi/csicontrollerset).
Since you are already using controller-runtime, have you considered making this an operator-sdk project? We have integration with OLM like `operator-sdk run bundle`, which will help you test out your bundle with OLM, among other things.
There's also a supported operator-sdk offering beginning with OpenShift 4.7:
https://mirror.openshift.com/pub/openshift-v4/clients/operator-sdk/latest/
I tried to cover why in Design Details below. This is an almost complete CSI driver operator using library-go code: basically, configure a library and run it. It's so comfortable. We would need to re-implement it in operator-sdk, and then we would have two code bases that we need to maintain and sync.
I'll just point out that aws-efs-operator is written against operator-sdk (but 0.16.0, so no `run bundle`). Jan has been talking about porting to library-go for $reasons. To editorialize shamelessly, I feel like it's because it's what he's used to: osdk/controller-runtime is likewise "so comfortable" for me :)
* We want the community one working in 4.7 and older and the supported one in 4.8 and newer.
* It should be possible to ship new community updates in the 4.7 channel/branch to fix potential bugs there and/or make the transition as seamless as possible.
Will the community operator work in 4.8 as well? Or are you sunsetting the community operator? Will the community operator also migrate to library-go?
Ideally, I'd like OperatorHub to offer the community operator in OCP <= 4.7 and the supported one in OCP >= 4.8. Having them both could be confusing.
As written above, the code of the current operator will still be available in a branch and new community updates can be released from there, without any code change.
I also feel like this topic could use more discussion.
@jsafrane I'd like to understand more about the shift from controller-runtime to library-go. What were some of the deficiencies you found with controller-runtime? Have you looked into operator-sdk at all? (See previous comment for URLs.) We can speak offline about controller-runtime so as not to derail this EP.
@jmrodri @2uasimojo, thanks for your comments! I updated / rephrased some parts based on your feedback, available as a separate commit for easier tracking.
I looked at operator-sdk a long time ago; we even had our initial cluster-storage-operator written in it. But then we got bitten by an operator-sdk version bump, and it was the same effort to move to library-go or to a newer operator-sdk / controller-runtime. BTW, I need to check the operator-sdk command line to generate CSVs and other stuff. That's definitely helpful.
3. Keep existing functionality:
   * `SharedVolume` CRD in `aws-efs.managed.openshift.io/v1alpha1` API group keeps working as it is now.
   * Update from the community operator to the supported operator should be as seamless as possible.
> Update from the community operator to the supported operator

Does this mean uninstalling the community operator and reinstalling the supported operator? Do we support update from the supported operator to the community operator too?
Given the complexity of managing "upgrades" (or downgrades) and the low usage of the community operator, is it worth considering only providing documentation for such moves instead?
Currently, all cloud-based CSI drivers shipped by OCP are installed when a cluster is installed on the particular cloud. AWS EFS will not be installed by default. We will include a note about opt-in installation in docs.

After installation of the new operators, either as a fresh install or upgrading the existing one, users must create the
In one cluster, should we only support installing one EFS CSI driver operator, regardless of whether it's the community one or the supported one?
Note: When the *operator* is un-installed, nothing happens to the CRD, CR, the installed CSI driver, existing PVs and PVCs. Everything keeps working, just not managed by the operator. When the *operand* is un-installed by removing the operator CR, the driver will be removed. Existing PVs, PVCs and SharedVolumes will still be present in the API server; however, they won't be really usable. Nobody can mount / unmount EFS volumes without the driver.
I have a question here: if the *operator* is un-installed and we then delete the operator CR, the EFS CSI driver will not be removed? IIUC, the operator watches the CR and processes the driver install / un-install logic.
We agreed with the current aws-efs-operator maintainers to start from scratch, in a different repo.
This is looking solid to me. Thanks for the updates.
/lgtm
At this point we've implemented all the functionality and CI jobs are in the works (see this Jira card), so it makes sense to merge this now.
We can make adjustments as follow-ups.
**How to actually hide the community operator in 4.9 and newer operator catalogs?** We don't want to have two operators in the catalog there. We still want to offer the community operator in 4.8 and earlier though.

* Right now it seems there will be two operators available in OLM and it's not that bad.
Is it possible to tag the community operator as deprecated?
Refine the design after implementation. Update test plan and upgrade scenarios.
Add installation / removal user stories.
@jsafrane should we update the status of this enhancement and get it merged?
Thanks for poking. I marked it as implemented. @bertinatto, PTAL.
/lgtm
/assign @knobunc
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: knobunc.
The AWS EFS CSI driver (+ corresponding operator) is an optional OCP component. It should be installed through OLM when a user opts in.
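A minimal sketch of that opt-in as an OLM Subscription; the package, channel, and namespace names here are assumptions, not settled by this proposal:

```yaml
# Hypothetical Subscription a cluster admin would create to opt in.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: aws-efs-csi-driver-operator   # assumed package name
  namespace: openshift-operators      # assumed namespace
spec:
  channel: stable                     # assumed channel
  name: aws-efs-csi-driver-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```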
This document describes the existing (unsupported) solution and how to turn it into a supported one, both from the code (and build and shipment) perspective and from the user perspective.