CSI support proposal #1661
Conversation
> It was [introduced with dynamic provisioning support][15] in 2016, predating CSI.
>
> In the `BackupItemAction` for PVCs, the associated PV will be queried and checked for the presence of `PersistentVolume.Spec.PersistentVolumeSource.CSI`. Volumes with any other `PersistentVolumeSource` set will use Velero's current VolumeSnapshotter plugin code path.
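As a rough illustration of that check, here is a minimal Go sketch of how a `BackupItemAction` might resolve a PVC's bound PV and test for the CSI source. The `isCSIVolume` helper name and the client wiring are illustrative assumptions (not from the proposal), and a recent client-go is assumed:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isCSIVolume is a hypothetical helper: it fetches the PV bound to the given
// PVC and reports whether that PV was provisioned by a CSI driver.
func isCSIVolume(client kubernetes.Interface, pvc *corev1.PersistentVolumeClaim) (bool, error) {
	if pvc.Spec.VolumeName == "" {
		// Unbound PVC: there is no PV to inspect yet.
		return false, nil
	}
	pv, err := client.CoreV1().PersistentVolumes().Get(context.TODO(), pvc.Spec.VolumeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// PVs with the CSI source set take the CSI snapshot path; any other
	// PersistentVolumeSource falls through to the existing VolumeSnapshotter
	// plugin code path.
	return pv.Spec.PersistentVolumeSource.CSI != nil, nil
}
```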
Is there a case where you'd want to use Velero's current VolumeSnapshotter instead of CSI even if CSI was used to provision the volume? For example, if the CSI driver supported provisioning but not snapshotting?
Hm, that's a good possibility. I know there are gRPC methods for finding out whether the driver supports snapshots, but I'd have to check whether we can query the driver from k8s.
Our in-tree VolumeSnapshotters won't work as-is for CSI volumes, even if they're on the right platform, due to how the GetVolumeID/SetVolumeID functions work (and maybe other things too, not sure). We could look at changing that if we did want to be able to support using them to snapshot CSI volumes, but right now it won't work.
Update: currently there is no way to query what a driver supports from k8s, so at the moment we can't really know whether CSI snapshotting is supported by querying something like the StorageClass.
To simplify things, I would say that if CSI volumes are in use but the driver doesn't support snapshotting, restic would be the best option.
My direct comments are all just typos/re-words.
Generally, though, this looks good. In the list of "changes needed in plugins", it'd be good to see a "To support aspect X of feature Y, we need to change Z" structure; right now you're including just "Z" without the context. IMO, this design doc reads pretty densely to someone not deeply familiar with Velero's implementation.
I read a bit about CSI and read this closely a few times. No input at the moment, lgtm.
Per feedback, we need a way of telling Velero not to create CSI snapshots. Maybe use the …
I know you brought up feature flags recently. I do think we should consider adding them, including one for the CSI work, as we're developing here. I imagine we're going to ship this in parts (e.g. first add support for creating CSI snapshots, then add support for restoring them, ...), and it would be nice to allow users to opt in until they're ready.
One other thing to think about: similar to existing code, we'll want to skip creating snapshots if restic is being used to back up a PVC. I'm not exactly sure how we'll do that given that the CSI code is in plugins; need to think about it.
I created #1759 for the feature flag work. It likely needs a bit more detail, but at least we have a place to discuss it now.
Force-pushed from 33cf33e to 7d3ab7b
@skriss @prydonius @carlisia I believe all the items in the checklist are now addressed. I've also done a bit of reorganization of the sections. Please take another look!
lgtm 👍
design/csi-snapshots.md (Outdated)
> To ensure this is the case, server code must be extended so that PVCs are marked in such a way that the CSI BackupItemAction plugin is not invoked for PVCs known to be backed up by restic.
>
> The restic backupper's `BackupPodVolumes` method will be extended to get PersistentVolumeClaims where relevant and add a `velero.io/restic-backed: true` label.
I see a couple of challenges with this approach: (1) `BackupPodVolumes` will actually run after the PVC has been backed up, since the PVC is backed up as an additional item for the pod prior to the restic backups running; and (2) the label value would need to carry the name or UID of the backup, so we know the PVC is being backed up with restic for this backup, not just any previous backup.
One alternative approach for consideration: you could have the `csi-pvc` backup item action list all pods in the namespace that the PVC is in, find the pod that is mounting the PVC, and check if it's annotated for restic backup. If we do something like that, it means we don't need to add a new label/annotation, which I think is desirable. A sketch of this approach follows.
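To make the alternative concrete, here is a hedged Go sketch (not from the proposal) of the pod-scanning check. The `backup.velero.io/backup-volumes` key is the annotation Velero's restic integration uses on pods; the helper name and wiring are made up for illustration:

```go
package example

import (
	"context"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resticHandlesPVC is a hypothetical helper: it reports whether some pod in
// the PVC's namespace mounts the PVC via a volume listed in its
// backup.velero.io/backup-volumes annotation, meaning restic (not CSI)
// will back up the data.
func resticHandlesPVC(client kubernetes.Interface, pvc *corev1.PersistentVolumeClaim) (bool, error) {
	pods, err := client.CoreV1().Pods(pvc.Namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		annotated := strings.Split(pod.Annotations["backup.velero.io/backup-volumes"], ",")
		for _, vol := range pod.Spec.Volumes {
			// Skip pod volumes that don't reference this PVC.
			if vol.PersistentVolumeClaim == nil || vol.PersistentVolumeClaim.ClaimName != pvc.Name {
				continue
			}
			// The annotation lists pod volume names, not PVC names.
			for _, name := range annotated {
				if strings.TrimSpace(name) == vol.Name {
					return true, nil
				}
			}
		}
	}
	return false, nil
}
```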
Longer-term, I wonder if we should pass a richer "context" object to our plugins that includes information like this (i.e., the `resticSnapshotTracker`). Right now, one of the drawbacks of implementing logic in plugins is that you don't get access to some of the internal data structures used by `pkg/backup` or `pkg/restore` to track what's happening. If we could expose those to plugins, things would be a lot easier.
Yeah, I really wanted to use the `resticSnapshotTracker`, but opted for the label since that could be used with the `AppliesTo` function, thereby not even invoking the `Execute` method for `csi-pvc` when a PVC should be skipped (see the sketch below). I was hoping to avoid too many extra API server calls just to determine a no-op, but that may be unavoidable for now.
I'm not sure I want to completely redo the plugin interface in this proposal, as I'd like CSI support to be a 1.x feature. However, I think providing some sort of context object would be very useful for plugins in the future.
Thanks for the call-out on the PVCs being backed up during the pod's process; I had forgotten that the PVCs were backed up immediately instead of being added to the PVC group's loop.
Your alternative sounds doable, I think.
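For reference, this is roughly what the dropped label-based filtering would have looked like; a minimal sketch assuming the plugin framework import path of the time (`github.com/heptio/velero/pkg/plugin/velero`) and the hypothetical `velero.io/restic-backed` label from the earlier draft:

```go
package example

import "github.com/heptio/velero/pkg/plugin/velero"

type pvcBackupItemAction struct{}

// AppliesTo restricts the action to PVCs without the hypothetical
// velero.io/restic-backed label, so Execute is never invoked for
// restic-backed claims. "!key" is standard label-selector syntax
// for "label absent".
func (a *pvcBackupItemAction) AppliesTo() (velero.ResourceSelector, error) {
	return velero.ResourceSelector{
		IncludedResources: []string{"persistentvolumeclaims"},
		LabelSelector:     "!velero.io/restic-backed",
	}, nil
}
```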
I'm going to hold off on updating this pending decisions made in kubernetes/kubernetes#81792 and kubernetes/enhancements#989.
> ## Alternatives Considered
>
> * Implementing similar logic in a Velero VolumeSnapshotter plugin was considered.
@betta1 Antony, is this summary of your experiments with making a VolumeSnapshotter plugin accurate? Is there anything I missed here that you encountered?
@nrb Yes, this is accurate and in line with our initial experiments with a CSI VolumeSnapshotter plugin. It's been a while, but I remember our biggest issue was that the VolumeSnapshotter interface didn't provide some of the metadata, like PVC/Namespace/StorageClass, that is required for CSI snapshots (see the interface sketch below). We preferred the approach you had taken with your CSI prototype utilizing backup and restore item actions, since with item actions we can get access to that metadata. BTW, I tested your prototype and filed an issue (https://github.com/vmware-tanzu/velero-plugin-for-csi/issues/1): backup worked OK, but I ran into some issues with restore.
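For context, this is approximately the `VolumeSnapshotter` plugin interface in question, reproduced from memory rather than verbatim. Note that the snapshot methods receive only volume IDs, zones, and tags, with no PVC, namespace, or StorageClass information:

```go
package example

import "k8s.io/apimachinery/pkg/runtime"

// VolumeSnapshotter (abridged, from memory): nothing here carries the PVC,
// namespace, or StorageClass metadata that CSI snapshot objects require.
type VolumeSnapshotter interface {
	Init(config map[string]string) error
	CreateSnapshot(volumeID, volumeAZ string, tags map[string]string) (snapshotID string, err error)
	CreateVolumeFromSnapshot(snapshotID, volumeType, volumeAZ string, iops *int64) (volumeID string, err error)
	DeleteSnapshot(snapshotID string) error
	GetVolumeID(pv runtime.Unstructured) (string, error)
	SetVolumeID(pv runtime.Unstructured, volumeID string) (runtime.Unstructured, error)
	GetVolumeInfo(volumeID, volumeAZ string) (volumeType string, iops *int64, err error)
}
```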
Possible issue to think through: should the RestoreItemAction for VolumeSnapshot/VolumeSnapshotContent perhaps validate that the snapshot is there before adding the CRDs to the cluster? If so, how? I'm not sure there's a generic way to query CSI for a snapshot's existence right now.
There's likely a way to do this via the gRPC API; otherwise, the CSI drivers couldn't work.
This is an "import" case where the underlying snapshot does not exist, right? Will the snapshot controller observe this behavior and set some phase/condition on the VS/VSC object?
Yeah, this would be a case where the snapshot doesn't exist. I'll review the snapshot controller code and see.
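One possible Kubernetes-side check, sketched under assumptions: read `status.readyToUse` on the `VolumeSnapshotContent` via the dynamic client (the field exists in the v1beta1 snapshot API mentioned later in this thread); a missing or false value would suggest the underlying snapshot isn't usable. The helper name and GVR wiring are illustrative, not an established Velero API:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// snapshotContentReady is a hypothetical helper: it reads a
// VolumeSnapshotContent and returns its status.readyToUse value, defaulting
// to false when the status has not been populated by the snapshot controller.
func snapshotContentReady(client dynamic.Interface, name string) (bool, error) {
	gvr := schema.GroupVersionResource{
		Group:    "snapshot.storage.k8s.io",
		Version:  "v1beta1", // this thread predates v1; adjust per cluster
		Resource: "volumesnapshotcontents",
	}
	// VolumeSnapshotContent is cluster-scoped, so no namespace is given.
	vsc, err := client.Resource(gvr).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	ready, found, err := unstructured.NestedBool(vsc.Object, "status", "readyToUse")
	if err != nil || !found {
		return false, err
	}
	return ready, nil
}
```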
Here is the recording of the meeting for our discussion of the Velero/CSI integration. The summary slides are at https://docs.google.com/presentation/d/1QG0gu5Y9xiF6rjkqFgsOMOWMjSkj6Il18dNWKWRjipo/edit?usp=sharing. The current plan is to get the proof-of-concept plugins up to date with the v1beta types, get the design document merged, then ensure that the Velero core code base is updated to reflect necessary changes behind the …
@carlisia @prydonius please try to review this again over the next few days, with an eye towards any questions/issues that need to be resolved before moving on to implementation. I'll do the same as well. I think we're just about there, but raise anything you see!
@skriss I think the only thing I have left in my notes to put in here is to mention the VSL question you had. I'll get that in tomorrow morning.
@skriss VSL and VolumeSnapshotClass comparison added; let me know if I should expand more.
LGTM! @nrb IMHO we've got enough clarity to move onto implementation. I'd just remove the "Draft" status from this before merging (we've decided to remove the "Status" line from design docs entirely since you originally authored this one).
Will let others finish reviewing, so not merging yet.
Let's get this broken out into individual issues/stories to work on next - happy to collaborate on that if you want.
LGTM. @carlisia @prydonius @ashish-amarnath let's set a deadline of this Friday for getting this reviewed - if there are no further comments by then, we'll merge and get on with it :)
Re-read this doc, looks great 👍
merging!
Remaining:
- … (the `Retain` deletion policy means it will not be cleaned up by default)