Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI support proposal #1661

Merged
merged 20 commits into from
Dec 16, 2019
Merged

CSI support proposal #1661

merged 20 commits into from
Dec 16, 2019

Conversation

nrb
Copy link
Contributor

@nrb nrb commented Jul 12, 2019

Remaining:

  • disabling CSI snapshotting, similar to disabling Velero-native snapshots
  • skipping CSI snapshotting in the plugins if the volume is covered by restic
  • mention feature flags/gates
  • use ownerRefs
  • deleting the underlying snapshot when the backup expires (using a Retain deletion policy means it will not be cleaned up by default)

Signed-off-by: Nolan Brubaker brubakern@vmware.com

design/csi-snapshots.md Outdated Show resolved Hide resolved
@nrb nrb added the Area/Design Design Documents label Jul 16, 2019
@nrb nrb requested review from carlisia and prydonius July 17, 2019 17:45
design/csi-snapshots.md Outdated Show resolved Hide resolved
design/csi-snapshots.md Outdated Show resolved Hide resolved
It was [introduced with dynamic provisioning support][15] in 2016, predating CSI.

In the `BackupItemAction` for PVCs, the associated PV will be queried and checked for the presence of `PersistentVolume.Spec.PersistentVolumeSource.CSI`.
Volumes with any other `PersistentVolumeSource` set will use Velero's current VolumeSnapshotter plugin code path.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where you'd want to use Velero's current VolumeSnapshotter instead of CSI even if CSI was used to provision the volume? For example, if the CSI driver supported provisioning but not snapshotting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's a good possibility. I know there's gRPC methods for finding out if the driver supports snapshots, but I'd have to check to see if we can query the driver from k8s.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our in-tree VolumeSnapshotters won't work as-is for CSI volumes, even if they're on the right platform, due to how the GetVolumeID/SetVolumeID functions work (and maybe other things too, not sure). We could look at changing that if we did want to be able to support using them to snapshot CSI volumes, but right now it won't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: currently there is no way to query what a driver supports from k8s. So we can't really know if CSI snapshotting is supported via a query to something like the storageclass at the moment.

I would say to simplify things, if CSI volumes are in use but the driver doesn't have snapshotting, restic would be the best option.

Copy link

@timh timh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My direct comments are all just typos/re-words.
Generally, though, this looks good. In the list of "changes needed in plugins", though, it'd be good to see a "To support aspect X of feature Y, we need to change Z". You're just including "Z" without the context. IMO, this design doc reads pretty densely to someone not deeply familiar with Velero's implementation.

design/csi-snapshots.md Outdated Show resolved Hide resolved
design/csi-snapshots.md Outdated Show resolved Hide resolved
design/csi-snapshots.md Outdated Show resolved Hide resolved
design/csi-snapshots.md Outdated Show resolved Hide resolved
design/csi-snapshots.md Outdated Show resolved Hide resolved
Copy link
Contributor

@carlisia carlisia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read a bit about CSI and read this closely a few times. No input at the moment, lgtm.

@nrb
Copy link
Contributor Author

nrb commented Aug 8, 2019

Per feedback, we need a way of telling Velero not to create CSI snapshots. Maybe use the SnapshotVolumes field on the backup?

design/csi-snapshots.md Outdated Show resolved Hide resolved
@skriss
Copy link
Contributor

skriss commented Aug 12, 2019

I know you brought up feature flags recently - I do think we should consider adding them, and adding one for the CSI work, as we're developing here. I imagine we're going to ship this in parts (e.g. first add support for creating CSI snapshots, then add support for restoring them, ...) and it would be nice to allow users to opt-in until they're ready.

@skriss
Copy link
Contributor

skriss commented Aug 12, 2019

One other thing to think about - similar to existing code, we'll want to skip creating snapshots if restic is being used to back up a PVC - I'm not exactly sure how we'll do that given that the CSI code is in plugins. Need to think about it.

@nrb
Copy link
Contributor Author

nrb commented Aug 12, 2019

I created #1759 for the feature flag work. It likely needs a bit more detail, but at least we have a place to discuss it now.

@nrb nrb changed the title Initial CSI proposal CSI support proposal Aug 13, 2019
nrb added 13 commits August 19, 2019 12:53
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Aug 19, 2019

@skriss @prydonius @carlisia I believe all the items in the checklist are now addressed. I've also done a bit of reorganization of the sections. Please take another look!

Copy link
Contributor

@carlisia carlisia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍


To ensure this is the case, server code must be extended so that PVCs are marked in such a way that the CSI BackupItemAction plugin is not invoked for PVCs known to be backed up by restic.

The restic backupper's `BackupPodVolumes` method will be extended to get PersistentVolumeClaims where relevant and add a `velero.io/restic-backed: true` label.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a couple of challenges with this approach - (1) BackupPodVolumes will actually run after the PVC has been backed up, since the PVC is backed up as an additional item for the pod prior to the restic backups running; and (2) the label value would need to have the name or UID of the backup, so we know if the PVC is being backed up with restic for this backup, not just any previous backup.

One alternative approach for consideration - you could have the csi-pvc backup item action list all pods in the namespace that the PVC is in, find the pod that is mounting the PVC, and check if it's annotated for restic backup. If we do something like that, it means we don't need to add a new label/annotation which I think is desirable.

Copy link
Contributor

@skriss skriss Aug 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Longer-term, I wonder if we should pass a richer "context" object to our plugins that includes information like this (i.e. the resticSnapshotTracker). Right now one of the drawbacks of implementing logic in plugins is that you don't get access to some of those internal data structures used by pkg/backup or pkg/restore to track what's happening. If we could expose those to plugins, things would be a lot easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I really wanted to use the resticSnapshotTracker, but opted for the label since that could be used with the AppliesTo function, thereby not even invoking the Execute method for csi-pvc when a PVC should be skipped. Was hoping to avoid too many extra API server calls to determine a no-op, but it may be unavoidable for now.

I'm not sure I want to completely redo the plugin interface in this proposal, as I'd like for CSI support to be a 1.x feature. However, I think providing some sort of context object would be very useful for plugins in the future.

Thanks for the call out on the PVCs being backed up during the Pod's process; I had forgotten that the PVCs were backed up immediately instead of being added to the PVC group's loop.

Your alternative sounds do-able, I think.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Aug 27, 2019

I'm going to hold off on updating this pending decisions made in kubernetes/kubernetes#81792 and kubernetes/enhancements#989


## Alternatives Considered

* Implementing similar logic in a Velero VolumeSnapshotter plugin was considered.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betta1 Antony, is this summary of your experiments with making a VolumeSnapshotter plugin accurate? Is there anything I missed here that you encountered?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nrb yes this is accurate and is in line with our initial experiments with a CSI VolumeSnapshotter plugin -- it's been a while but i remember our biggest issue was that the VolumeSnapshotter interface didn't provide some of the metadata like PVC/NameSpace/StorageClass/ that are required for CSI snapshots. We preferred the approach you had taken with your CSI prototype utilizing Backup and restore Item actions since with item actions we can get access to these metadata. Btw I tested your prototype and had filed an issue (https://github.com/vmware-tanzu/velero-plugin-for-csi/issues/1) -- backup worked ok, ran into some issue with restore.

@nrb
Copy link
Contributor Author

nrb commented Dec 2, 2019

Possible issue to think through:

  • Take a backup of a namespace w/ snapshot on Monday - the VolumeSnapshot CRD, which is namespaced should be included, along with its associated VolumeSnapshotContent CRD
  • Take a backup of the same namespace w/ snapshot on Tuesday - the VolumeSnapshot and VolumeSnapshotContent CRDs for Tuesday and Monday should be included.
  • Delete the Monday backup - Velero should remove the VolumeSnapshot, VolumeSnapshotContent, and snapshot.
  • Delete the namespace.
  • Restore the Tuesday backup - this restores the Monday VolumeSnapshot & VolumeSnapshotContent objects, which are no longer valid as they point to a snapshot that was deleted.

Should the RestoreItemAction for VolumeSnapshot/VolumeSnapshotContent perhaps validate that the snapshot is there before adding the CRDs to the cluster? If so, how? I'm not sure if there's a generic way to query CSI for a snapshot's existence right now.

@nrb
Copy link
Contributor Author

nrb commented Dec 2, 2019

I'm not sure if there's a generic way to query CSI for a snapshot's existence right now.

There's likely a way to do this via the gRPC API, otherwise the CSI drivers couldn't work.

@skriss
Copy link
Contributor

skriss commented Dec 2, 2019

This is an "import" case where the underlying snapshot does not exist, right? Will the snapshot controller observe this behavior and set some phase/condition on the VS/VSC object?

@nrb
Copy link
Contributor Author

nrb commented Dec 2, 2019

Yeah, this would be a case where the snapshot doesn't exist. I'll review the snapshot controller code and see.

design/csi-snapshots.md Outdated Show resolved Hide resolved
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Dec 4, 2019

Here is the recording of the meeting for our discussion of the Velero/CSI integration.

The summary slides are at https://docs.google.com/presentation/d/1QG0gu5Y9xiF6rjkqFgsOMOWMjSkj6Il18dNWKWRjipo/edit?usp=sharing.

The plan currently is to get the proof of concept plugins up to date with the v1beta types, get the design document merged, then ensure that the Velero core code base is updated to reflect necessary changes behind the EnableCSI feature flag - these changes will be communicated in release notes and blog posts when it's released. We'll also be giving regular updates during our community meetings.

@skriss
Copy link
Contributor

skriss commented Dec 4, 2019

@carlisia @prydonius please try to review this again over the next few days, with an eye towards any questions/issues that need to be resolved before moving on to implementation. I'll do the same as well. I think we're just about there - but raise anything you see!

cc @ashish-amarnath

@nrb
Copy link
Contributor Author

nrb commented Dec 4, 2019

@skriss I think the only thing I have left in my notes to put in here is to mention the VSL question you had. I'll get that in tomorrow morning.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Dec 5, 2019

@skriss VSL and VolumeSnapshotClass comparison added, let me know if I should expand more.

Copy link
Contributor

@skriss skriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @nrb IMHO we've got enough clarity to move onto implementation. I'd just remove the "Draft" status from this before merging (we've decided to remove the "Status" line from design docs entirely since you originally authored this one).

Will let others finish reviewing, so not merging yet.

Let's get this broken out into individual issues/stories to work on next - happy to collaborate on that if you want.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@skriss
Copy link
Contributor

skriss commented Dec 11, 2019

LGTM. @carlisia @prydonius @ashish-amarnath let's set a deadline of this Friday for getting this reviewed - if there are no further comments by then, we'll merge and get on with it :)

Copy link
Contributor

@carlisia carlisia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-red this doc, looks great 👍

@skriss
Copy link
Contributor

skriss commented Dec 16, 2019

merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area/Design Design Documents
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants