Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Volume Snapshot in Kubernetes - Snapshot API Object discussion #44172

Closed
jingxu97 opened this issue Apr 6, 2017 · 53 comments
Closed

Volume Snapshot in Kubernetes - Snapshot API Object discussion #44172

jingxu97 opened this issue Apr 6, 2017 · 53 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jingxu97
Copy link
Contributor

jingxu97 commented Apr 6, 2017

Inspired by @MikaelCluseau's suggestion, The following is a list of discussions related to Snapshot API object design.

  1. Use new API objects for snapshot or not
    I think having new API objects to represent snapshots will be very convenient to support different operations on snapshots. The possible operations could be create snapshot from volume, create new volume or restore volume from snapshot, list and delete snapshots, show the status (available, pending etc) and detailed information (creation time, size) of a snapshot. So it would gives us more flexibility compared to just tagging the snapshot information into a PV.

  2. How snapshot is created
    Although it is possible to use just PVC to initiate a snapshot creation, I think using a new API object (SnapshotRequest) to represent snapshot creation gives much more flexibility and support more options.
    a). It could support both one-time snapshot creation or scheduled snapshot with a lot more snapshot policy options. For example, how many snapshots could be taken periodically for a volume. A snapshot should be deleted after a certain amount of time to save space. How much storage space could be used by snapshots etc.

    b). It could support to create snapshots for a list of volumes. For system admins, they might want to create snapshots for all volumes in the system overnight for backup purpose. Instead of specifying only one snapshot per PVC, they could use wild card to list all volumes in the snapshot request.

    c). It could support to create snapshot for pod, statefulset, deployments etc. Users might not have direct information about the PVs for their pods. In snapshot request, the sources to create snapshots could be PVs, or pod, statefulset and kubernetes can retrieve the volumes information from the pod objects and take snapshots for the user.

  3. How to create a new volume and restore a volume from a snapshot.
    Currently kubernetes support dynamic provisioning to create a new volume for a PVC. To create a new volume from a snapshot is similar to this process. I am thinking again to use SnapshotRequest to perform this task by giving the snapshot identifier and some keyword to indicate to create a new volume or restore the existing volume. If a new volume is created, a new PV object will be created to represent this volume. In case of restore, the existing PV still represents the reverted volume. I think this way is cleaner than mixing this function into PVC which requires modifying the PVC object.

  4. Namespace or not.
    Similar to PVC and PV, we could make SnapshotRequest as namespaced, and snapshots object as non-namespaced so that user/developer can use request.

@MikaelCluseau's Example

kind: SnapshotRequest
metadata:
  namespace: production
  name: snap_app1_20170406_010000
spec:
  persistentVolumeClaims:  #Could be a list of PV, pods, or snapshot id (create volume from a snapshot)
  - data.app-1-db-0
  - data.app-1-db-1
  - data.app-1-mq-0
  - data.app-1-mq-1
  - data.app-1-filestore-0

kind: Snapshot
metadata:
  name: production-pv-for-claim-data.app-1-db-0@snap_app1_20170406_010000
spec:
  volume: production-pv-for-claim-data.app-1-db-0
  requestRef: #the ref could be other type too
    namespace: production
    name: snap_app1_20170406_010000

Any comments and suggestions for this would be appreciated!

@jingxu97 jingxu97 added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Apr 6, 2017
@jingxu97
Copy link
Contributor Author

jingxu97 commented Apr 6, 2017

cc @kubernetes/sig-storage-api-reviews @kubernetes/sig-storage-feature-requests @kubernetes/sig-storage-misc

@wattsteve
Copy link
Contributor

cc @tsmetana

@0xmichalis 0xmichalis added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 6, 2017
@childsb
Copy link
Contributor

childsb commented Apr 6, 2017

Just to be clear, this is a two object split.. correct? There's the "SnapShot" object which has no namespace.. then a "SnapShotRequest" which is a namespaced object..

As we add more imperative actions to PVs, should we have a 'generic' request object that would allow Snapshot, Replication, Resize, Restore, etc...

@rootfs
Copy link
Contributor

rootfs commented Apr 6, 2017

And would SnapshotRequest also point to the Snapshot once snapshot succeeds?

@mcluseau
Copy link
Contributor

mcluseau commented Apr 6, 2017

@childsb I don't think having a generic request object is something we want to introduce now (and I suspect we won't need it because nearly everything will be specific to each case).

@rootfs I think I even made a mistake by adding the requestRef field in Snapshot; we should do exactly like the ReplicaSet/Pod relationship to have a consistent UX (and developer experience too ;)). We already have ObjectMeta.ownerReferences that even states something interesting:

If ALL objects in the list have been deleted, this object will be garbage collected. [...]

So owners will be the SnapshotRequest as the controller, and any volume created from the snapshot. The only thing we miss in this is the source of the Snapshot in the request (which pod's volume / PVC). I suspect this is somehow indexed by Kubernetes for already existing needs (Deployment creating ReplicaSets creating Pods).

@mcluseau
Copy link
Contributor

mcluseau commented Apr 6, 2017

And maybe rename SnapshotRequest to SnapshotSet ?

@childsb
Copy link
Contributor

childsb commented Apr 6, 2017

@MikaelCluseau a generic PVR (persistent volume request)... could be similar to PV where there's a set of common fields shared and specific fields that are specific to the actionable command. We could still keep the strong typing of fields for each request type.

Examples:

apiVersion: v1
kind: PersistentVolumeRequest
metadata:
  name: snapshot1
spec:
  snapshot:
    pvc:pvc1
apiVersion: v1
kind: PersistentVolumeRequest
metadata:
  name: action-create
spec:
  createVolumeFromSnapshot:
    snapshot:snapshot1
apiVersion: v1
kind: PersistentVolumeRequest
metadata:
  name: action5
spec:
  resizePVC:
    pvc:pvc1
    newSize:25Gi
apiVersion: v1
kind: PersistentVolumeRequest
metadata:
  name: action6
spec:
  restore:
    snapshot:snapshot1

@jingxu97
Copy link
Contributor Author

jingxu97 commented Apr 6, 2017

@childsb your proposal of a generic request is very interesting. But I have a concern about the complexity of the controller. Now with generic request, we will have many different operations from the request, we need to think through how such controller works.

@mcluseau
Copy link
Contributor

mcluseau commented Apr 6, 2017

@childsb to me it's not similar; a PV is an actual thing in the system (a mount point) that the user wants to manipulate. It just happens that this thing has a relatively complex setup because of the diversity of source and the lack of standardization in the storage industry. I think we can all agree that the PV object would be much better if we could simply have expressed the source as a path in the storage system instead of a different thing for each storage tech out there.

Here, the actual thing we manipulate is a snapshot (hence the mistake in the name SnapshotRequest, I should definitely have said SnapshotSet). In terms of consistent UX, we have StatefulSets and Deployments, not a generic object even though they share a common generic logic that's materialized as a ReplicaSet.

So, IMHO, if we need something generic at some point, I really think it shouldn't be a first class object.

@jsafrane
Copy link
Member

jsafrane commented Apr 7, 2017

@childsb, this is not how Kubernetes works. It's intent based. https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#types-kinds

Objects represent a persistent entity in the system.
Creating an API object is a record of intent - once created, the system will work to ensure that resource exists.

PersistentVolumeRequest is not an persistent entity in the system, it's command. To create a snapshot user should create an object representing the snapshot and Kubernetes will create it. To resize a volume, user modifies the volume and Kubernetes will try to resize it. It's complicated by security aspects, we don't show PVs to users, but that's why there is a claim. Users should resize claims and Kubernetes would try to make it happen in real objects outside of Kubernetes. Similarly with snapshots.

@jsafrane
Copy link
Member

jsafrane commented Apr 7, 2017

@jingxu97, IMO there needs to be an object that represents single snapshot and is in user namespace so the user can see what's its status and we can have quota around it. I don't have a good name for it, I'm going to use SnapshotClaim in this comment, but I know it's ugly (user is not claiming a snapshot, he's creating it...)

Any object that creates multiple snapshots, say SnapshotSet, should be IMO a separate Kind. It will spawn multiple SnapshotClaims in user namespace so the user can see what snapshots are available, which of them succeeded/failed and then the user can restore or delete them individually.

If there should be a Snapshot object in the global namespace is another discussion, I have nothing against it, however these should not be visible to users, similarly as PVs are not visible to users because of security. All regular operations with snapshots must be done by a user creating/modifying/deleting objects in user's namespace.

@mcluseau
Copy link
Contributor

mcluseau commented Apr 7, 2017

SnapshotClaim in this comment, but I know it's ugly (user is not claiming a snapshot, he's creating it...)

It's not that ugly, with dynamic provisionners, PVClaims are actually creating PVs.

@jsafrane
Copy link
Member

jsafrane commented Apr 7, 2017

SnapshotClaim in this comment, but I know it's ugly (user is not claiming a snapshot, he's creating it...)

It's not that ugly, with dynamic provisionners, PVClaims are actually creating PVs.

Important is the intent - with PVC, user wants to claim an existing PV. A new one is created only when there is nothing to claim. With SnapshotClaim, user probably does not want to claim an existing snapshot, he really wants to create a new one.

@tsmetana
Copy link
Member

tsmetana commented Apr 7, 2017

I actually like the idea of SnapshotRequest to create a Snapshot (object). The one thing I dislike in this proposal is the snapshotting of multiple volumes at once: the result would be multiple objects outside the cluster anyway so having them represented with separated object looks much more intuitive to me. It is also more flexible. I'd rather use some simple mechanism for marking snapshots belonging to one group.

@mcluseau
Copy link
Contributor

mcluseau commented Apr 7, 2017

The important part of "at once" is "consistent" ;-)

@jsafrane
Copy link
Member

jsafrane commented Apr 7, 2017

The important part of "at once" is "consistent" ;-)

We discussed this use case and we're trying to avoid it in the first version as it needs support in the application. How else do you ensure consistent snapshots of 10 different EBS volumes running say Cassandra?

@mcluseau
Copy link
Contributor

mcluseau commented Apr 7, 2017

I have no problem with pushing the "multiple snapshots at once" outside the scope of a first version.
To answer your question, without app-support, I think pretty good quality can be achieved just with fsfreeze, as any quorum-based CP system will stop acknowledging writes when the majority is frozen (and for AP systems you have already chosen to accept some loss). I'm not going to have an implementation here, but something along kubelets synchronizing themselves with leases could be a workable path (for instance).

@tsmetana
Copy link
Member

tsmetana commented Apr 7, 2017

fsfreeze is supported only on ext3/4, xfs, reiser and jfs... The proposal should be filesystem-agnostic and as such I don't think we should take responsibility of the the data consistency at all.

@jsafrane
Copy link
Member

jsafrane commented Apr 7, 2017

In addition, fsfreeze ensures just filesystem consistency, not consistency of the database stored on the filesystem. It may be corrupted.

Anyway, it's fine to have a high-level object like SnapshotSet and we can argue how to make it consistent later, but we need also an object that represents single snapshot in user namespace. The user needs to see what snapshots are available, their states and he/she needs a way how to restore / delete individual snapshots and not whole set. Something like StatefulSet and Pods - both are in user namespace and user can monitor them individually.

@mcluseau
Copy link
Contributor

mcluseau commented Apr 7, 2017

In addition, fsfreeze ensures just filesystem consistency, not consistency of the database stored on the filesystem. It may be corrupted.

Let's forget SnapshotSet; this applies to a single snapshot too. We won't fix users wishing filesystems without fsfreeze or equivalent support, but not using fsfreeze when possible (99% of the cases?) will cause much more damage. We won't fix in-app cache flushing generically either. Obviously, we'll have to document clearly what we do, and also do the best we can.

@jingxu97
Copy link
Contributor Author

jingxu97 commented Apr 7, 2017 via email

@jingxu97
Copy link
Contributor Author

There are a few questions have been discussed related to snapshot API design including whether the API object should be in user namespace or not, whether there should be two objects such as Request and snapshot (also snapshotSet). The following is a summary of the points.

Namespace discussion

  1. Unlike PVs that are created by admins only, snapshots could be initiated by users or admins. Clearly a snapshot object is used to show the detailed information and status about the snapshot. If user creates the snapshots, the objects should be in user namespace so that user can query/list them. Admins might also create snapshots for the PVs available in the system for data backup purpose. Those snapshots are not belong to a specific user namespace and might be used by any user.
  2. How snapshots are used is quite unique. User might want to create or restore volumes from the snapshots they have taken. It is also possible that they want to use the snapshots taken by the admins. Also considering data replication and some other scenarios, snapshots taken from a namespace might be needed in another namespace.
    volume snapshot in kubernetes

Now we propose a two-object API design. Similar to PVC/PV, snapshot has two related objects, VolumeSnapshotRequest and VolumeSnapshot.

  1. VolumeSnapshotRequest (VSR) is a request for snapshot by a user. Similar to PVC, VSR is a usernamespaced object, which provides a way to request or consume snapshot resources.
  2. VolumeSnapshot (VS) is a representation of the snapshot taken from a volume. This API object captures the details of the implementation of the snapshot. Similar to PV, VS is a non-namespaced object. There are two ways to create VS: statically or dynamically.
    Static:
    If there are existing snapshots created outside of kubernetes, an administrator can create VSes to represent details of the snapshots which are available for use by cluster users. Users create VSRs in their namespaces and point them to the corresponding VSes. VSR becomes a handler to access those snapshots by a user.
    Dynamic
    User could also user VSR to create snapshots from a volume on-demand. VSR specifies which volume the snapshot should be taken from. In such case, a corresponding VS is created. VSR and VS will have pointers to each other.

Many-to-one mapping between VSR and VS
There could be use cases that snapshots are needed across different namespaces, e.g. for data replication. To achieve this, here we propose to support many-to-one mapping relationship between VSR and VS. There is a field “allowedNamespaces” in VS/VSR object to specify in which namespaces the VSRs could be created for the snapshot. For static creation, system admins can add the list of namespaces. In dynamic creation, user specifies the namespace list which will be propagated to the VS.
If a user want to access VS from a different namespace, he/she could create a VSR to reference the VS from the namespace as long as it is in the “allowedNamespaces” list. In this way, VSR and VS is many-to-one mapping.

Differences between PVC/PV and VSR/VS
Binding between PVC/PV requires atomic transaction to avoid race condition causing multiple PVCs bind to the same PV. PVC and PV is one-to-one mapping so that we have to make sure only one PVC could bind to the PV. However, snapshots does not have this requirement so that the binding between VSR and VS does not need to be atomic.
Since VSR/VS is many-to-one mapping, delete a VSR is not necessary trigger a deletion for VS unless this VSR is the last reference to it.

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 25, 2017 via email

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 25, 2017 via email

@0xmichalis
Copy link
Contributor

I think snapshots for data replication will be very useful for StatefulSets and that means that data stays in the same namespace. I am not sure of the security implications of cross-namespace snapshots and the proposed mechanism needs to be vetted by @kubernetes/sig-auth-api-reviews folks.

I share Clayton's thoughts, why can't the flow simply include a single object and as a user I create such an object (volumeSnapshotRequest or snapshotRequest) and get back a PVC bound to the data I asked a snapshot for? Admin backups can still be PVs, no?

@jingxu97
Copy link
Contributor Author

jingxu97 commented Apr 26, 2017 via email

@0xmichalis
Copy link
Contributor

The outcome of the snapshotRequest is a new PV ("the data") bound to a new PVC. The new PVC is what I get back in the namespace I did the snapshoting.

@eparis
Copy link
Contributor

eparis commented Apr 27, 2017

SnapShots are NOT volumes. When dealing with a storage array once can create a snapshot of a volume. In many storage arrays a 'snapshot' could in fact just be a diff from some previous state. One can later create a volume from that snapshot. We plan to represent that similarly in kube.

Let me walk through a flow at a high level:

  1. User creates a PVC which creates a PV
  2. User uses this PVC like always, nothing interesting yet.
  3. User creates a SnapShotRequest which points to a PVC
  4. System triggers the snapshot creation action on the storage array. System also creates a SnapShot API object with data from the storage array which is bound to the SnapShotRequest.
  5. At some later point a user might create a new PVC which points to the SnapShotRequest
  6. System triggers the snapshot->volume action on the storage array. System also creates a new PV API object from the storage array result bound to the new PVC
  7. User uses this new PVC like always, nothing interesting anymore.

I had a discussion with clayton and tend to agree. It is bikeshedding but important bikeshedding. Lets change the object names. Users should create something called a Snapshot. The system should create an object with a less human friendly name, maybe SnapshotRepresentation or something. We don't call it a PodRequest. We don't call it an IngressRequest. The user visible name should be Snapshot. I think we messed that up for PVs and think we should get it right here. I don't really care what we call the other half. But I do feel pretty strongly about the user facing name.

(In kube -v2 I think we should make it PV and VolumeRepresentation or something like that)

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 27, 2017 via email

@jeffvance
Copy link
Contributor

jeffvance commented Apr 28, 2017

@eparis Re 5. At some later point a user might create a new PVC which points to the SnapShotRequest:
Why would the new PVC point to the snapshot request object rather than the snapshot object? In other words, the user/PVC is not requesting another snapshot, he/she is requesting a PV based on a snapshot, right?

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 28, 2017 via email

@tsmetana
Copy link
Member

Thanks for the discussion. We're moving forward.

We will use two objects and it also looks like we have agreed to rename SnapshotRequest to Snapshot. Now we need to rename the second object. I have no preferences whatsoever (SnapshotRepresentation sounds OK to me).

@jingxu97
Copy link
Contributor Author

jingxu97 commented Apr 28, 2017

I have been thinking more about this two-object proposal and PV/PVC relationship and I feel that that two-object model are necessary for pre-existing volume case, but not for dynamic provisioning case.

  1. The need for two objects: Just like node, pre-existing and configured disk storage is a type of resources that could be shared by different users. Based on the user's request, scheduler/controller finds an appropriate resources to be consumed by user. A PVC or pod spec is the place to list the requirements. Once a pod is assigned to a node, or a PVC is bound to a PV, the API object pod or PVC becomes a handler for user to access the chosen resources. Once finishing using the resources, the disk resource and its representation PV could becomes available to be consumed by others again. It supports binding and release operations between these two objects PV and PVC.

  2. Considering only dynamic provisioning case, a new volume is created for PVC creation and will be deleted for PVC deletion. I feel two objects representation is not necessary. Just like a pod, since this volume will not be reused by anyone else, one object representation is enough.

  3. For security reason, two objects with one object in non-namespace could prevent some user to forge a PVC/PV to access other people's data. But one object with subresource or hidden field could also reach the same goal, I think.

Go back to snapshot, similarly, depending on the use cases, two objects might be needed or not.

  1. If only consider the case that users create a snapshot and use a snapshot to create new volume or roll back a volume, a single object of snapshot is good enough.

  2. If we consider snapshot is a resource could be shared across different users/namespaces, two objects model might better support it. In such case, the snapshot request from a user does not really create a new snapshot, instead it binds to an existing snapshot entity (e.g., based on PVC information). But I do feel this binding is not as natural as PVC/PV case. I have a many-to-one mapping for two snapshot objects proposal to allow cross-namespace sharing, but during snapshot meeting, we got some security concerns about this feature.

I think we need to be clear about what use cases we plan to support, and whether the use cases are reasonable to make the decision. If case 2 is not something we want or should support reasonably, I do feel two objects are not very necessary. Although we are saying it is similar to PV/PVC case, but snapshot is closer to dynamic provisioning case of PVC/PV which does not really require two objects by itself.

@tsmetana
Copy link
Member

tsmetana commented May 2, 2017

I don't have that strong preference for the two-object model as it may seem. I originally proposed one object model too. The two object model emerged from the upstream discussion for the reasons I have mentioned earlier: mostly it looks to be less limiting for future development.

If case 2 is not something we want or should support reasonably

This I believe is the important point: The two-object model is unnecessary in most cases. However the data replication use case comes popping up in every discussion. The two-object model actually tells users we're not writing the use case off and we want to try to resolve it. I'm also not sure how feasible or reasonable it is...

@cantbewong
Copy link

I question whether scheduled create or delete should be in the API. Once you enable snapshots to be created by API, you can implement scheduled snapshots, including periodically scheduled snapshots. SO I don't think this level of detail needs to be, or should be called out unless you propose that the scheduled snap mechanism is part of Kubernetes itself, which I do not think is a requirement, now, and perhaps forever. Warning: if you declare yourself to be the “one true standard” for implementing this, be advised than with complexities such as timezones for one, this is a huge undertaking.

Also need to API expose snapshot delete API BTW in order to make this useful for common applications. And obviously and API to ascertain the count and identity of snapshots is also required. In order to support applications engaing in recurring snapshot create/delete, it is desirable to allow such an application to appy a label or other unique searchable/querable identifier to allow determination of whether your application created a snapshot. At a practical level you don't want to have a backup service deleting an admin created snapshot simply because it can't discriminate based on who creted the snapshot.

In order to support application initiated (as presumable application consistent) snapshots it would be desirable to have a mechanism to enable/disable this (by either pod or container), with an API authentication mechanism that simply identifies the request as coming from within the pod/container without requiring further credentials - the ask here is to enable making snapshot create
/delete calls on “self” without keeping credentials inside the pod/container. (Database admin runs script inside container that quieisces app, flushes to file file system and then triggers snapshot)

I am worried that support of a list of volumes in the snapshot create api will be erroneously interpreted as an expectation that these snaps would be captured at a “singular and common” point in time. Few (and perhaps zero) underlying storage implementations are capable of delivering this in a Kubernetes context.

I question whether bundling “create new volume from snapshot” into the SnapshotRequest is desirable. Common use cases exist for creating a volume from a snapshot long after the original snapshot was created. If as a developer you decide you want to do this, I think finding the functionality in something labeled SnapshotRequest is unexpected.

@jeffvance
Copy link
Contributor

@cantbewong wrote: Common use cases exist for creating a volume from a snapshot long after the original snapshot was created. If as a developer you decide you want to do this, I think finding the functionality in something labeled SnapshotRequest is unexpected
I agree re. naming.

@mcluseau
Copy link
Contributor

mcluseau commented May 3, 2017

@jingxu97 even in case 1, how do you hide the storage access keys from the user without 2 objects? ie, accessing a Ceph snapshot will require a monitor list, a key, a pool, etc. I don't want my users to have access to that.

@jingxu97
Copy link
Contributor Author

jingxu97 commented May 3, 2017 via email

@mcluseau
Copy link
Contributor

mcluseau commented May 3, 2017

You have the whole subject really well in mind :-) So this is a technical solution implying RBAC on subresources. It's 1 object in etcd, but 2 ones in the API anyway. Pure opinion here: I find the subresource approach much more complex to understand for the admin/user; I mean, why should I do kubectl get snapshotstorage --all-namespaces (and deduplicate the output if a snapshot is shared between namespaces) to get a list of actual "real world" objects that are not namespaced by nature? I completely fail to see any gain and still see a potential danger in forcing such an artificial property. In particular, we're kind of locking the snapshots to their namespace while they are not in reality.

@jingxu97
Copy link
Contributor Author

jingxu97 commented May 3, 2017 via email

@mcluseau
Copy link
Contributor

mcluseau commented May 3, 2017

Kubernetes does support Ceph RBD (Rados Block Device), it's what I use :) https://github.com/kubernetes/kubernetes/tree/master/examples/volumes/rbd.

The ceph workflow is around this:

# Create a volume (ie the provisionner does that)
rbd create [key, mons, pool args] my-image --size 1024

# map and use the image...
rbd map [key, mons, pool args] my-image
=> /dev/rbd0
mount /dev/rbd0 /var/lib/kubelet/plugins/...rbd/...

# Take a snapshot
fsfreeze -f /dev/rbd0
rbd snap [key, mons, pool args] create my-image@snap-1
fsfreeze -u /dev/rbd0

# List snapshots of an image
rbd snap [key, mons, pool args] ls my-image

# Create a volume from the snapshot
rbd clone [key, mons, pool args] my-image@snap-1 my-image-from-snap-1

The key/user are usually those given to the "volume manager" (ie cinder in openstack). So, if "the user" is Kubernetes, then yes, the user has to provide its key to make operations on the images. But if "the user" is the Kubernetes' user, then she doesn't need to provide this information (there's no link between the Ceph's user and the Kubernetes' one). So, Kubernetes will have access to the snapshots it does, whatever the namespace or any Kubernetes' concept is involved.

In the PV/PVC split, the key, mons and pool args are in the PV, provided for instance by the provisionner.

@rootfs
Copy link
Contributor

rootfs commented May 3, 2017

@MikaelCluseau good point. So in rbd PV/PVC, there are two keys: admin and user. Admin key is used to create and delete volume while user key is used to map/unmap rbd image. We can use the same idea here: a admin key in its own namespace and let controller use it to snap/clone.

@timothysc
Copy link
Member

/cc @skriss

@erictune
Copy link
Member

I had a long chat with @jingxu97 about this today. She asked me to summarize here. I haven't read this whole issue, so apologies in advance if I misunderstand or state what has already been said.

In Kubernetes RBAC, it is desirable, and I think common, to give a user within a namespace permission to access all resources of all objects within that namespace, including subresources.

(Details: this happens by giving that user the cluster-admin role via a RoleBinding in that namespace, which effectively makes it a namespace-admin role, although that is not what it is called. See: https://kubernetes.io/docs/admin/authorization/rbac/#default-roles-and-role-bindings)

So, there is no way to prevent at this namespace-admin user from having access to all subresources of a namespaced Snapshot Kind.

Therefore, there is no way from preventing someone in namespace N1 from editing a Snapshot object to point to a path with data not belonging to that namespace.

(You might ask why not change the role definition. The problem is that then the role definition would either have to have an explicit deny in it (which RBAC intentionally does not support because it has other problems) or the role would have to enumerate every kind and subresource except the one sensitive one, which is not possible, since the set of kinds is unbounded.)

The best approach, from our conversation, seemed to be either the two Kinds approach or storing some private-to-the-controller information in an external data store (effectively two objects).

@jeffvance
Copy link
Contributor

Is it an option, at least initially if not permanently, to disallow all editing of a snapshot? In a way, this is consistent with the basic concept of snap-shotting...

@tsmetana
Copy link
Member

Thanks for the updates. I think the two objects would be the safest path forward then: it looks to me it might be really more extensible in the future...

@erictune
Copy link
Member

On further thought, my comment above may not be accurate. I've asked the RBAC experts in sig-auth for clarification.

@erictune
Copy link
Member

I spoke with Sig Auth. The leads have a different view of how the namespace admin role is going to be extended than I had. Their view is consistent with the subresources approach. So, I retract my previous objection. Apologies if this slowed things down.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2017
@fejta-bot
Copy link

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 24, 2018
@fejta-bot
Copy link

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Projects
None yet
Development

No branches or pull requests