
KEP: StoragePool API for Advanced Storage Placement #1347

Closed

Conversation

cdickmann

This KEP aims to extend Kubernetes with a new storage pool concept that enables the underlying storage system to provide more control over placement decisions. These are expected to be leveraged by application operators for modern scale-out storage services (e.g. MongoDB, ElasticSearch, Kafka, MySQL, PostgreSQL, Minio, etc.) in order to optimize availability, durability, performance, and cost.

The document lays out the background for why modern storage services benefit from finer control over placement, explains how SDS offers and abstracts such capabilities, and analyzes the gaps in the existing Kubernetes APIs. Based on that, it derives detailed Goals and User Stories, proposes introducing a new StoragePool CRD, and suggests how to use the existing StorageClass to steer placement to these storage pools.
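For orientation, the YAML fragments quoted in the review below suggest an API shape roughly like the following Go types. This is only an illustrative sketch reconstructed from those fragments; any field or type name not shown in the KEP excerpts is an assumption, not part of the proposal.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// StoragePool is a sketch of the proposed CRD, reconstructed from the
// YAML fragments quoted in this review. Illustrative only.
type StoragePool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   StoragePoolSpec   `json:"spec,omitempty"`
	Status StoragePoolStatus `json:"status,omitempty"`
}

// StoragePoolSpec mirrors the "driver" and opaque "parameters" keys shown
// in the quoted example.
type StoragePoolSpec struct {
	// Driver is the name of the CSI driver that owns this pool.
	Driver string `json:"driver"`
	// Parameters are opaque, driver-specific key/value pairs.
	Parameters map[string]string `json:"parameters,omitempty"`
}

// StoragePoolStatus mirrors the "accessibleNodes" and "capacity.total"
// keys shown in the quoted example. The review below also discusses a
// node selector as an alternative to a node-name list.
type StoragePoolStatus struct {
	// AccessibleNodes lists the nodes that can reach this pool.
	AccessibleNodes []string `json:"accessibleNodes,omitempty"`
	// Capacity reports aggregate capacity of the pool.
	Capacity StoragePoolCapacity `json:"capacity,omitempty"`
}

// StoragePoolCapacity holds capacity numbers (bytes are an assumed unit).
type StoragePoolCapacity struct {
	Total int64 `json:"total,omitempty"`
}
```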

@k8s-ci-robot
Contributor

Welcome @cdickmann!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @cdickmann. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2019
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 31, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cdickmann
To complete the pull request process, please assign saad-ali
You can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Oct 31, 2019
@cdickmann cdickmann force-pushed the cdickmann-storagepool-kep branch from fba9e3f to 398a40e on November 1, 2019 09:24
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

-->
Contributor

I think all these comments above can be removed

  - node2
  capacity:
    total: 124676572
```
Contributor

Who installs and updates this CRD?

Can API extensions that core Kubernetes components (like the Kubernetes scheduler, should it start to use this information) depend on be delivered as a CRD, or do they have to be built into the apiserver?

This is an implementation detail, but it's still something that needs to be decided. We've gone from CRDs to in-tree APIs before (CSINodeInfo, CSIDriverInfo) and that caused quite a bit of extra work - it would be nice to avoid that by picking the right solution from the start.

Contributor

I think this CRD can be installed in a similar way to the snapshot beta CRDs, i.e. by the cluster addon manager.

Contributor

Do you have some pointers for me about that? Are there language bindings involved and how does Kubernetes pull those in?

Contributor

We discussed this a bit on the #csi Slack channel:

To me it still looks like core API extensions need to be done in apiserver.

Contributor

@xing-yang Nov 12, 2019

CSINode was transitioned from a CRD to an in-tree API mainly because kubelet depends on it.

Unless we can prove that we can't achieve our goals with CRDs, we can't explain to sig-architecture why StoragePool has to be an in-tree API.

Contributor

CSINode was transitioned from a CRD to an in-tree API mainly because kubelet depends on it.

The same applies here once we want to use the capacity information also for scheduling pods, because then the Kubernetes controller depends on the capacity API - see #1353.

apiVersion: storagepool.storage.k8s.io/v1alpha1
kind: StoragePool
metadata:
  name: storagePool1
Contributor

Who chooses the name and how? It's distributed, so there has to be a naming scheme that avoids conflicts.

Author

That's a good question. Do we have precedent for the infrastructure surfacing entities and having to give them unique names? I would assume PV names fall into that category?

Contributor

PVs get unreadable names like pvc-24fcddf6-9178-4fe4-b71f-4417652bca5d. One has to find the right object through a reference elsewhere (PVC.Spec.VolumeName is set to it) or by searching for the right one by its attributes.

Author

OK, yeah, that's likely not a great path to follow :)

Member

@cdickmann - for load balancer services, Kubernetes relies on the Service UID field to name/create resources at the infrastructure provider level; also +1 on PVs.

Contributor

In the "New external controller for StoragePool" section, we mentioned that CSI driver will implement a ControllerListStoragePools() function. The controller will call that CSI function at startup to discover all pools which include names of the pools on the storage systems. We should probably consider appending a uuid to those names to avoid name collision.

spec:
  driver: myDriver
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
Contributor

If these parameters are opaque to anyone other than the CSI driver itself, how are they going to be used?

Author

The section "Using StorageClass for placement" discusses that applications should copy these into the StorageClass, until a separate KEP automates placement.

Contributor

Sorry, somehow I missed that.

Which binding mode is that StorageClass meant to have? For the intended use case it'll probably be the normal immediate binding mode, right?

With delayed binding mode you would be back at the problem described in #1353: a node might get chosen without Kubernetes knowing whether the node is suitable for the intended StoragePool.

Author

That is correct; let me say that more explicitly. I added "Most commonly this will lead to a "compute-follows-storage" design, i.e. it won't use WaitForFirstConsumer and instead have the PV be created first, and then have the Pod follow it based on which nodes the volume is accessible from."
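To make the "copy the pool's opaque parameters into a StorageClass" flow concrete, here is a hedged client-go sketch; the driver name and parameters are the placeholder values from the quoted example, and VolumeBindingImmediate reflects the compute-follows-storage point above. None of this is prescribed by the KEP.

```go
package main

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Immediate binding: the PV is provisioned into the chosen pool first,
	// and the pod then follows the volume ("compute-follows-storage").
	binding := storagev1.VolumeBindingImmediate

	sc := &storagev1.StorageClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "storagepool1-class"},
		Provisioner: "myDriver",
		// Copied verbatim from the StoragePool's opaque parameters.
		Parameters: map[string]string{
			"csiBackendSpecificOpaqueKey": "csiBackendSpecificOpaqueValue",
		},
		VolumeBindingMode: &binding,
	}

	if _, err := cs.StorageV1().StorageClasses().Create(context.TODO(), sc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```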


#### CSI changes

* Add a “StoragePoolSelector map[string]string” field in CreateVolumeRequest.
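For context, the sketch below illustrates the proposed field with local placeholder types; it is not the real CSI Go binding, and the field does not exist in the CSI spec today.

```go
package main

import "fmt"

// createVolumeRequest is a local stand-in for the CSI CreateVolumeRequest
// message; it is NOT the real CSI type. It only exists to illustrate the
// field this KEP proposes to add.
type createVolumeRequest struct {
	Name       string
	Parameters map[string]string
	// StoragePoolSelector is the proposed addition: opaque key/value pairs
	// that the CSI driver matches against its pools when placing the volume.
	StoragePoolSelector map[string]string
}

func main() {
	req := createVolumeRequest{
		Name:       "pvc-example",
		Parameters: map[string]string{"csiBackendSpecificOpaqueKey": "csiBackendSpecificOpaqueValue"},
		// How these values reach the provisioner is exactly the open
		// question raised in the review comment below.
		StoragePoolSelector: map[string]string{"pool": "storagePool1"},
	}
	fmt.Printf("%+v\n", req)
}
```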
Contributor

Can you clarify where the external-provisioner gets those values when calling CreateVolumeRequest? The text implies that this comes from the PVC, but then that's another Kubernetes API change that would have to be defined in this KEP.

Contributor

I just noticed that you removed some PVC extension in commit 398a40e. I think that PVC extension needs to be added back. Without it, this section here and the parameters field in the CRD are hard to understand.

You could also remove those, but is there then enough content left in this KEP to understand how the new API is meant to be used?

Author

Indeed, this is left over. I am removing this line. The sections "Usage of StoragePool objects by application" and "Using StorageClass for placement" should make clear how applications can construct a StorageClass from the StoragePool that allows for placement into the pool. The KEP also expresses that it expects future work in this area to further automate/simplify placement.


#### New external controller for StoragePool

CSI will be extended with a new ControllerListStoragePools() API, which returns all storage pools available via this CSI driver. A new external K8s controller uses this API upon CSI startup to learn about all storage pools and publish them into the K8s API server using the new StoragePool CRD.
Contributor

Do you mean a new sidecar? Does that new API go into the controller service and thus the new sidecar runs alongside that?

Note that CSI drivers which only support ephemeral inline volumes don't need to implement a controller service. In #1353 I propose to require that they provide GetCapacity and to run external-provisioner on each node in a mode where it just takes that information and puts it into the capacity object for the driver.

The decision to extend external-provisioner was based on the observation in the Kubernetes-CSI WG that the proliferation of sidecars is making the release process more complicated and wastes resources, because typically they get deployed together anyway.

Contributor

This needs to be discussed further. One solution is to let the external-provisioner handle it. Another solution is to have a separate storage pool sidecar handle it. A third solution is to follow the snapshot controller split model and have a common storage pool controller plus a sidecar.

Author

Personally I have no preference and will go with the guidance of the rest of the community.
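Whichever component ends up hosting it, the discover-then-publish flow described in the quoted KEP text might look roughly like this sketch; the RPC name, response shape, and CRD client here are placeholders for proposed, not yet existing, APIs.

```go
package main

import (
	"context"
	"log"
)

// poolInfo is a placeholder for whatever ControllerListStoragePools would
// return per pool; the real CSI message does not exist yet.
type poolInfo struct {
	Name            string
	AccessibleNodes []string
	TotalCapacity   int64
	Parameters      map[string]string
}

// poolLister stands in for the CSI controller client exposing the proposed RPC.
type poolLister interface {
	ControllerListStoragePools(ctx context.Context) ([]poolInfo, error)
}

// poolPublisher stands in for a client that creates/updates StoragePool
// objects (the proposed CRD) in the API server.
type poolPublisher interface {
	CreateOrUpdateStoragePool(ctx context.Context, p poolInfo) error
}

// syncPoolsOnStartup is the "discover then publish" loop described in the KEP:
// ask the driver for all pools, then mirror each one into the API server.
func syncPoolsOnStartup(ctx context.Context, csi poolLister, k8s poolPublisher) error {
	pools, err := csi.ControllerListStoragePools(ctx)
	if err != nil {
		return err
	}
	for _, p := range pools {
		if err := k8s.CreateOrUpdateStoragePool(ctx, p); err != nil {
			log.Printf("failed to publish pool %q: %v", p.Name, err)
			return err
		}
	}
	return nil
}

func main() {
	// Wiring of real clients omitted; this file only sketches the control flow.
	_ = syncPoolsOnStartup
}
```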

pohly added a commit to pohly/enhancements that referenced this pull request Nov 5, 2019
As pointed out in
kubernetes#1347, there may be
other attributes than just capacity that need to be tracked (like
failures). Also, there may be different storage pools within a single
node.

Avoiding "capacity" in the name of the data structures and treating it
as just one field in the leafs of the data structure allows future
extensions for those use cases without a major API change. For the
same reason all fields in the leaf are now optional, with reasonable
fallbacks if not set.
@yastij
Member

yastij commented Nov 6, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from yastij November 6, 2019 15:58
  parameters:
    csiBackendSpecificOpaqueKey: csiBackendSpecificOpaqueValue
status:
  accessibleNodes:
Member

Have we considered a label selector for nodes instead of a list of names? Maybe allow both (though not together)?

Contributor

Oh, somehow I managed to miss this field during my first pass through this KEP. I agree with @andrewsykim, relying exclusively on a list of node names is not going to scale for large clusters for pools which are accessible by all nodes.

In #1353 I propose to use a v1.NodeSelector exactly for this reason.

Contributor

accessibleNodes here is under status, so it should be a list of node names, as the driver should know which nodes this pool is accessible from.

We could have a node selector in spec.

Member

Ah, good point about status. I don't see accessibleNodes in spec; is the consumer of the StoragePool API supposed to know that without input from the storage pool API?

Author

All, I don't quite follow the discussion. StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector? It doesn't really know the node labels, right?

But why do we need the node selector in the spec? What would it express? Similar question for Andrew. My expectation is that the consumer (an application) would read the status of the storage pool and hence know the nodes.

Contributor

You are probably thinking of pools which only have a very short list of nodes. That may be valid for the intended purpose of this KEP.

But I am looking at it from the perspective of #1353 where a StoragePool might be accessible from one half of a very large cluster. Then listing all nodes is not going to be very space-efficient. I also don't know how often nodes are added or removed - such changes might then force updating the StoragePool.

I think the ideal API should support both: NodeSelector and lists of node names. Then whoever creates those objects can choose the method that is more suitable.

Contributor

StoragePool is generated by code, in fact coming from CSI information. How would the CSI driver construct a NodeSelector? It doesn't really know the node labels, right?

The CSI driver also doesn't know what the nodes are called in Kubernetes. So whoever calls the CSI driver will have to translate between the naming used by the CSI driver and the naming used by Kubernetes.

Author

OK, so you are suggesting that the controller may do something like add a label to all the nodes in "row 1" and then set up a NodeSelector accordingly. This way, neither the StoragePool nor any Pod that tries to affinitize to those nodes has to carry a huge list of node names. I think that makes sense. I will update the proposal accordingly.
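A small sketch of the labelling approach agreed on here, assuming the external controller labels the accessible nodes itself and then publishes a v1.NodeSelector on the StoragePool; the label key is made up for illustration.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodeSelectorForPool builds a v1.NodeSelector that matches every node the
// controller has labelled as having access to the given pool. The label key
// is a placeholder; the KEP does not define one.
func nodeSelectorForPool(poolName string) *v1.NodeSelector {
	return &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchExpressions: []v1.NodeSelectorRequirement{{
				Key:      "example.storagepool.storage.k8s.io/" + poolName,
				Operator: v1.NodeSelectorOpExists,
			}},
		}},
	}
}

func main() {
	// The controller would label accessible nodes with the same key and then
	// store this selector on the StoragePool instead of a long node-name list.
	fmt.Printf("%+v\n", nodeSelectorForPool("storagePool1"))
}
```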

@cdickmann
Author

One more try for CLA

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 12, 2019
@dims
Member

dims commented Nov 12, 2019

w00t! authorized

@yastij
Member

yastij commented Nov 12, 2019

/cc @thockin

@k8s-ci-robot k8s-ci-robot requested a review from thockin November 12, 2019 12:28
pohly added a commit to pohly/enhancements that referenced this pull request Dec 13, 2019
The "user stories" and goals section gets revamped to include more
examples and incorporate the idea behind
kubernetes#1347.

Storage capacity also needs to be updated for snapshotting or
resizing.
CSI will be extended with a new ControllerListStoragePools() API, which returns all storage pools available via this CSI driver. A new external K8s controller uses this API upon CSI startup to learn about all storage pools and publish them into the K8s API server using the new StoragePool CRD.

#### Usage of StoragePool objects by application
Storage applications (e.g. MongoDB, Kafka, Minio, etc.), with their own operators (i.e. controllers), consume the data in the StoragePool objects; this is how they understand topology. For example, let's say we are in User Story 2, i.e. every Node has a bunch of local drives, each published as a StoragePool. The operator will understand that they are "local" drives by seeing that only one K8s Node has access to each of them, and it will see how many there are and what size they are. Based on the mirroring/erasure coding scheme of the specific storage service, it can then determine how many PVCs to create, of what size, and on which of these StoragePools. Given that a StoragePool may model something as fragile as a single physical drive, it is a real possibility for a StoragePool to fail or be temporarily inaccessible. The StoragePool hence also communicates that fact, so the storage service operator can understand when and why volumes are failing and how to remediate (e.g. by placing an additional PVC on another pool and moving some data around).
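As a rough illustration of the consumption pattern described in the quoted paragraph above, an operator might classify pools and derive a PVC count along these lines; the types and the replication logic are simplified assumptions, not part of the KEP.

```go
package main

import "fmt"

// pool is a minimal local view of a StoragePool's status, mirroring the
// fields discussed in this KEP (illustrative only).
type pool struct {
	Name            string
	AccessibleNodes []string
	TotalBytes      int64
	Healthy         bool
}

// localPools returns the pools that behave like node-local drives: exactly
// one node can access them, as in User Story 2.
func localPools(pools []pool) []pool {
	var out []pool
	for _, p := range pools {
		if len(p.AccessibleNodes) == 1 && p.Healthy {
			out = append(out, p)
		}
	}
	return out
}

// pvcsForReplication decides how many PVCs a storage service with the given
// replication factor can place, one per healthy local pool, capped by the
// number of distinct nodes so replicas do not share a failure domain.
func pvcsForReplication(pools []pool, replicationFactor int) int {
	nodes := map[string]bool{}
	for _, p := range localPools(pools) {
		nodes[p.AccessibleNodes[0]] = true
	}
	if len(nodes) < replicationFactor {
		return 0 // not enough independent failure domains
	}
	return len(nodes)
}

func main() {
	pools := []pool{
		{Name: "node1-ssd0", AccessibleNodes: []string{"node1"}, TotalBytes: 124676572, Healthy: true},
		{Name: "node2-ssd0", AccessibleNodes: []string{"node2"}, TotalBytes: 124676572, Healthy: true},
		{Name: "node2-ssd1", AccessibleNodes: []string{"node2"}, TotalBytes: 124676572, Healthy: false},
	}
	fmt.Println("placeable PVCs:", pvcsForReplication(pools, 2))
}
```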


Instead of one StoragePool per local drive, is there a possibility of grouping a set of local drives, which would then make sense as a pool?
If the app operator really needs a StoragePool that maps to a single local drive, it could instead create a PV out of that local drive and use it in the app Deployment/StatefulSet.

@xing-yang
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 24, 2020
@k8s-ci-robot
Contributor

@cdickmann: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-enhancements-verify | 1fe6996 | link | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@xing-yang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 23, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 21, 2020
@xing-yang
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Sep 10, 2020
@fejta-bot

Enhancement issues opened in kubernetes/enhancements should never be marked as frozen.
Enhancement Owners can ensure that enhancements stay fresh by consistently updating their states across release cycles.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 10, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 9, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 8, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
