Conversation

@pohly
Contributor

@pohly pohly commented May 29, 2020

This marks items as completed which are (or soon will be) done.
PRR review will happen as part of merging this PR.

/assign @wojtek-t

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels May 29, 2020
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch 2 times, most recently from 1073bcb to f111fcf on May 29, 2020 14:42
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [X] (R) Production readiness review completed
Member

Sorry - this is super inconvenient for both you and me, but given I can't comment on lines you didn't change I can only do that here:

<unrelated to PRR itself, though I wanted to understand some stuff>

L583: you're saying that for DaemonSet deployment the owner will be a pod. That seems really suspicious. What if I need to update the spec of DS? The pods will get recreated, the objects will get deleted and the new ones will get created? That doesn't sound like the best option to me...
I think it should be the node (and potentially some controller that will be removing that OwnerRef if the node is no longer targeted by the driver...)

L758: if I disable the feature and never reenable again, then those objects will keep staying in etcd. It's probably fine, because we expect to graduate it anyway, but probably worth pointing that out explicitly.

L781: Can you add explicit "Will be added before the transition to beta." (as you added later).

L824: While scalability is not required for Alpha, I really recommend that you think about this question now: "*Will enabling / using this feature result in any new API calls?"
If we implement it badly, it may not scale in large clusters (overload apiserver/etcd), which would block the transition to GA. It's fine to take shortcuts or use inefficient implementations for Alpha, but if you do that unconsciously it may require significant unneeded changes earlier.
If you say "no", I can't block it, but I highly encourage you to spend a bit of time thinking about it and fill this in.

Contributor Author

What if I need to update the spec of DS? The pods will get recreated, the objects will get deleted and the new ones will get created?

Correct.

That doesn't sound like the best option to me... I think it should be the node (and potentially some controller that will be removing that OwnerRef if the node is no longer targeted by the driver...)

Such a controller would have to be part of the driver deployment because Kubernetes itself has no knowledge about how the driver is generating these objects. This then brings us back to the "what if the driver has no chance to clean up" part: simply removing the driver and its operator may leave stale objects behind. I'm not sure how common this mode of operation will be (it was primarily meant for local ephemeral volumes, but now those are going to be handled via "generic ephemeral inline volumes"), so I'd rather keep it very simple and refine later if needed.

L758: if I disable the feature and never reenable again, then those objects will keep staying in etcd. It's probably fine, because we expect to graduate it anyway, but probably worth pointing that out explicitly.

So they won't be garbage-collected because the type is disabled? Hmm, I already had such a warning in the paragraph: "However, the new objects will still remain in the etcd database." I've added a "..., they just won't be visible until the feature is enabled again."

L781: Can you add explicit "Will be added before the transition to beta." (as you added later).

Done.

L824: While scalability is not required for Alpha, I really recommend that you think about this question now: "*Will enabling / using this feature result in any new API calls?"

This has been discussed extensively. I could try to summarize all of those discussions now, but the result would probably be incomplete and/or out-dated because we intend to do performance evaluations of the alpha implementation and then change the implementation before beta. So I'd rather skip this part for alpha.

Member

What if I need to update the spec of DS? The pods will get recreated, the objects will get deleted and the new ones will get created?

Correct.

I don't think this is a behavior we really want. I don't want to block the whole effort on that, but I would like the item of figuring out the behavior of disappearing capacity objects on DS updates to really be solved for Beta.

So please add that to the Beta graduation criteria.

L610/L596: why do those objects need to be periodically updated? It seems that we need to fetch data from drivers periodically, but if nothing has changed, the updates shouldn't even be sent (we should be watching those objects on the provisioner side and, if the value is the same in the cache, just skip updates)

This has been discussed extensively. I could try to summarize all of those discussions now, but the result would probably be incomplete and/or out-dated because we intend to do performance evaluations of the alpha implementation and then change the implementation before beta. So I'd rather skip this part for alpha.

I think that scalability of this feature is potentially questionable, so I would really like to avoid having something that won't work by design. I'm fine with not having all the answers there, but at least the first question in this section (any new API calls) should be answered at this point.

Contributor Author

I don't think this is a behavior we really want. I don't want to block the whole effort on that, but I would like the item of figuring out the behavior of disappearing capacity objects on DS updates to really be solved for Beta.

So please add that to the Beta graduation criteria.

Okay, but let's make it an item about deciding what the desired behavior is and then implementing that.

why do those objects need to be periodically updated? It seems that we need to fetch data from drivers periodically, but if nothing has changed, the updates shouldn't even be sent (we should be watching those objects on the provisioner side and, if the value is the same in the cache, just skip updates)

Correct, and that's how it is implemented - there is regular polling, but objects only get updated if changed. I'll update the text to make that clear.
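For illustration, the "poll, but only write on change" logic boils down to something like the following Go sketch (not the actual external-provisioner code; the helper name and the use of the v1alpha1 client are assumptions for this example):

    package capacity

    import (
        "context"

        storagev1alpha1 "k8s.io/api/storage/v1alpha1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // syncOne compares the capacity reported by the driver for one segment/storage
    // class pair against the stored object and only calls Update when the value
    // actually changed, so periodic polling does not turn into periodic writes.
    func syncOne(ctx context.Context, client kubernetes.Interface, existing *storagev1alpha1.CSIStorageCapacity, reported resource.Quantity) error {
        if existing.Capacity != nil && existing.Capacity.Cmp(reported) == 0 {
            return nil // polled value unchanged, skip the API call
        }
        updated := existing.DeepCopy()
        updated.Capacity = &reported
        _, err := client.StorageV1alpha1().CSIStorageCapacities(updated.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
        return err
    }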

I think that scalability of this feature is potentially questionable

The problem is that scalability may be fine for some use-cases and not for others. I think adding such a feature is worthwhile, it just has to be well understood and documented what the feature is good for.

I'm fine with not having all the answers there, but at least the first question in this section (any new API calls) should be answered at this point.

Okay, I'll try to add something.

Contributor Author

I thought a bit more about this and came to the conclusion that the owner reference handling needs to become optional. We can kill several birds with one stone here: support drivers that are deployed outside of the cluster, reduce watch traffic by filtering which CSIStorageCapacity objects need to be sent to each external-provisioner instance, and avoid recreating all objects during a DaemonSet update. For that we need to introduce labels.

@msau42: can you look at the "storage capacity: optional owner reference + watch filter" commit (currently 33e9f28)? Does this make sense?

Member

Yes, that sounds good to me!

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 2, 2020
@pohly
Contributor Author

pohly commented Jun 2, 2020

@wojtek-t: PR updated, please take another look.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 31, 2020
@pohly
Contributor Author

pohly commented Aug 31, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 31, 2020
@deads2k
Contributor

deads2k commented Oct 5, 2020

I came across this trying to dig through old PRR reviews. Can you confirm this is alpha in 1.19 and planned for beta in 1.21?

Also, can you add a trailing space or something to each PRR line so we can see them all? I expanded sections and did not see details around metrics, which I would really like to have.

@pohly
Contributor Author

pohly commented Oct 5, 2020

I came across this trying to dig through old PRR reviews. Can you confirm this is alpha in 1.19 and planned for beta in 1.21?

Correct. I know that we are late with this PR, and I already pinged @wojtek-t on Slack to get it moving again.

Also, can you add a trailing space or something to each PRR line so we can see them all? I expanded sections and did not see details around metrics, which I would really like to have.

So far I only filled in the feature gate section in: https://github.com/kubernetes/enhancements/blob/3148611e140dd3af05de796dabc672e356e02ca1/keps/sig-storage/1472-storage-capacity-tracking/README.md#production-readiness-review-questionnaire

With metrics, you were probably thinking of the "Monitoring requirements"? That part still needs to be added... for beta.
My goal right now with this PR is just to get the requirements for alpha fulfilled; alpha didn't need those parts yet.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2021
@pohly
Contributor Author

pohly commented Jan 3, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2021
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [X] (R) Production readiness review completed
Member

What if I need to update the spec of DS? The pods will get recreated, the objects will get deleted and the new ones will get created?

Correct.

I don't think this is a behavior we really want. I don't want to block the whole effort on that, but I would like the item of figuring out the behavior of disappearing capacity objects on DS updates to really be solved for Beta.

So please add that to the Beta graduation criteria.

L610/L596: why do those objects need to be periodically updated? It seems that we need to fetch data from drivers periodically, but if nothing has changed, the updates shouldn't even be sent (we should be watching those objects on the provisioner side and, if the value is the same in the cache, just skip updates)

This has been discussed extensively. I could try to summarize all of those discussions now, but the result would probably be incomplete and/or out-dated because we intend to do performance evaluations of the alpha implementation and then change the implementation before beta. So I'd rather skip this part for alpha.

I think that scalability of this feature is potentially questionable, so I would really like to avoid having something that won't work by design. I'm fine with not having all the answers there, but at least the first question in this section (any new API calls) should be answered at this point.


When disabling support in the API server, the new object type and
thus all objects will disappear together with the new flag in
will disappear together with the new flag in
Member

What exact behavior is the feature gate controlling on the kube-apiserver side?

It seems to me that the kube-apiserver is mostly just passing the state through and nothing really happens there...

Another question is - how we can control the behavior on the external-provisioner side.

Contributor Author

What exact behavior is the feature gate controlling on the kube-apiserver side?

There's no specific logic in kube-apiserver besides offering the API. It has to be a builtin type because there was (and IMHO still is) no way to ensure that a Kubernetes cluster has a CRD installed. Another unsolved (?) issue is generating Go bindings for a CRD that then can be used by kube-scheduler.

Another question is - how we can control the behavior on the external-provisioner side.

There are command line flags for it: https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support

Cluster admins can influence that when deploying a CSI driver.

Member

OK - I think we need to distinguish two different things in this question:

  • when the feature is disabled itself (but no apiserver rollback - so API is still there)
  • when there is version rollback (so API is also rolled back)

Can you make those a bit more explicit?
[It's a different thing, but might be worth looking into this one: https://github.com//pull/2199/files#diff-b18c53698772d9f0dcc15e9f35d90ecd0f181923f33e40115036991aa0b5f1c5R236 ]

Contributor Author

Depending on how thoroughly the feature gets disabled, the API is not there anymore. It will be gone when any of the following happens for the apiserver:

  • the v1alpha1 API group gets disabled (not relevant anymore once it is beta)
  • the feature is turned off for the apiserver
  • the apiserver is rolled back to a version without the API

In this section, I already tried to discuss disabling the feature in the three places where it could be disabled separately (apiserver, scheduler, CSI driver). I'm not sure what to add?

`CSIDriver`, which will then cause kube-scheduler to revert to
provisioning without capacity information. However, the new objects
will still remain in the etcd database.
will still remain in the etcd database, they just won't be visible.
Member

Hmm - so what exactly is happening. Will the CRD itself be deleted?
If so, the underlying CRs will also get deleted, no?

Member

Actually - is it CRD or is it a built-in API? I think it's not explicitly mentioned anywhere...

Contributor Author

Built-in, see #1829 (comment).

Contributor Author

I've clarified that at the top of the API section.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 26, 2021
@pohly pohly changed the title storage capacity: finish 1.19 review storage capacity: PRR and beta review Jan 27, 2021
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from c0b1f88 to d2daf15 on January 28, 2021 10:49
CRD because kube-scheduler will depend on this API and ensuring that a
CRD gets installed and updated together with the core Kubernetes is
still an unsolved problem.

Member

nit: fix indentation for types defined below

Contributor Author

Fixed.

While external-provisioner runs, it needs to update and potentially
The owner should be the higher-level app object which defines the provisioner pods:
- the Deployment or StatefulSet for central provisioning
- the DaemonSet for distributed provisioning
Member

I'm glad to see this was changed (comparing to what it initially was).

there are multiple instances with leadership election or the driver
gets upgraded.

This ownership reference is optional. It gets enabled with
Member

What is the reason for making it optional?

Contributor Author

It can't be mandatory because drivers that aren't deployed inside Kubernetes have nothing that can serve as owner.

We could remove it completely, but I find it useful because admins don't need to worry about manually removing objects after removing the "normal" driver objects that they know about (StatefulSet, DaemonSet, etc.).

That leaves keeping it as an optional feature.

Member

Would be useful to provide this context in the KEP itself.

Contributor Author

Added.

`CSIStorageCapacity` objects after uninstalling a CSI driver. To
simplify that and for efficient access to objects, external-provisioner
sets some labels:
- `app.kubernetes.io/instance`: the CSI driver name
Member

This is the decision of API approvers, but I'm not a fan of this name.
Can we change that so that it will be clear that it's about csi-driver name?

Contributor Author

I picked app.kubernetes.io because it seemed useful to follow conventions; in this case, the CSI driver is the "app". But I don't have a strong opinion about it. Some other, external-provisioner specific label also works. How about external-provisioner.k8s.io/drivername?

Member

It will be API approver decision, but I like the external-provisioner.k8s.io/drivername much more :)

Contributor Author

Let me add that these labels aren't part of the CSIStorageCapacity API, nor was the owner reference. The scheduler doesn't know about them and doesn't use them.

Instead, this is an implementation detail of the external-provisioner. It needs to be documented as part of the external-provisioner because admins may need that information. The name of the labels could even be configurable, although I don't see a need for that.

Member

Agree I think we should scope down the label key to be more specific and matching the format of other labels we've added for csi, like "csi.storage.k8s.io"

Contributor Author

I've changed it to

csi.storage.k8s.io/drivername=<driver name>
csi.storage.k8s.io/managed-by=external-provisioner|external-provisioner-<host name>
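For illustration, this is roughly how such an object could be stamped with those labels and the optional owner reference (a Go sketch against the v1alpha1 API as shipped in 1.19; the function, the generateName prefix and the ownerRef handling are assumptions for this example, not the actual external-provisioner code):

    package capacity

    import (
        "context"

        storagev1alpha1 "k8s.io/api/storage/v1alpha1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // createCapacity writes one CSIStorageCapacity object for a (segment, storage class)
    // pair. ownerRef may be nil, e.g. for drivers deployed outside of the cluster.
    func createCapacity(ctx context.Context, client kubernetes.Interface, namespace, driverName, instance, scName string,
        topology *metav1.LabelSelector, capacity resource.Quantity, ownerRef *metav1.OwnerReference) error {
        obj := &storagev1alpha1.CSIStorageCapacity{
            ObjectMeta: metav1.ObjectMeta{
                GenerateName: "csisc-", // illustrative prefix
                Namespace:    namespace,
                Labels: map[string]string{
                    "csi.storage.k8s.io/drivername": driverName,
                    "csi.storage.k8s.io/managed-by": instance, // "external-provisioner" or "external-provisioner-<host name>"
                },
            },
            StorageClassName: scName,
            NodeTopology:     topology,
            Capacity:         &capacity,
        }
        if ownerRef != nil {
            obj.OwnerReferences = []metav1.OwnerReference{*ownerRef}
        }
        _, err := client.StorageV1alpha1().CSIStorageCapacities(namespace).Create(ctx, obj, metav1.CreateOptions{})
        return err
    }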

simplify that and for efficient access to objects, external-provisioner
sets some labels:
- `app.kubernetes.io/instance`: the CSI driver name
- `app.kubernetes.io/managed-by`: `external-provisioner` for central provisioning,
Member

Same here - why "app"?

Contributor Author

@pohly pohly Jan 28, 2021

This then becomes external-provisioner.k8s.io/managed-by=central or external-provisioner.k8s.io/managed-by=<hostname>.

Yes.

Enabling it in apiserver and CSI drivers will cause
`CSIStorageCapacity` objects to be created respectively updated. The
Member

nit: "created and respective updated" ?

Contributor Author

"created or updated"

centralized provisioning, the number of segments is probably low. For
distributed provisioning, each node where the driver runs
represents one segment. The rate at which objects change depends on how often
storage usage changes.
Member

It would be good to be more specific, e.g.:

  • distributed provisioning:
    #CSIStorageCapacity per each (node, storage-class) pair [is that right?]

For rate of changes, we can probably estimate it:

  • a creation whenever a node is created
  • an update whenever: some volume is created/resized/deleted (i.e. bounded by volume api calls) or storage underneath is changed (which probably requires some operator action)
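To make that estimate concrete with purely illustrative numbers (not taken from the KEP): a driver using distributed provisioning on a 1,000-node cluster with 2 storage classes would maintain roughly 1,000 × 2 = 2,000 CSIStorageCapacity objects, with one create per new (node, storage class) pair and updates only when the capacity reported for a segment actually changes.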

Contributor Author

Added.

provider?**

A CSI driver might have to query the storage backend more often to be
kept informed about available storage capacity.
Member

What are the plans? Can we ensure rate-limiting?

In distributed-mode, can we ensure that we won't overload the provider (we had problems like that in the past)?

Contributor Author

Distributed provisioning is meant for CSI drivers which manage local storage and don't have a backend cloud provider. If they had one, there would be no reason to use distributed provisioning.

Contributor Author

For central provisioning with a cloud provider we have rate limiting because objects that need to be updated or created go through a rate limiting workqueue.
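As a rough sketch of that pattern (using client-go's workqueue package; the work-item key format and the syncCapacity stub are made up for this example, this is not the actual external-provisioner code):

    package main

    import (
        "k8s.io/client-go/util/workqueue"
    )

    // syncCapacity would create or update the CSIStorageCapacity object for one
    // storage class/segment pair; stubbed out here.
    func syncCapacity(key string) error { return nil }

    func main() {
        // Work items go through a rate-limited queue, so a burst of capacity
        // changes does not turn into a burst of API calls against the apiserver.
        queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
        defer queue.ShutDown()

        queue.Add("fast-storage/node-1") // illustrative work item

        for {
            key, shutdown := queue.Get()
            if shutdown {
                return
            }
            if err := syncCapacity(key.(string)); err != nil {
                queue.AddRateLimited(key) // retry with exponential backoff
            } else {
                queue.Forget(key)
            }
            queue.Done(key)
        }
    }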

Member

Can you add this here (i.e. distributed is not meant to be used by cloud-provider stuff, and for central we have rate-limiting).

Contributor Author

Added

* **Will enabling / using this feature result in increasing size or count
of the existing API objects?**

No.
Member

CSIDriver, by adding a new field (it's fine - just make it explicit)

Contributor Author

Oops, forgot about that one because it didn't matter when thinking about scalability. Fixed.


No. Scheduling of pods without volumes is not expected to be
affected. Scheduling of pods with volumes is not covered by existing
SLIs/SLOs.
Member

Contributor Author

Thanks for the link, I rewrote the section.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 29, 2021
existing CSI sidecar metrics support with the
`csi_sidecar_operations_seconds_count` metric with labels
`driver_name=<csi driver name>`, `grpc_status_code=OK`,
`method_name=CreateVolume`.
Member

There should be metrics for 'GetCapacity' rpc calls too.

Contributor Author

True, but those might also come from the code which handles immediate binding when doing distributed provisioning, so merely having a non-zero count of GetCapacity calls does not imply that this feature here is in use.

Member

To determine if the feature is working, it may be useful to see if GetCapacity calls are returning errors?

I agree it's not particularly useful on its own, but in combination with other metrics, we can use it to get a more complete picture of the various failure points.

Member

I agree with Michelle. I think we should focus on this particular feature, not the whole "end-to-end volume provisioning".

I like the idea of GetCapacity RPC metrics. Plus the CSIStorageCapacity object existence.

Contributor Author

We also have metrics for GetCapacity calls and can distinguish between successful and failing ones. I've focused on that instead of volume creation.

* the same for adding or removing storage classes (both modes)
* updates when volumes are created/resized/deleted (thus bounded by
some other API calls)
* updates when capacity in the underlying storage system is changed
Member

or when a PV gets provisioned

Contributor Author

That's covered by "when volumes are created".

Member

I don't see "when volumes are created" in this list. Can we be more explicit here?

Member

It's mentioned above:
"when volumes are created/resized/deleted"

:)


Member

sorry I need new glasses :)

* **Will enabling / using this feature result in increasing size or count
of the existing API objects?**

One new field gets added to `CSIDriver`.
Member

specify that it's a bool (so we get a sense of the size)

Contributor Author

Added.

* **What are other known failure modes?**

* **What steps should be taken if SLOs are not being met to determine the problem?**
The API server might get overloaded.
Member

by CSIStorageCapacity updates?

Contributor Author

That's clearer, added.

- "@saad-ali"
- "@xing-yang"
approvers:
- "@saad-ali"
Member

Can you also add my name here

Contributor Author

Done. I've added you as reviewer and approver.

Contributor Author

@pohly pohly left a comment

Pushed another revision.

I'm not squashing commits yet, but probably should before merging.


@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 4787e42 to 0f8594e on January 29, 2021 19:44
Member

@wojtek-t wojtek-t left a comment

I would like to resolve some of the already addressed comments, but it seems I lost permissions to do that....

In external-provisioner 2.0.0, this was achieved by checking the owner
reference. Now that the owner reference is optional, it will get
achieved by ignoring all objects that don't have the
`csi.storage.k8s.io/managed-by` labels described above.
Member

Can you please also explicitly write down what happens if such an object without the managed-by label exists?

I'm assuming:

  • external-provisioner does not create its own
  • it also doesn't change the manually managed ones

But I would like to see that explicitly mentioned.

Contributor Author

Added.

reference. Now that the owner reference is optional, it will get
achieved by ignoring all objects that don't have the
`app.kubernetes.io/managed-by` labels described above.

Member

I don't know what changed - but I'm no longer able to resolve comments, even my own...

* **Can the feature be disabled once it has been enabled (i.e. can we rollback
the enablement)?**
Yes.

Member

Comment for Line 821 "Does enabling the feature change any default behavior".

Actually - I think the answer should be "it doesn't".
Because technically, without any changes from the user (just cluster operator), pods will start to be scheduled on other nodes.
[This is supposed to be change for the better to reduce problematic cases, but there is a change.]

Contributor Author

"Does enabling the feature change any default behavior" currently says "No changes for existing storage driver deployments." which is the same as "it doesn't".

Or did you mean to say that the answer should be "It does change default behavior"?

This depends on the interpretation of "the feature". My "no changes" was based on enabling "the feature" in Kubernetes, without enabling it in a CSI driver. Only once it is also enabled in a CSI driver, then there will be a change.

I'll put both in so that it is clearer.

the enablement)?**
Yes.

When disabling support in the API server, the new object type and
Member

nit: remove "and "

Contributor Author

fixed.


If the second phase fails because a driver malfunctions or overloads
the apiserver, then it can be rolled back and scheduling again happens
without using storage capacity information.
Member

For this particular example, enabling it just in kube-apiserver should not fail in any way. It's just a new type that isn't getting used. Sorry for my lack of imagination

Imagine that e.g. conversion functions are broken for the type and they are crashing kube-apiserver. Then if there was a preexisting object, kube-apiserver won't even start...

I would probably mention two things in this case:

  • unexpected kube-apiserver restarts
  • increased number of 5xx errors from CSIStorageCapacity-related calls

existing CSI sidecar metrics support with the
`csi_sidecar_operations_seconds_count` metric with labels
`driver_name=<csi driver name>`, `grpc_status_code=OK`,
`method_name=CreateVolume`.
Member

I agree with Michelle. I think we should focus on this particular feature, not the whole "end-to-end volume provisioning".

I like the idea of GetCapacity RPC metrics. Plus the CSIStorageCapacity object existence.

provisioned volumes per second and few total number of volumes overall
may be fine for some workloads (long-running applications which use
specialize storage like PMEM) while others may need a much higher rate
(adding a local LVM scratch volume to every pod in the system).
Member

That sounds fine to me. But then let's remove this paragraph :)

* the same for adding or removing storage classes (both modes)
* updates when volumes are created/resized/deleted (thus bounded by
some other API calls)
* updates when capacity in the underlying storage system is changed
Member

It's mentioned above:
"when volumes are created/resized/deleted"

:)


provider?**

A CSI driver might have to query the storage backend more often to be
kept informed about available storage capacity.
Member

Can you add this here (i.e. distributed is not meant to be used by cloud-provider stuff, and for central we have rate-limiting).

Contributor Author

@pohly pohly left a comment

New commit pushed.

when the cluster is not really running out of storage capacity.

Another is a degradation in apiserver metrics (increased CPU or memory
consumption, increased latency).
Contributor Author

Do we have documentation for those that I can look up? I don't have metrics gathering enabled for my local test cluster, so I cannot look this up easily.


If the second phase fails because a driver malfunctions or overloads
the apiserver, then it can be rolled back and scheduling again happens
without using storage capacity information.
Contributor Author

Added

The feature is used if all of the following points apply:
- `CSIDriver.spec.storageCapacity` is true for a CSI driver.
- There are storage classes for that driver.
- Volumes are using those storage classes.
Contributor Author

That would have to go into kube-scheduler. What should that metric count? Number of times that a storage capacity was the deciding factor between using or not using a node?

There are no metrics at all that expose why nodes are chosen as far as I know - or at least there are none in the part that is involved with volumes.

IMHO this needs more thought; a new "enhance metrics for volume scheduling" KEP might be a better place for it.

provider?**

A CSI driver might have to query the storage backend more often to be
kept informed about available storage capacity.
Contributor Author

Added

Member

@wojtek-t wojtek-t left a comment

Few more comments.


When disabling support in the API server, the new object type and
thus all objects will disappear together with the new flag in
When disabling support in the API server, the new object type
Member

We need to distinguish two things:

  • version rollback - in which case the type will no longer be served (but please be explicit it's about version rollback)
  • feature disablement (via feature gate) - in this case the type will still be visible

Contributor Author

Perhaps it got lost in the GitHub noise, but I replied to that already here: https://github.com/kubernetes/enhancements/pull/1829/files/b039a48bffcab54f98975f4dfdb2df7e0f9944ad#r566714065

I said:

Depending on how thoroughly the feature gets disabled, the API is not there anymore. It will be gone when any of the following happens for the apiserver:

  • the v1alpha1 API group gets disabled (not relevant anymore once it is beta)
  • the feature is turned off for the apiserver
  • the apiserver is rolled back to a version without the API

In this section, I already tried to discuss disabling the feature in the three places where it could be disabled separately (apiserver, scheduler, CSI driver). I'm not sure what to add?

Just to be very clear, in the second case that you mentioned ("feature disablement (via feature gate)") the type will also be gone and not visible. So there is really no difference between that case and version rollback.

Member

Just to be very clear, in the second case that you mentioned ("feature disablement (via feature gate)") the type will also be gone and not visible.

It shouldn't - we generally don't hide type registration behind feature gate.

Contributor Author

Somehow this is common for storage types. For example: https://github.com/kubernetes/kubernetes/blob/73dd5c840662bb066a146d0871216333181f4b64/pkg/registry/storage/rest/storage_storage.go#L98-L105

@wojtek-t and I were discussing this on chat and weren't sure whether this is a good or bad thing. He wanted to clarify with Jordan.

In the meantime I have changed the first sentence to make it clear that registration is currently depending on the feature gate.

Member

I just briefly chatted with Jordan about this offline.

Given that APIs are controlled via --runtime-config flag anyway, then having that controlled via feature gate doesn't bring any value. So the consensus we got to is that you basically should remove that feature gate from there.

[And then it will evolve to situation I was suggesting above.]

Please also add to the KEP somewhere that feature-gate will be removed from controlling API registration.

Contributor Author

Given that APIs are controlled via --runtime-config flag anyway, then having that controlled via feature gate doesn't bring any value. So the consensus we got to is that you basically should remove that feature gate from there.

That argument currently applies because it is in the v1alpha1 group. But once we move it to v1beta1 in 1.20, it will become impossible to remove just that one type.

I can see pros and cons for both approaches: when the apiserver can disable the type, the cluster can be recovered from a malfunctioning CSI driver quickly by disabling the type and later fixing the CSI driver. The downside of disabling the type is that it is impossible to delete stale objects.

Overall I prefer to not have the feature gate check.

Please also add to the KEP somewhere that feature-gate will be removed from controlling API registration.

Done.

when the cluster is not really running out of storage capacity.

Another is a degradation in apiserver metrics (increased CPU or memory
consumption, increased latency).
Member

The feature is used if all of the following points apply:
- `CSIDriver.spec.storageCapacity` is true for a CSI driver.
- There are storage classes for that driver.
- Volumes are using those storage classes.
Member

OK - let me take a step back here.

Basically - this isn't a feature of a "running workload" per se. It only affects which node (if any) it will get scheduled on.

So let's maybe add something like that at the beginning.

As for the second point, I think the scheduler metric is exactly what we want.
I don't care that much if it was a deciding factor or not. But I would like to see if we were attempting to use that information.

As a first step in this KEP, I was thinking about a metric like: "I (scheduler) wanted to check CSIStorageCapacity object and it didn't exist for that Node". But now I realize that it doesn't really have to be an error, because there may not be that particular type of storage on that node, right?

So let's maybe make one more step back and think about producing CSIStorageCapacity object. In L662 you propose a central-provisioner that will be GC-ing orphaned objects. Can we extend it a bit so that:

  • it will be exposing a metric for the number of desired CSIStorageCapacity objects split by type
  • ?

provisioned volumes per second and few total number of volumes overall
may be fine for some workloads (long-running applications which use
specialize storage like PMEM) while others may need a much higher rate
(adding a local LVM scratch volume to every pod in the system).
Member

ping - let's remove this paragraph

Contributor Author

@pohly pohly left a comment

New commit pushed.

provisioned volumes per second and few total number of volumes overall
may be fine for some workloads (long-running applications which use
specialize storage like PMEM) while others may need a much higher rate
(adding a local LVM scratch volume to every pod in the system).
Contributor Author

Done.

The feature is used if all of the following points apply:
- `CSIDriver.spec.storageCapacity` is true for a CSI driver.
- There are storage classes for that driver.
- Volumes are using those storage classes.
Contributor Author

Done. I left out more advanced metrics for scheduling because this needs more overall planning.

when the cluster is not really running out of storage capacity.

Another is a degradation in apiserver metrics (increased CPU or memory
consumption, increased latency).
Contributor Author

Added that name as a link.



achieved by ignoring all objects that don't have the
`csi.storage.k8s.io/managed-by` labels described above.

If such objects are created manually, then it is the responsibility of
Member

We made CSIStorageCapacity objects namespaced to handle the ownerref, right? Does that mean anyone with namespace permissions could interfere with the scheduling of other pods? We should double check with @kubernetes/sig-auth-pr-reviews on this security model.

Contributor Author

This is a valid concern, but we've discussed this already. Permission to create CSIStorageCapacity objects is not normally granted to users in a namespace, so they cannot interfere with pod scheduling - neither of their own nor of pods in other namespaces.

Someone who can create CSIStorageCapacity objects has cluster admin permissions and while they might make mistakes, this is not a privilege escalation or denial of service attack.

Member

Thanks, confirmed with Jordan that objects are opt-in to the default namespace editor policy


For efficiency reasons, external-provisioner instances watch
`CSIStorageCapacity` objects in their namespace and filter already in
the apiserver by the `app.kubernetes.io/instance` and
Member

update these labels

Contributor Author

Done
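For illustration, the server-side filtering described in that section can be set up roughly like this (a Go sketch, not the actual external-provisioner code; the resync interval is arbitrary). Informers created from such a factory only ever receive the CSIStorageCapacity objects that this instance manages.

    package capacity

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
    )

    // newCapacityInformerFactory builds an informer factory whose informers only
    // see objects in this instance's namespace that carry its own drivername and
    // managed-by labels, so the apiserver never sends it objects belonging to
    // other drivers or instances.
    func newCapacityInformerFactory(client kubernetes.Interface, namespace, driverName, instance string) informers.SharedInformerFactory {
        selector := "csi.storage.k8s.io/drivername=" + driverName +
            ",csi.storage.k8s.io/managed-by=" + instance
        return informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
            informers.WithNamespace(namespace),
            informers.WithTweakListOptions(func(options *metav1.ListOptions) {
                options.LabelSelector = selector
            }),
        )
    }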

following metric data that will be provided by external-provisioner instances:
- total number of `CSIStorageCapacity` objects that the external-provisioner
is currently meant to manage for the driver
- actual number of such objects that currently exist
Member

do we need to determine what count of these objects are stale so that it doesn't look like we are 100% covered?

Contributor Author

With "stale" you mean the topology/storage class pair, not the "freshness" of the capacity value, right?

That's doable. Instead of that one item with "that currently exist" I now have:

  • number of such objects that currently exist and can be kept because they have a topology/storage class pair that is still valid
  • number of such objects that currently exist and need to be deleted because they have an outdated topology/storage class pair
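A minimal sketch of how those counts could be exposed as gauges (the metric names here are made up for illustration, not the ones external-provisioner ended up using):

    package capacitymetrics

    import "github.com/prometheus/client_golang/prometheus"

    var (
        desiredObjects = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "csistoragecapacities_desired_total",
            Help: "Number of CSIStorageCapacity objects this instance is meant to manage.",
        })
        currentObjects = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "csistoragecapacities_current_total",
            Help: "Existing objects with a still-valid topology/storage class pair.",
        })
        obsoleteObjects = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "csistoragecapacities_obsolete_total",
            Help: "Existing objects with an outdated topology/storage class pair, pending deletion.",
        })
    )

    func init() {
        prometheus.MustRegister(desiredObjects, currentObjects, obsoleteObjects)
    }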

* the same for adding or removing storage classes (both modes)
* updates when volumes are created/resized/deleted (thus bounded by
some other API calls)
* updates when capacity in the underlying storage system is changed
Member

sorry I need new glasses :)

@k8s-ci-robot k8s-ci-robot added the sig/auth Categorizes an issue or PR as relevant to SIG Auth. label Feb 2, 2021
This marks items as completed which are (or soon will be) done for
beta. PRR review will happen as part of merging this.
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 09e8c12 to f70ed9d on February 3, 2021 09:51
@pohly
Contributor Author

pohly commented Feb 3, 2021

I've addressed @msau42's latest comments and squashed everything into a single commit.

be treated with exponential backoff just like other communication
issues with the API server.
The new flag in `CSIDriver` will always be removed when disabling the
feature gate or when rolling back to a release older than 1.19. In that
Member

re feature-gate disablement:
If the field is set and the FG was disabled, this shouldn't automatically delete the field - we should only drop the field if someone is trying to set it, but not if that already exists (unless someone explicitly removes it).

It's basically this pattern of "if FG disabled and field not in use", e.g.:
https://github.com/kubernetes/kubernetes/blob/074a5177209c3605daf11cfe6f477fddce0d55c0/pkg/api/pod/util.go#L450

Contributor Author

Again, this is not how any of the other fields in storage objects work. That's what was agreed upon with the API reviewers. I am not sure why it is being done, but I am reluctant to change it without a wider discussion.

Can we merge the KEP as-is and I bring it up when doing the API change?

Member

One of the reasons for having PRR is also to achieve predictability and consistency across the system, especially for Beta and GA features.
If we're doing something differently than anywhere else, I would like to understand the clear reasons for that and have them documented. If you write "I am not sure why it is being done", I would really like to hear the API approvers' voice now.

@liggitt - can you please chime in?

Member

Allowing existing data to be preserved on update is required. This ensures that an n-1 server where the feature gate is not yet enabled by default will not cause data loss in a skewed multi-server cluster during upgrade. Please update that detail here to avoid confusion before merge.

Member

this is not how any of the other fields in storage objects work

Do you have pointers to those? If we are dropping field values on update without regard for existing persisted data, we'll need to open issues to resolve that.

For reference, the ephemeral volume field is following the expected pattern correctly:

https://github.com/kubernetes/kubernetes/blob/074a5177209c3605daf11cfe6f477fddce0d55c0/pkg/api/pod/util.go#L568-L576

Member

I disagree - the scheduler should revert by just switching off the feature gate.

And this creates a significant issue:

  • I create an object with a field set
  • FG gets disabled (and that silently drops the field)
  • FG gets reenabled, and the object I created with the field set no longer has it set.

I don't know why this wasn't caught in the review, but that sounds like an omission. I don't think it justifies leaving things as they are for longer. We should fix that (and do the same thing we're doing everywhere else).

Contributor Author

Do you have pointers to those? If we are dropping field values on update without regard for existing persisted data, we'll need to open issues to resolve that.

This is the commit which added the additional field: kubernetes/kubernetes@1089954#diff-1918b63c53bdbabb45108a29e45667799fd5674bc5e04b1dd01c24131a79308f

I was doing the same as for VolumeLifecycleModes. If I read the code in pkg/registry/storage/csidriver/strategy.go right, updating a CSIDriver object while the feature is disabled preserves the field. I think that addresses the concern about data loss.

But should that field be sent to clients when they ask for a v1 CSIDriver object and the apiserver has the feature disabled? My understanding is that @wojtek-t wants it to be sent whereas the current implementation of that and other fields doesn't do it.

Contributor Author

... or at least I think it doesn't. Let me double-check that.

Contributor Author

I was wrong, sorry for the confusion. "kubectl get" doesn't show the field after disabling the feature in the apiserver, but it's still there in the actual object. "kubectl get -o yaml" shows it.

That means I need to update the KEP to reflect the implementation, but the implementation itself should be okay.

Contributor Author

I've pushed a commit with the KEP change. Let's keep it as a separate commit to document for eternity how confused I was about this... 😅

The implementation in the apiserver already preserves the field and
delivers it to clients, which is the exact opposite of how it was
described in the KEP.
- "@wojtek-t"
see-also:
- "https://docs.google.com/document/d/1WtX2lRJjZ03RBdzQIZY3IOvmoYiF5JxDX35-SsCIAfg"
latest-milestone: "v1.19"
Contributor

Hi there, 1.21 enhancements lead here.
Could you also make a change to the latest milestone?

Suggested change:
- latest-milestone: "v1.19"
+ latest-milestone: "v1.21"

Contributor Author

Done.

@msau42
Member

msau42 commented Feb 3, 2021

/lgtm
/approve

Thanks for working through everything and making this very comprehensive!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2021
Member

@wojtek-t wojtek-t left a comment

Two last very minor comments - LGTM once fixed.

be treated with exponential backoff just like other communication
issues with the API server.
The new flag in `CSIDriver` will be preserved when disabling the
feature gate or when rolling back to a release older than 1.19. kube-scheduler
Member

on rollback it won't be preserved, because the CSIDriver type will not have that field in the older release.

So it's only preserved on feature gate disablement.

Contributor Author

Fixed.
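
As a rough sketch of what this means on the consumer side: switching the feature gate off makes a preserved `StorageCapacity` field inert. The function below is only an illustration under that assumption; the actual check in the kube-scheduler volumebinding plugin is structured differently.

```go
package volumebinding

import (
	storagev1 "k8s.io/api/storage/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// capacityCheckEnabled reports whether storage capacity should be taken
// into account for volumes of this driver. With the feature gate disabled,
// a StorageCapacity value that is still stored in the CSIDriver object is
// simply ignored, so reverting the feature is just a gate flip.
func capacityCheckEnabled(driver *storagev1.CSIDriver) bool {
	if !utilfeature.DefaultFeatureGate.Enabled(features.CSIStorageCapacity) {
		return false
	}
	return driver.Spec.StorageCapacity != nil && *driver.Spec.StorageCapacity
}
```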

they have a topology/storage class pair that is still valid
- number of such objects that currently exist and need to be deleted
because they have an outdated topology/storage class pair
- work queue length for creating, updating or deleting objects
Member

Please add a TODO to fill in the explicit metric names here once they get implemented.
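
To make that TODO concrete, here is a hedged sketch of what the registration in the external-provisioner could look like. All metric names below are invented placeholders, not the names that were eventually implemented.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric names; the KEP deliberately leaves the final names
// to the implementation, so these are only placeholders.
var (
	desiredObjects = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "csistoragecapacities_desired_current",
		Help: "Number of CSIStorageCapacity objects the external-provisioner is meant to manage for the driver.",
	})
	currentObjects = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "csistoragecapacities_managed_current",
		Help: "Number of existing CSIStorageCapacity objects with a still-valid topology/storage class pair.",
	})
	obsoleteObjects = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "csistoragecapacities_obsolete_current",
		Help: "Number of existing CSIStorageCapacity objects that need to be deleted because their topology/storage class pair is outdated.",
	})
	workQueueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "csistoragecapacities_workqueue_depth",
		Help: "Work queue length for creating, updating or deleting CSIStorageCapacity objects.",
	})
)

func init() {
	prometheus.MustRegister(desiredObjects, currentObjects, obsoleteObjects, workQueueDepth)
}
```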

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2021
@wojtek-t
Member

wojtek-t commented Feb 4, 2021

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0a495c8 into kubernetes:master Feb 4, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 4, 2021
@pohly pohly mentioned this pull request Feb 9, 2021
4 tasks