WIP: initial provisioning rollback KEP #1703
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here.
> picking some other node, the library can treat unused volumes as "not provisioned yet" and set a new selected node for them.
>
> The external-provisioner then needs to de-provision those PVCs which are:
How do we signal to external-provisioner to delete these PVs? Normally it's signaled when a user deletes a PVC object, but I don't think we can delete the PVC object here.
When the tentatively provisioned volume is no longer accessible from the currently selected node.
> ## Proposal
>
> The proposal is to mark provisioned PVCs as "unused" via a label until
Who sets this label to "used"? How do we ensure there aren't races between the time we check the label and when we go to delete the PV?
Perhaps we can solve this entirely in the external-provisioner: in addition to the "selected node" label that it already gets now, it also needs to know which pod(s) are triggering the provisioning so that it can group several PVCs together.
Then we need an intermediate phase where external-provisioner creates PVs, but doesn't finally bind them to the PVCs yet. That's the "unused" state. Then when all volumes have been created, external-provisioner finishes the binding.
But it will still be tricky to update several objects together atomically 😢
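To make this intermediate state concrete, here is a minimal sketch of what such a tentatively provisioned, not-yet-bound PV could look like, built with the core API types in Go. The `example.com/provisioned-node` label name is invented for illustration only; the KEP does not define it.

```go
package rollbacksketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newTentativePV sketches the PV that the external-provisioner could create
// after CSI CreateVolume succeeds, but before it is bound to any PVC.
// The label name is hypothetical and used only for illustration.
func newTentativePV(pvName, driver, volumeHandle, node string) *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{
			Name: pvName,
			Labels: map[string]string{
				// Node the volume was actually created for; compared later
				// against the node that scheduling finally selects.
				"example.com/provisioned-node": node,
			},
		},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("10Gi"),
			},
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				CSI: &corev1.CSIPersistentVolumeSource{
					Driver:       driver,
					VolumeHandle: volumeHandle,
				},
			},
			// ClaimRef is deliberately left nil: this is the "unused" state.
			// Binding to the PVC would only happen once all volumes of the
			// pod have been provisioned.
		},
	}
}
```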
Scratch that: there is more than one external-provisioner involved, so this is the wrong place.
A better place is the volume scheduling code. That's where kube-scheduler checks whether a pod can move ahead; we can block that until all volumes have been provisioned. Before giving its okay, that code can also remove the "unused" flag.
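A rough sketch of that gate is below. It assumes a hypothetical `example.com/unused` label on the PVCs (the KEP text only says "a label"; the name is invented) and uses `Spec.VolumeName` being set as a stand-in for "provisioned"; the real volume scheduling code in kube-scheduler is considerably more involved.

```go
package rollbacksketch

import (
	corev1 "k8s.io/api/core/v1"
)

// unusedLabel is a hypothetical label name; the KEP only says "a label".
const unusedLabel = "example.com/unused"

// allVolumesProvisioned reports whether every PVC of the pod has a volume.
// Only then should the scheduler let the pod move ahead.
func allVolumesProvisioned(pvcs []*corev1.PersistentVolumeClaim) bool {
	for _, pvc := range pvcs {
		if pvc.Spec.VolumeName == "" { // no PV created/matched yet
			return false
		}
	}
	return true
}

// markVolumesUsed clears the "unused" flag so that the rollback logic will
// no longer consider these PVCs for deletion. In a real implementation each
// removal is a separate Update/Patch call, which is exactly the "not atomic"
// concern raised in the comment above.
func markVolumesUsed(pvcs []*corev1.PersistentVolumeClaim) {
	for _, pvc := range pvcs {
		delete(pvc.Labels, unusedLabel)
	}
}
```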
> The proposal is to mark provisioned PVCs as "unused" via a label until a pod actually starts to use them. Another label stores the node that the PVC was provisioned for.
We already have "selected-node" on the PVC.
Yes, but that's the label that describes the desired state and when we reschedule, that will change.
My intention was to add another label "provisioned-node" and then compare that against "selected-node" to detect when a volume no longer meets the requirements. But that may be too strict. A better solution would be to use the accessible topology of the volume to determine whether it needs to be recreated: if the currently selected node has no access to the volume, then (and only then) do we delete it and try again.
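A sketch of that accessibility check is below, using the PV's required node affinity as the source of truth. Only the `In` operator is handled to keep it short; a real implementation would cover the full operator set (Kubernetes already has helpers for that).

```go
package rollbacksketch

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeCanAccess reports whether the given node satisfies the PV's required
// node affinity, i.e. whether the tentatively provisioned volume is still
// usable after the pod has been rescheduled to this node. If it returns
// false, the volume is a candidate for deletion and re-provisioning.
func nodeCanAccess(pv *corev1.PersistentVolume, node *corev1.Node) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return true // no topology constraints, accessible from everywhere
	}
	// Terms are ORed: one fully matching term is enough.
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		if termMatches(term, node.Labels) {
			return true
		}
	}
	return false
}

// termMatches handles only the "In" operator to keep the sketch short.
func termMatches(term corev1.NodeSelectorTerm, labels map[string]string) bool {
	for _, req := range term.MatchExpressions {
		if req.Operator != corev1.NodeSelectorOpIn {
			return false // unsupported in this sketch
		}
		value, ok := labels[req.Key]
		if !ok {
			return false
		}
		found := false
		for _, v := range req.Values {
			if v == value {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```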
> - Only CSI drivers using an up-to-date external-provisioner will be supported.
>
> ## Proposal
I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically to avoid races that could result in deletion of a volume in use.
So something that can prevent a Pod from using a volume before all the volumes are successfully provisioned. Maybe something like taints can help us here.
> I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically

As multiple different storage systems are involved, it will have to be Kubernetes which treats the set of volumes as something that is kept unused until all volumes are ready.

> avoid races that could result in deletion of a volume in use

I don't think the current proposal would have led to deleting a volume that is in use, because once it is used, it won't be deleted anymore. But another pod starting to use one of the volumes is a potential issue, because that can then prevent rescheduling of the original pod which triggered provisioning.

> So something that can prevent a Pod from using a volume before all the volumes are successfully provisioned. Maybe something like taints can help us here.

Yes, something like that or another label.
> I would prefer having some stronger guarantee on provisioning and rolling back a group of volumes atomically
>
> As multiple different storage systems are involved, it will have to be Kubernetes which treats the set of volumes as something that is kept unused until all volumes are ready.

As for static binding, we need to reserve multiple volumes for later use too (kubernetes/kubernetes#73263). It's a special case in which the volumes are already provisioned and don't need provisioning rollback.
Meeting Notes

Do we try to roll back PV/PVC binding or bind multiple PV/PVC objects together? That would solve this problem too. If possible, we can solve the static binding and dynamic provisioning problems together.
I understand this is a WIP but wanted to note it is missing a kep.yaml
> ## Alternatives
>
> Instead of allocating volumes, space could be merely reserved in
In Bare-metal CSI we decided to go with the reservation approach.
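For contrast, the reservation approach could be modeled as a small record that sets capacity aside on a node without creating a real volume; the type below is purely hypothetical and not an existing Kubernetes API.

```go
package rollbacksketch

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// StorageReservation is a hypothetical record of capacity set aside on a
// node for a pending claim, without provisioning anything in the backend.
// Rolling back simply means deleting the record: no storage system calls,
// and no risk of deleting data that is already in use.
type StorageReservation struct {
	PVCName      string            // claim the reservation belongs to
	StorageClass string            // class the volume would eventually use
	Node         string            // node the capacity was reserved on
	Capacity     resource.Quantity // amount of space set aside
}
```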
Is this being actively reviewed? How does it play with GA graduation?

This is a problem for all volumes with "wait for first consumer" provisioning mode, whether they use storage capacity tracking or not. Interest in solving this seems to be low, so this KEP has not progressed beyond the initial "we could solve it like this" phase. That's probably because it doesn't really affect many users.

I'm very interested in having this issue addressed.

@mishoyama: Is that because you have pods with multiple volumes and these volumes are dynamically provisioned with "wait for first consumer"? Can you describe your use case a bit? For example, how long are pods running?

@pohly yes, we have multiple volumes per pod with "wait for first consumer" binding mode. This is a storage system, and the pod lifecycle is not limited.

What's the advantage of having multiple volumes per pod instead of just one? Do they come from different storage providers? Are you perhaps interested in working on this? The SIG Storage community meeting would be a good place to start discussing who could work on this and when.

We might have up to 100 disks attached to the physical node. Handling IO requests in a single pod allows us to manage related resources more effectively. At the moment we use a scheduler extender to solve the storage capacity problem. But anyway, I'm interested in working on this.

/close
I don't have time to work on this anytime soon. If someone wants to pick it up, please reach out to SIG Storage.
@pohly: Closed this PR.
Currently pods can get stuck when they use multiple volumes with late binding. The case with a single volume is handled by rescheduling the pod, but that can fail when some volumes already got provisioned and then provisioning of the remaining ones fails.
The proposal is to roll back the provisioning in that case to unblock rescheduling.
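A condensed sketch of that rollback decision, reusing the hypothetical `unusedLabel` and `nodeCanAccess` helpers from the sketches in the discussion above; only the `volume.kubernetes.io/selected-node` annotation is a real, existing name.

```go
package rollbacksketch

import (
	corev1 "k8s.io/api/core/v1"
)

// selectedNodeAnnotation is set on late-binding PVCs by the scheduler and
// read by external-provisioner; this annotation does exist today.
const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

// shouldRollBack reports whether a tentatively provisioned volume should be
// deleted so that the pod can be rescheduled: it is still flagged as unused
// and the node that scheduling now wants cannot access it. selectedNode is
// the Node named in the selected-node annotation, looked up by the caller.
func shouldRollBack(pvc *corev1.PersistentVolumeClaim, pv *corev1.PersistentVolume, selectedNode *corev1.Node) bool {
	if pvc.Labels[unusedLabel] != "true" {
		return false // already in use (or never flagged): never delete it
	}
	if pvc.Annotations[selectedNodeAnnotation] == "" {
		return false // not a late-binding claim, nothing to roll back
	}
	// Only volumes that the newly selected node cannot reach need to be
	// deleted and re-provisioned.
	return !nodeCanAccess(pv, selectedNode)
}
```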