-
Notifications
You must be signed in to change notification settings - Fork 334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distributed provisioning #524
distributed provisioning #524
Conversation
Before I proceed with this feature, I'd like to get feedback whether the approach of letting the "selected node" annotation be set by the external-provisioner is acceptable for volumes with immediate binding. I have tested this on a 50 node cluster and it worked nicely. My argument why it is compatible with the current design is that the state of the PVC is indistinguishable from a PVC that was assigned to a node while the storage class used late binding. This can happen and if the current code cannot handle this, then it needs to be fixed. However, I don't see why it should fail (famous last words...). @msau42, @jsafrane: what do you think about the UML diagram? That commit itself is unrelated to this PR, I could also submit it separately. I added it here because it makes it easier to understand how the node-local deployment fits into the overall architecture. |
Hmm, looks like my recent rebasing broke the unit tests. Will fix that... |
06d26a9
to
b89078f
Compare
cc @jingxu97 |
pkg/controller/controller.go
Outdated
klog.V(5).Infof("will try to become owner of PVC %s/%s in %s, attempt #%d", claim.Namespace, claim.Name, delay, attempts) | ||
sleep, cancel := context.WithTimeout(ctx, delay) | ||
defer cancel() | ||
ticker := time.NewTicker(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider using claimInformer to get the events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of polling? Yes, I had been thinking about that, but couldn't come up with a good way to hook up the informer callback with this function here.
Callbacks can only be added, but not removed. So I would have to install some generic callback that gets told at runtime who is interested in which PVC - felt rather complicated for a rather small improvement in latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the polling period to be so frequent considering that the delays are in seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not. I've replaced it with 'base delay / 100', so now it scales together with the base delay and (for the default) is at 300ms.
I don't want to make it too large either because then one go routine would be blocked for potentially quite a while before it becomes available for some other PVC work item. With many provisioners running in parallel the actual smallest delay can be smaller than the maximum of 30 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there will be some changes necessary for scalability, but I'm fine with merging this and fix them later after we get some numbers.
One thing troubles me more: PV deletion.
What should happen when a node is removed from the cluster? All its PVs can't be deleted - there is no provisioner that would do that. Is it just a documentation issue? Or shall there be a cluster-level provisioner that deletes PVs from missing nodes? What if the node rejoins the cluster?
pkg/controller/controller.go
Outdated
// themselves. | ||
// | ||
// With a value of 10 seconds, when creating 5000 | ||
// volumes on a cluster with 50 instances only ~300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure, nr. of conflicts does not grow linearly with nr. of nodes, right? 5000 nodes != 30 000 conflicts, but much more. How much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to test that.
pkg/controller/controller.go
Outdated
// It might make sense to make this value configurable so | ||
// that CSI driver deployments can tweak it depending | ||
// on their needs. | ||
baseDelay := 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This IMO needs to be configurable or grow with the cluster size (as a separate PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got me thinking. The current implementation does exponential backoff per PVC. This is almost never going to make a difference in practice because after the initial conflict, the PVC has a selected node and there's no need to do another update attempt for it. The only exception is when provisioning on that selected node fails and there has to be a global retry - that should be rare.
In this approach, we can't make this grow with cluster size either: we cannot assume that the CSI driver runs on all nodes. It might manage specialized resources that are only available on some nodes. With deployment handled by Kubernetes, the individual instances have no information which other instances exist.
I had considered making this independent of the PVC. The issue with that is the question of when to reduce the delay and how much. When the "winning" instance resets it to zero, then it is also likely to win for the next PVC until its space is exhausted and it stops trying to own a PVC (because of the capacity check). I preferred even spreading across the cluster. But perhaps filling up one node is also fine: if users want a certain policy for volume scheduling, they should use late binding and influence volume placement indirectly through that.
I think I will give that a try...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ended up working worse. The main problem was that the instance with the smallest backoff delay ends up grabbing more volumes for provisioning than what it has space for, so once it is the owner it fails for at least some of them. Those then are made available to other instances, but the same pattern just repeats.
It could be made to work if the provisioner could model remaining capacity considering how many volumes already were assigned, but that is driver specific and complex.
This is more of a problem under load (= many pending PVCs) and a full cluster (= capacity exhausted on some nodes), but still, when volumes are spread evenly purely based on statistics, it worked better.
So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'll stick with the current approach and perhaps tweak it a bit: when tracking the number of collisions per number of PVCs, it should be possible to scale the base delay.
That has the same problem. When using a running average of the failure rate as indicator for "delay is too low", I saw increasing numbers of failed provisioning attempts because of full nodes.
TLDR: cluster admins will have to adjust the delay for the expected cluster size and number of volumes.
I just had a call today regarding that: I will get access to a 100 node cluster from SIG-Scalability that I can run scalability tests on. We don't need to merge for that, I'll simply use a custom image. If it works on 100 nodes, I'll try to run with more.
I think we need to document how to deal with this: if the admin is sure that the node is gone, they will have to force-delete orphan PVs from the node, including removal of finalizers. The same problem occurs when the CSI driver handles communication with nodes internally - if it handles the issue. In PMEM-CSI we have an open bug around that, just that we erred towards allowing PV removal when uncertain, with the result that we sometimes don't delete a volume on a node.
That's exactly the problem. Without further information, Kubernetes cannot know whether the node is truly gone. I'm hoping that the work on better node shutdown handling may lead to a better solution, but that'll take time. |
89655f7
to
36e96f4
Compare
/assign @verult |
aeb9002
to
d9ae323
Compare
I've pushed updated commits that reflect the current status. I included a temporary commit for using kubernetes-sigs/sig-storage-lib-external-provisioner#100 but there is some issue with the replace statement. @verult IMHO it makes sense to review anyway. Let's do a new sig-storage-lib-external-provisioner release with that pending PR and then I can use that cleanly here. |
0538807
to
14e474f
Compare
I've added one commit with documentation for that.
I took out that commit to make this PR ready for merging now.
|
14e474f
to
1aaef5f
Compare
/retest |
1 similar comment
/retest |
/rete |
/retest |
Immediate binding is not recommended, but is needed for the sake of feature parity. With immediate binding also supported, the code no longer just passively checks the selected node, so a different name seems more appropriate. Besides implementing immediate binding support, the original implementation also gets fixed: DeleteVolume was called by all external-provisioner instances. On most nodes that then looked like the volume had been removed already and the PV got removed before the actual node had a chance to finish the volume deletion.
When deploying external-provisioner on each node, the topology information that it needs is most likely just the values reported by the local CSI driver instance. We can avoid the extra work for watching Node and CSINode in that case.
This is intentionally a separate section because although it applies to distributed provisioning, the same problem also arises when a CSI driver handles provisioning of local volumes differently.
Producing CSIStorageCapacity objects for a node uses the same code, the only difference is that there is just a single topology segment that the external-provisioner needs to iterate over. Also, that segment is fixed. Therefore we can use the simple mock informer that previously was only used for testing.
3e7ea65
to
b9301ee
Compare
Rebased to resolve the conflict in pkg/controller/controller.go. |
README.md
Outdated
|
||
* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default. | ||
|
||
* `--node-deployment-immediate-binding`: Determines whether immediate binding is supported when deployed on each node. Enabled by default, use `--node-deployment-immediate-binding=false` if not desired. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to have this option. While we should encourage everyone to use delayed binding, immediate binding is still an available option that users can set independently of a driver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I saw this before seeing the explanation below for a custom policy. Maybe worth mentioning that you should set it to false if you want to implement your own custom algorithm for immediate binding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Added "Disabling it may be useful for example when a custom controller will select nodes for PVCs."
README.md
Outdated
|
||
* `--node-deployment-max-delay`: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds. | ||
|
||
* `--local-topology`: Instead of watching Node and CSINode objects, use only the topology provided by the CSI driver. Only valid in combination with `--node-deployment`. Disabled by default, but recommended for drivers which have a single topology key with different values for each node (i.e. local volumes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see a reason why someone would want local-topology false and node-deployment? I would prefer not supporting non-local topology in node mode until a use case comes up that needs it since per-node informers can be expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see a reason why someone would want local-topology false and node-deployment?
I could not quite convince myself that there really is no situation where a CSI driver wants local deployment and has more complex topology. I don't have a specific example, it is just a feeling.
I don't mind removing the command line option and always do "local topology".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that would simplify things initially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I kept the removal in a separate commit, in case that we want to bring it back.
@@ -84,6 +84,18 @@ See the [storage capacity section](#capacity-support) below for details. | |||
|
|||
* `--capacity-for-immediate-binding <bool>`: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, maybe useful for other consumers or for debugging. Defaults to `false`. | |||
|
|||
##### Distributed provisioning | |||
|
|||
* `--node-deployment`: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be consolidated with --capacity-controller-deployment-mode=local
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need some way to to do local deployment without enabling storage capacity because that might not be supported by the cluster or not desired, so we cannot have just one flag that enables both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can avoid flag skew issues like --node-deployment=true
and --capacity-controller-deployment-mode=central
.
Can change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can change the capacity flag to be a boolean, and then determine which mode to use based on node deployment?
In retrospect that would be nicer, but it is a bit late to change the semantic of the --capacity-controller-deployment-mode
that way - we would have to do a major update of external-provisioner for that because it is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an alpha feature, I think we can make breaking changes to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So --enable-capacity=true/false
instead of --capacity-controller-deployment-mode=central
and the actual mode of operation (central vs. local) determined by --node-deployment
?
Works for me, I just won't get to it today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me! Do you think enable-capacity will always be a permanent flag, or would it fit a feature gate better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a permanent flag is better. The feature causes additional work and might not be that relevant for some drivers (like those where all volumes are available everywhere), so being able to keep it turned off will remain useful.
with `--node-deployment-max-delay` anyway, to avoid very long delays | ||
when something went wrong repeatedly. | ||
|
||
During scale testing with 100 external-provisioner instances, a base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we point to links to results anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at the moment. I intend to write up my experience with scale testing and will publish it as a .md file in perf-tests/log
but haven't gotten around to it yet. I can link to that once it is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a chance to run on 1000 nodes. Distributed provisioning was almost as fast as central provisioning in terms of volumes/second for immediate binding with no overload of the apiserver - see report in kubernetes/perf-tests#1676
was the same as with a delay of 10 seconds. The worst-case latency per | ||
volume was probably higher, but that wasn't measured. | ||
|
||
Note that the QPS settings of kube-controller-manager and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point to bug tracking api fairness effort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added " Those
settings will eventually get replaced with flow control in the API
server
itself."
If there still was a PVC which was bound to that PV, it then will be | ||
moved to phase "Lost". It has to be deleted and re-created if still | ||
needed because no new volume will be created for it. Editing the PVC | ||
to revert it to phase "Unbound" is not allowed by the Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to mention the last sentence. There's a lot of things we can't do :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in this case it is something that might occur to a reader. It certainly occurred to me, so I tried it - unsuccessfully 😅
By saying upfront that it won't work we save someone who is in that situation the time to try that out.
Kubernetes cannot be sure that it is okay to remove the PV. | ||
|
||
When an administrator is sure that the node is never going to come | ||
back, then the local volumes can be removed manually: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that they also should make sure the data on the disks are deleted before bringing the disk back into service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added "It may also be necessary to scrub disks before reusing them because
the CSI driver had no chance to do that."
pkg/controller/controller.go
Outdated
// NodeDeployment contains additional parameters for running external-provisioner alongside a | ||
// CSI driver on one or more nodes in the cluster. | ||
type NodeDeployment struct { | ||
NodeName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some comments on what each of these fields do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
switch selectedNode { | ||
case "": | ||
logger := klog.V(5) | ||
if logger.Enabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we do this method of checking log level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because then the check only needs to be done once and the overhead for defer
can be avoided entirely in most cases. It's not important, I can also use klog.V(5).Info
instead if you prefer that.
pkg/controller/controller.go
Outdated
klog.V(5).Infof("will try to become owner of PVC %s/%s in %s, attempt #%d", claim.Namespace, claim.Name, delay, attempts) | ||
sleep, cancel := context.WithTimeout(ctx, delay) | ||
defer cancel() | ||
ticker := time.NewTicker(10 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the polling period to be so frequent considering that the delays are in seconds?
685a7a3
to
28257ef
Compare
@msau42 : please take another look. |
It is uncertain whether that option is needed. Removing it simplifies the code.
/retest |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msau42, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…se-4.4 Update go modules
What type of PR is this?
/kind feature
What this PR does / why we need it:
A CSI driver which manages local resources typically has no control plane and thus (currently) can only provision volumes on a single node. With this PR, external-provisioner gets extended to support deployment alongside a local CSI driver on each node.
The csi-driver-host-path deployment could be extended to use this.
Which issue(s) this PR fixes:
Fixes #487
Special notes for your reviewer:
This is based on #367. Out of courtesy to @jsanda, the original commit is the one from that PR although it gets changed substantially later on. I could also squash more aggressively, if that is desired.
TODO:
Does this PR introduce a user-facing change?: