Allow CSI drivers to specify how many volumes can be attached to single node #217
Comments
The max count should be reported per node. Each node could support a different number of volumes depending on its size.
I'm not sure I understand the design of this. Can you give the details of when/why this would be needed?
Many cloud providers' storage has limits on how many of those devices can be attached to a node, and the limit can depend on node size.
Yes, for example, only 40 EBS volumes (it can vary depending on node size) can be attached to an EC2 node. The CSI driver needs to provide hints to the CO, so that the CO can stop scheduling new containers to the node if the node is already at maximum capacity.
Ah, ok. This makes sense.
In the community meeting today we discussed that this could be used to make the scheduler smarter: the scheduler keeps track of the current count of volumes "available" to a node, gets the max count from the node to infer the remaining capacity, and uses this information to help make scheduling decisions. The problem with this approach is that there is no reliable way for the CO to recover the current usage if anything happens out-of-band. The problem with not directly providing the current capacity is that it is a dynamic property, so there could be a race between querying the current capacity from a node and scheduling other workloads. This makes me wonder whether the available capacity could be obtained on the controller side given a node id. If so, we could have a controller API return the "available_publish_capacity" instead, and the CO could reliably get the available count and potentially synchronize its interaction with the controller plugin to avoid races. Thoughts?
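To make that concrete, here is a rough, purely hypothetical sketch of such a controller-side query. The RPC, message, and field names below are invented for illustration and are not part of the CSI spec:

```protobuf
syntax = "proto3";

// Hypothetical controller-side query -- NOT part of the CSI spec.
// Shown only to illustrate the "available_publish_capacity" idea.
service Controller {
  rpc GetAvailablePublishCapacity (GetAvailablePublishCapacityRequest)
      returns (GetAvailablePublishCapacityResponse) {}
}

message GetAvailablePublishCapacityRequest {
  // The node whose remaining attach capacity is being queried.
  string node_id = 1;
}

message GetAvailablePublishCapacityResponse {
  // Number of additional volumes that could currently be published
  // (attached) to the given node; a point-in-time value that may be
  // stale by the time the CO acts on it.
  int64 available_publish_capacity = 1;
}
```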
The problem is that synchronizing the scheduler with the "currently available" count introduces significant latency and still doesn't completely solve the race conditions. The scheduler has to wait until the controller updates the "current count" before scheduling the next workload. And you still have the race where, after the scheduler reads the value, some out-of-band attach could modify the count.
Yeah, my proposal still doesn't resolve the out-of-band cases. Then it makes sense to have such static information provided by the node plugin.
I am leaning towards adding a "best effort" number of "currently attached volumes" to a node RPC call. Yes, the result could be racy in the CO, because if it schedules a bunch of workloads to a node based on that information, the node could end up overshooting the threshold. But I think solving this race condition is not within the scope of CSI. We can report the number of volumes attached at a point in time, and a scheduler may choose to implement an internal cache that handles the race of scheduling too many workloads at once. @chhsia0 @cpuguy83 thoughts?
And yes, if we implement such an RPC call, then that information can be queried directly from the SP and returned. This would take out-of-band attach requests into account too.
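A minimal sketch of what such a best-effort, node-side report could look like; the message and field names below are hypothetical, not proposed spec text:

```protobuf
syntax = "proto3";

// Hypothetical node-side response -- names are illustrative only,
// not proposed spec text.
message NodeGetVolumeLimitsResponse {
  // Static upper bound on the number of volumes that can be
  // attached to this node.
  int64 max_volume_count = 1;
  // Best-effort count of volumes attached at the time of the call,
  // including out-of-band attachments known to the SP. The CO must
  // tolerate this value being stale.
  int64 attached_volume_count = 2;
}
```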
I am not sure if this is the best place to bring this up, but in the last meeting there was the following decision:
I am worried that this is a slippery slope. Limiting features based on the implementation of optional services will eventually mean that those services are not really optional. Also, if we start now, it will just pile up in the future. Note that I am not against this decision; however, I am not sure that it is necessarily better than adding an extra call to find out whether a volume is attached, since there are many places where this can happen (ControllerPublish, NodeStage, NodePublish). The extra call might "pollute" the spec, but by taking the other route we are essentially saying that ControllerPublish is no longer optional. Thanks
As part of fixing kubernetes/enhancements#554, I believe we need to extend the spec to add support for the maximum number of volumes that can be attached to a single node. For the sake of discussion, let's call it `max_volume_count`.

There are a couple of places we could put this:

`GetPluginInfoResponse` seems like one of the easiest places. The problem, of course, is that the `manifest` field inside it is a `map[string]string`, so we would have to convert the string to an int before using the field. But there may be a bigger problem: the same driver running on two different nodes may have different values of `max_volume_count`. For example, the EBS driver on node A may support 40 volumes while it may only support 20 on node B.

Another option is something like `NodeGetCapabilitiesResponse`.

Does anyone have any other thoughts on this?
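For comparison, a node-scoped typed field would avoid the string-to-int conversion that the manifest map requires and would naturally allow per-node values. A hypothetical sketch, with a message name invented purely for illustration:

```protobuf
syntax = "proto3";

// Hypothetical node-scoped message carrying the limit as a typed
// field, avoiding the map[string]string manifest round-trip.
// Message and field names are invented for illustration.
message NodeGetLimitsResponse {
  // Maximum number of volumes that can be attached to this node.
  // Reported per node, so two nodes running the same driver can
  // legitimately return different values (e.g. 40 vs. 20 for EBS).
  int64 max_volume_count = 1;
}
```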