
Allow CSI drivers to specify how many volumes can be attached to single node #217

Closed
gnufied opened this issue Apr 10, 2018 · 11 comments

Comments

@gnufied
Contributor

gnufied commented Apr 10, 2018

As part of fixing kubernetes/enhancements#554, I believe we need to extend the spec to support the maximum number of volumes that can be attached to a single node. For the sake of discussion, let's call it max_volume_count.

There are a couple of places where we could do this:

  1. GetPluginInfoResponse seems like one of the easiest places. The problem, of course, is that the manifest field inside it is a map[string]string, so we would have to convert the string to an int before using the field. A bigger problem is that the same driver running on two different nodes may have different values of max_volume_count. For example, the EBS driver on node A may support 40 volumes while it supports only 20 on node B.

  2. Another option is something like NodeGetCapabilitiesResponse.

Does anyone have any other thoughts on this?
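
To make the trade-off concrete, here is a rough Go sketch of how a CO might consume the limit under each option. The "max_volume_count" manifest key and the node-scoped struct are illustrative assumptions, not spec messages.

```go
package main

import (
	"fmt"
	"strconv"
)

// Option 1: the limit rides inside GetPluginInfoResponse's manifest, which is
// a map[string]string, so the CO has to parse the value, and the number is
// the same for every node running the driver. The "max_volume_count" key is
// hypothetical.
func limitFromManifest(manifest map[string]string) (int64, error) {
	raw, ok := manifest["max_volume_count"]
	if !ok {
		return 0, fmt.Errorf("driver did not report max_volume_count")
	}
	return strconv.ParseInt(raw, 10, 64)
}

// Option 2: a typed, node-scoped field (sketched here as a plain struct, not
// an existing spec message) would need no parsing and would let node A report
// 40 while node B reports 20.
type nodeVolumeLimit struct {
	NodeID         string
	MaxVolumeCount int64
}

func main() {
	fromManifest, err := limitFromManifest(map[string]string{"max_volume_count": "40"})
	fmt.Println(fromManifest, err)

	perNode := nodeVolumeLimit{NodeID: "node-b", MaxVolumeCount: 20}
	fmt.Println(perNode.NodeID, perNode.MaxVolumeCount)
}
```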

@msau42

msau42 commented Apr 16, 2018

The max count should be reported per node. Each node could support a different number of volumes depending on its size.

@cpuguy83
Contributor

I'm not sure I understand the design of this. Can you give the details of when/why this would be needed?

@msau42

msau42 commented Apr 17, 2018

Many cloud providers' storage offerings have limits on how many of those devices can be attached to a node, and the limit can depend on node size.

@gnufied
Contributor Author

gnufied commented Apr 17, 2018

Yes, for example, only 40 EBS volumes (the number can vary depending on node size) can be attached to an EC2 node.

The CSI driver needs to provide hints to the CO, so that the CO can stop scheduling new containers to a node that is already at maximum capacity.
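
A minimal sketch of the scheduler-side check this hint would enable (the helper below is hypothetical, not part of the spec or of any particular scheduler):

```go
package main

import "fmt"

// canScheduleVolume is a hypothetical CO-side predicate: stop placing new
// volume-using workloads on a node once the number of volumes already
// attached reaches the maximum reported by the driver for that node.
func canScheduleVolume(attached, maxVolumeCount int64) bool {
	return attached < maxVolumeCount
}

func main() {
	// e.g. an EC2 node that allows 40 EBS attachments and already has 40.
	fmt.Println(canScheduleVolume(40, 40)) // false: node is at capacity
	fmt.Println(canScheduleVolume(12, 40)) // true: room for more volumes
}
```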

@cpuguy83
Contributor

Ah, ok. This makes sense.

@chhsia0
Contributor

chhsia0 commented Apr 18, 2018

In the community meeting today we discussed that this could be used to make the scheduler smarter: the scheduler keeps track of the current count of volumes "available" to a node, gets the max count from the node to infer the remaining capacity, and uses this information to help make scheduling decisions. The problem with this approach is that there is no reliable way for the CO to recover the current usage if anything happens out-of-band.

The problem with not directly providing the current capacity is that it is a dynamic property, so there might be a race between querying the current capacity from a node and scheduling other workloads. This makes me wonder whether the available capacity could be obtained on the controller side given a node ID. If so, we could have a controller API return "available_publish_capacity" instead, and the CO could reliably get the available count and potentially synchronize its interaction with the controller plugin to avoid races.
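
To make that concrete, a rough sketch of such a controller-side query (the types and field names below are illustrations derived from this comment, not CSI spec messages):

```go
package main

import (
	"context"
	"fmt"
)

// GetPublishCapacityRequest/Response sketch the controller-side query floated
// above: given a node ID, report how many more volumes could still be
// published to that node. These are hypothetical types, not spec messages.
type GetPublishCapacityRequest struct {
	NodeID string
}

type GetPublishCapacityResponse struct {
	AvailablePublishCapacity int64
}

// fakeController stands in for an SP controller plugin in this illustration.
type fakeController struct {
	remaining map[string]int64 // node ID -> remaining publishable volumes
}

func (c *fakeController) GetPublishCapacity(_ context.Context, req *GetPublishCapacityRequest) (*GetPublishCapacityResponse, error) {
	return &GetPublishCapacityResponse{AvailablePublishCapacity: c.remaining[req.NodeID]}, nil
}

func main() {
	ctrl := &fakeController{remaining: map[string]int64{"node-a": 3}}
	resp, _ := ctrl.GetPublishCapacity(context.Background(), &GetPublishCapacityRequest{NodeID: "node-a"})
	fmt.Println(resp.AvailablePublishCapacity) // 3 more volumes could be published
}
```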

Thoughts?

@msau42

msau42 commented Apr 18, 2018

The problem is that synchronizing the scheduler with the "current available" count introduces significant latency and still doesn't completely solve the race conditions. The scheduler has to wait until the controller updates the "current count" before scheduling the next workload, and you still have the race where, after the scheduler reads the value, an out-of-band attach could modify the count.

@chhsia0
Contributor

chhsia0 commented Apr 18, 2018

Yeah, my proposal still doesn't resolve the out-of-band cases. Then it makes sense to have such static information provided by the node plugin.

@gnufied
Contributor Author

gnufied commented Apr 18, 2018

I am leaning towards adding a "best effort" count of "currently attached volumes" to a node RPC call. Yes, the result could be racy in the CO: based on that information it might schedule a bunch of workloads to a node, and the node could end up overshooting the threshold.

But I think solving this race condition is not within the scope of CSI. We can report the number of volumes attached at a point in time; a scheduler may choose to implement an internal cache that handles the race of scheduling too many workloads at once. @chhsia0 @cpuguy83 thoughts?
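
For illustration, the kind of internal scheduler cache mentioned above could look roughly like this (all names are hypothetical); it layers in-flight reservations on top of the point-in-time count reported by the plugin:

```go
package main

import (
	"fmt"
	"sync"
)

// volumeReservations is a hypothetical CO-side cache: it tracks volumes that
// have been scheduled but not yet attached, so concurrent scheduling
// decisions do not overshoot the limit reported by the plugin.
type volumeReservations struct {
	mu       sync.Mutex
	reserved map[string]int64 // node ID -> volumes scheduled but not yet attached
}

// tryReserve records a reservation and returns true only if the node still
// has room given the reported attached count and maximum.
func (r *volumeReservations) tryReserve(node string, attached, max int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if attached+r.reserved[node] >= max {
		return false
	}
	r.reserved[node]++
	return true
}

func main() {
	cache := &volumeReservations{reserved: map[string]int64{}}
	// Node reports 39 of 40 volumes attached; only one more reservation fits.
	fmt.Println(cache.tryReserve("node-a", 39, 40)) // true
	fmt.Println(cache.tryReserve("node-a", 39, 40)) // false
}
```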

@gnufied
Contributor Author

gnufied commented Apr 18, 2018

And yes, if we implement such an RPC call, that information can be queried directly from the SP and returned. This will take out-of-band attach requests into account too.
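
As a best-effort illustration of querying the backend directly instead of relying on CO bookkeeping (the device glob below is only an example, not how any particular driver works):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// attachedVolumeCount is a hypothetical, best-effort node-plugin helper: it
// derives the count from what is actually attached (approximated here by
// scanning device paths) rather than from CO bookkeeping, so out-of-band
// attaches are included.
func attachedVolumeCount(devGlob string) (int, error) {
	devices, err := filepath.Glob(devGlob)
	if err != nil {
		return 0, err
	}
	return len(devices), nil
}

func main() {
	// e.g. EBS NVMe devices on an EC2 node; the glob is illustrative only.
	count, err := attachedVolumeCount("/dev/nvme*n1")
	fmt.Println(count, err)
}
```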

@pure-yesmat

I am not sure if this is the best place to bring this up, but in the last meeting there was the following decision:

Limit this feature to only ControllerPublish; explicitly call out that NodeStage and NodePublish are not going to be considered for attaching a volume.

I am worried that this is a slippery slope. Limiting features based on the implementation of optional services will eventually mean that this service is not really optional. Also, if we start now, it will just pile up in the future.

Note that I am not against this decision; however, I am not sure that it is necessarily better than adding an extra call to find out whether a volume is attached, since there are many places where attachment can happen (ControllerPublish, NodeStage, NodePublish). The extra call might "pollute" the spec, but by taking the other route we are essentially saying that ControllerPublish is no longer optional.

Thanks
