Proposal: Cluster Scoped Resources #1400
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
# Cluster Scoped Resources
please follow the KEP process outlined by @kubernetes/sig-architecture-feature-requests
Is KEP now a requirement or a recommendation? That was not clear from the contributor summit discussions.
/cc @jdumars
@vishh @timothysc: is this the template that needs to be followed: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md
Cluster scoped resources are consumable resources that do not belong to any specific node but instead are available across multiple nodes in a cluster. These resources are accounted like other consumable resources and should be usable by the scheduler when deciding whether a pod can actually be scheduled.
## Motivation
Software licenses are the most common reason for such features in other systems.
Force-pushed from 6f9143b to 62204c3.
## Motivation

Resources in Kubernetes such as CPU and memory are available at a node level and can be consumed by pods by requesting them. However, there are some resources that do not belong to a specific node but are consumable across all or a group of nodes in the cluster. As an example, IP addresses in a pool can be shared across pods running on multiple nodes in a network scope. Another use case could be locally attached shared storage in a rack, which is consumable across several nodes. Hence there is a need to represent such a resource at the cluster level, consumable across all or a group of nodes in the cluster.
This is more like a node group scoped resources in the examples.
Please add a list of 5-8 example resources that would be tracked like this. I’d like more validation and concrete discussion on each type to guide design.
+1. There are many use cases for extending resource APIs and I'd like to first get a collection of use-cases before identifying possible solutions.
Added a few use cases.
cc/ @kubernetes/sig-scheduling-feature-requests
Thanks for writing this, it's definitely a feature we have been talking about for a while. I think a complete solution to this problem should consider how the resource allocator for the cluster-level resource fits in. I think that cluster-scoped resources are likely to have some kind of external allocator, for example the agent that hands out IP addresses or software licenses.

It's important for the scheduler's view of free resources to stay in sync with that of the external allocator, which has the authoritative information, so that we can minimize the likelihood that a container starts up and finds that the resource is not actually available. For example, with a normal resource the scheduler assumes the resources become freed when the pod terminates or is deleted. But with cluster-level resources, if we leave the allocation and deallocation of the resource up to the container, it might be possible to leak resources (container forgets to release the IP address or license, or gets killed before it can, so the resource is still allocated but the scheduler thinks it is free because the pod has terminated).

So maybe the scheduler should be responsible for reserving the resource from the allocator before binding the pod, and unreserving the resource via the allocator when the pod terminates. It's probably quite complicated to ensure that a container only tries to allocate resources that have been reserved for it, so it's probably not a "secure" solution but might be good enough.

One approach is what we did for PDB and ResourceQuota, where decrementing the amount free is synchronous with requesting it (in the cluster-scoped resource case, this would mean the scheduler decrements the free) but replenishing the resource when it is no longer in use is asynchronous and done by a separate controller (could be the agent that is responsible for the cluster-level resource, when a container deallocates the resource).
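A minimal sketch of the reserve-before-bind, replenish-asynchronously scheme discussed in this thread. `ClusterResourcePool` and its methods are hypothetical names invented for illustration, not part of any Kubernetes API:

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterResourcePool is a hypothetical pool of a cluster-scoped resource.
// The scheduler decrements the free amount synchronously before binding;
// a separate controller replenishes it asynchronously after pod termination.
type ClusterResourcePool struct {
	mu        sync.Mutex
	capacity  int64
	allocated int64
}

// Reserve is called by the scheduler before binding a pod; it fails if the
// pool does not have enough free quantity, so a pod never starts on a node
// only to find the resource already exhausted.
func (p *ClusterResourcePool) Reserve(qty int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.capacity-p.allocated < qty {
		return false
	}
	p.allocated += qty
	return true
}

// Release is called asynchronously by the controller that owns the resource
// (e.g. the agent handing out IP addresses) once the pod has terminated and
// the resource is confirmed free, avoiding leaks if the container dies early.
func (p *ClusterResourcePool) Release(qty int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.allocated -= qty
	if p.allocated < 0 {
		p.allocated = 0
	}
}

func main() {
	pool := &ClusterResourcePool{capacity: 2}
	fmt.Println(pool.Reserve(1)) // succeeds: 2 free
	fmt.Println(pool.Reserve(2)) // fails: only 1 free
	pool.Release(1)
	fmt.Println(pool.Reserve(2)) // succeeds again after replenish
}
```

The synchronous decrement mirrors what the comment describes for PDB and ResourceQuota; only the asynchronous replenish path trusts the external allocator.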
}

// ClusterResource represents a resource which is available at a cluster level
type ClusterResource struct {
If this is a form of quota, it should be named as such - ClusterResourceNodeQuota. It’s not actually clear how this api aligns with ResourceQuota, please comment to that effect.
`ClusterResource` is an API type that represents a cluster scoped resource. However, its integration with resource quotas needs to be added, probably at a later phase such as beta.
// pkg/api/types.go:

// ClusterResourceQuantity represents quantity of a ClusterResource
type ClusterResourceQuantity struct {
What is the discovery/initialization flow?
A cluster admin or other controllers will post the `ClusterResource` objects that capture the capacity and allocatable quantities of a `ClusterResource`, which will then be used by the scheduler.
}
```
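To make that discovery flow concrete, here is an illustrative shape for the object a cluster admin or controller might post; the field names are assumptions based on the discussion above, not the actual proposed types:

```go
package main

import "fmt"

// ClusterResource here is an illustrative stand-in for the proposed API type:
// it carries the capacity and allocatable quantities that the scheduler
// would watch. Field names are assumptions, not the real proposal.
type ClusterResource struct {
	Name         string
	Capacity     int64             // total quantity provisioned in the cluster
	Allocatable  int64             // quantity currently available for scheduling
	NodeSelector map[string]string // optional: limit the resource to a group of nodes
}

func main() {
	// e.g. a pool of 100 routable IP addresses shared across one rack
	ipPool := ClusterResource{
		Name:         "example.com/routable-ips",
		Capacity:     100,
		Allocatable:  100,
		NodeSelector: map[string]string{"rack": "r1"},
	}
	fmt.Printf("%s: %d/%d available\n", ipPool.Name, ipPool.Allocatable, ipPool.Capacity)
}
```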
`clusterInfo` is added to the scheduler cache to do accounting for ClusterResources consumed by pods. `clusterInfo` will be exposed to the predicate and priority functions in order to take ClusterResources into consideration while making scheduling decisions.
How will `clusterInfo` be built?
`clusterInfo` can be built similar to how we build `nodeInfo`, since the scheduler will be watching for ClusterResources.
ClusterResources are consumable by pods just like CPU and memory, by specifying them in the pod request. The scheduler should take care of the resource accounting for ClusterResources so that no more than the available amount is simultaneously allocated to Pods. The prefix used to identify a ClusterResource could be
``` | ||
pod.alpha.kubernetes.io/cluster-resource- |
I'm not a fan of special prefixes. I'd like to see if we can avoid overloading resource names.
+1, we just moved away from this pattern with extended resources.
It can follow fully-qualified resource names similar to extended resources, but we need to see how those will be differentiated.
Force-pushed from 62204c3 to 69a0ddd.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: nikhildl12 Assign the PR to them by writing The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
@davidopp @timothysc @vishh: I would like to understand what the next steps for this proposal could be. As a first action item, I can submit this in the form of a KEP: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md
@nikhildl12 One important step is to sort out your CLA, as outlined here: #1400 (comment)
/ok-to-test
``` | ||
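As a rough illustration of the naming question debated above, a helper could tell cluster-resource requests apart from node-level ones by the proposed alpha prefix. The helper names are hypothetical, and note the thread leans toward fully-qualified names instead of a special prefix:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterResourcePrefix is the prefix proposed in this document; the thread
// questions whether overloading resource names this way is a good idea.
const clusterResourcePrefix = "pod.alpha.kubernetes.io/cluster-resource-"

// isClusterResource reports whether a resource name in a pod request refers
// to a cluster-scoped resource under the prefix scheme.
func isClusterResource(name string) bool {
	return strings.HasPrefix(name, clusterResourcePrefix)
}

// clusterResourceName strips the prefix to recover the bare resource name.
func clusterResourceName(name string) string {
	return strings.TrimPrefix(name, clusterResourcePrefix)
}

func main() {
	fmt.Println(isClusterResource("pod.alpha.kubernetes.io/cluster-resource-licenses")) // true
	fmt.Println(isClusterResource("cpu"))                                               // false
	fmt.Println(clusterResourceName("pod.alpha.kubernetes.io/cluster-resource-licenses"))
}
```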
### Accounting in scheduler |
We have extended resources and several types of first class resources in the scheduler already. I think it would be possible to come up with a single presentation that covers all of these types. For example, I don't see much of a difference between a cluster resource and extended resource from scheduler's point of view. An extended resource with an additional "type" can represent a cluster resource.
The key difference between these two resources is the "scope".
Currently extended resources are exposed as a part of node status because of their nature of being tied to a node, while cluster scoped resources have to be represented outside the scope of a node. But we can surely have a comprehensive API that covers both. From the scheduler's point of view, it will need some additional logic to calculate and cache the available capacity of a cluster scoped resource across a set of nodes
Yes, the "ResourceClass" that @jiayingz is working on is an effort in that direction to provide a comprehensive API to represent various types of resources, including cluster resources.
Yes. As @bsalamat mentioned, we are working on a new Resource API proposal that aims to provide a comprehensive API for both node-level resources and cluster-level resources. Here is the current PR:
#782
It is still WIP and the current plan is to focus on node-level resources during the initial phase. But I think even the initial API should help solve some of the listed problems here. Please take a look and let us know if you see any missing pieces.
### Accounting in scheduler

ClusterResources should be tracked as normal consumable resources and should be considered by the scheduler when determining whether a pod can actually be scheduled.
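A sketch of what a fit predicate over ClusterResources might look like, under the assumption that the cache exposes allocatable and already-requested quantities per resource name; this is illustrative only, not the scheduler's actual predicate signature:

```go
package main

import "fmt"

// podFitsClusterResources reports whether a pod's cluster-resource requests
// fit: every requested resource must have enough quantity left after
// subtracting what already-scheduled pods consume. Names are hypothetical.
func podFitsClusterResources(requests, allocatable, requested map[string]int64) bool {
	for name, qty := range requests {
		if allocatable[name]-requested[name] < qty {
			return false
		}
	}
	return true
}

func main() {
	allocatable := map[string]int64{"example.com/ips": 10}
	requested := map[string]int64{"example.com/ips": 8} // 2 free
	fmt.Println(podFitsClusterResources(map[string]int64{"example.com/ips": 2}, allocatable, requested)) // true
	fmt.Println(podFitsClusterResources(map[string]int64{"example.com/ips": 3}, allocatable, requested)) // false
}
```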
Another important aspect of cluster resources which is not covered here is how to bind these resources to a chosen node during/after scheduling. A fairly complex logic is already added to scheduler to handle provisioning and binding PVs to nodes during scheduling. Similar processes may be needed for other resources, such as TPUs, etc. I think that aspect should be covered by the proposal.
To cover that aspect, I prefer the approach mentioned by @davidopp in his previous comment. The external agent/controller which exposes the available capacity of this resource can be made responsible for binding or making sure that those resources are ready to use when a pod is going to run on a node. Similarly when a pod dies, that agent needs to deallocate/unbind the corresponding resource and increment the available quantity so that it can be used for scheduling of new pods
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this PR. In response to this:
Why has this been abandoned? The proposal seems fair.
Initial proposal for cluster scoped resources
Related: kubernetes/kubernetes#19080 and https://groups.google.com/forum/#!topic/kubernetes-users/eUUrdlBwa7g