KEP: New Resource API proposal #2265
Conversation
/uncc
/assign @dchen1107 @derekwaynecarr
    operator: "GtEq"
    values:
    - "30G"
```
Mind adding another yaml example of how a Pod references a ResourceClass?
good idea. will do.
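For illustration, a minimal sketch of how a Pod might reference a ResourceClass by name through its container resource requests/limits, which is the consumption model described later in this proposal. The class name `example.com/gpu-platinum` and the image are placeholders, not names defined by the KEP:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        # Assumed convention: the ResourceClass is requested by name,
        # the same way extended resources are requested today.
        example.com/gpu-platinum: 2
```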
- Other than per-device quota, I feel most other use cases are ahead of their time.
- If Quota is the main problem to tackle right now, then does it require a new set of Resource APIs, or can it be solved via [admission extensions](https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyY) or some other way?
- If you'd like to test the waters with other use cases, this proposal should ideally be implemented via extensions. If extensions are inadequate, we should try to address extension gaps.
keps/sig-node/00014-resource-api.md
Outdated
## User Stories
### As a cluster operator:
- Nodes in my cluster have GPU HW from different generations. I want to classify GPU nodes into one of three categories, silver, gold and platinum, depending upon the launch timeline of the GPU family, e.g. Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.<br/>
**Motivation:** As time progresses in a cluster lifecycle, new advanced, high-performance, expensive variants of GPUs get added to the cluster nodes. At the same time, older variants also co-exist. There are workloads which strictly want the latest GPUs, and there are also workloads which are fine with older GPUs. But since there is a wide range of types, it would be hard to manage, and confusing at the same time, to have granularity at each GPU type. Grouping into a few broad categories will be convenient to manage.<br/>
Are there any users today that need this feature from kubernetes?
also for this user story, I think we could use NodeAffinity to choose different GPU types.
I think NodeAffinity shares a similar problem as node taints: cluster administrators cannot apply proper access control to restrict a user pod from using them. As mentioned at the very beginning, the goal of this proposal is "to better support non-native compute resources on kubernetes". We want to allow users to request them as compute resources, and allow administrators to control their access through resource quota.
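To illustrate the quota point, a sketch of a namespace-level cap on a ResourceClass, assuming ResourceClass names can be quota-controlled the same way extended resources are today; the class name and namespace are placeholders:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-platinum-quota
  namespace: team-a                  # placeholder namespace
spec:
  hard:
    # Assumed convention, mirroring extended-resource quota syntax:
    # cap total requests for the "example.com/gpu-platinum" ResourceClass.
    requests.example.com/gpu-platinum: "4"
```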
This is in line with the general QoS model. We might like to experiment with this model in OpenShift. /cc @derekwaynecarr
Wondering how NodeAffinity can be tied to usage metrics, which will be needed to charge per usage.
Without knowing for sure that real users will benefit from it, I don't see why we'd solve this problem.
I see your point, and I always think a more general resource API model should be better than label-based solutions. :)
👍
if we are going to build a feature, we should have a clear user identified that we know will consume the feature to guide the use-case. I think we need to evaluate the feature relative to other ideas seen in the ecosystem.
for the similar use-case of "specify gpu attributes such as gpu type and memory requirements for deployment in heterogenous GPU clusters", nvidia appears to enable this by carrying two API fields on the pod spec.
see:
https://developer.nvidia.com/kubernetes-gpu
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2389
it would be good to evaluate resource class versus this other approach.
I'd first like to understand if gpu type and memory requirements are a real user concern today in the first place even before considering possible solutions.
There are users who are sufficiently happy with using node selectors and most users today seem to bind pods to specific gpu types either for cost or specific memory requirements.
@derekwaynecarr we have scalability concerns with the kubernetes-gpu approach: selector computation will be done for each compute resource in the scheduler cache. OTOH, with resource classes, there will be far fewer resource classes than compute resources.
Another concern is portability.
Thanks a lot for the pointers, @derekwaynecarr
As @vikaschoudhary16 mentioned, looking at nvidia's non-upstream solution, the main difference is that they changed the current resource requirement API of the container spec. In our proposal, we explicitly mention that this is a non-goal, for the following reasons. First, in a large cluster, computing operators like “greater than” and “less than” at pod creation can be a very slow operation and is not scalable; it can cause scaling issues on the scheduler side. Second, non-primary compute resources usually lack standard resource properties. Although there are benefits to allowing users to directly express their resource metadata requirements in their container spec, it may also compromise workload portability in the longer term. Third, resource quota control will become harder. Fourth, we may consider the resource requirement API change as a possible extension orthogonal to the ResourceClass proposal. By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly in different environments to meet such requirements. We feel this helps promote portability and separation of concerns, while still maintaining API compatibility.
keps/sig-node/00014-resource-api.md
Outdated
**How Resource classes can solve this:** I, the operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Now, since resource classes are quota controlled, end users will be able to request resource classes only if quota is allocated.
- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in the pod container spec. For example, N GPU units interconnected by NVLink or N cpu cores on the same NUMA node.<br/>
**Motivation:** Increased performance because of local reference. Local reference also helps make better use of caches.<br/>
This use case is unclear to me. What does local reference mean?
@vikaschoudhary16 s/local reference/local access?
for example, "local" cache from the same NUMA node in case of cores.
This intersects with topology awareness heavily. I think Resource Class (if it exists) should restrict itself to policy like allowing only certain shapes (2 GPU with max of 16 CPUs, ...). The topology aspect as currently planned is expected to be covered by QoS (or an additional application performance class API if necessary). Don't combine them both.
What this proposal focuses on is a building block that allows guaranteed metadata-aware resource scheduling by surfacing resource metadata to the scheduler. I think what kind of metadata people want to surface should be left to HW vendors, resource providers, or infrastructure admins, based on different HW properties, platform environments, and workload requirements. We can provide best-practice guidelines and scaling results for people to make the right decisions. Node-level best-effort topology-aware scheduling may allow better scaling, but I don't think we want to take an opinionated position here.
keps/sig-node/00014-resource-api.md
Outdated
**How Resource classes can solve this:** The property/attribute which forms the grouping can be advertised in the device attributes, and a resource can then be created to form a grouped super-resource based on that property.<br/>
**Can this be solved without resource classes:** No
- I want to have quota control on devices at the granularity of device properties. For example, I want to have a separate quota for ECC-enabled GPUs. I don't want a specific user to be able to use more than ‘N’ ECC-enabled GPUs overall at the namespace level.<br/>
ECC is probably not a good example. Device types might be more common.
I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N, ideally based on concrete user feedback.
It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.
agree on Device type. Actually, ECC enabled GPU will be a different ComputeResource as mentioned in the sections below.
I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N, ideally based on concrete user feedback.
"number of dimensions" is not clear to me. Could you clarify?
It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.
I see the goal, but is there a real-world use case for N? I only see the need for 1 dimension now.
keps/sig-node/00014-resource-api.md
Outdated
- In my cluster, I have many different classes (different capabilities) of a device type (e.g. NICs). End users' expectations are met as long as the device has a very small subset of these capabilities. I want a mechanism where end users can request devices which satisfy their minimum expectations.
A few nodes are connected to the data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end-user pods to be able to request
data network connectivity with high network performance, while
in the default case, data network connectivity is offered via normal 1 Gbps NICs.<br/>
This feels like a niche use case. Why can't the existing labels+affinity features work for this use case?
Also, why not build policies to restrict access via admission plugins rather than adding a new core resource?
I think if people need to build and deploy various admission plugins to restrict access on different HW with different properties, that indicates the need for a general framework to support that use case.
keps/sig-node/00014-resource-api.md
Outdated
**Can this be solved without resource classes:** Taints and tolerations can help in steering pods, but the problem is that there is no way today to have access control over the use of tolerations; therefore, if there are multiple users, it is not possible to control which tolerations are allowed.<br/>
**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and make sure only users with proper quota can use such resources.
- I want to be able to utilize different 'types' of a HW resource while not losing workload portability when moving from one cluster to another. There can be Nvidia GPUs on one cluster and AMD GPUs on another cluster. This is an example of different ‘types’ of a HW resource (GPU). I want to offer GPUs to be consumed under the same portable name, as long as their capabilities are almost the same. If pods are consuming these GPUs with a generic resource class name, the workload can be migrated from one cluster to another transparently.<br/>
This feels like a dream given the state of SW today based on my experience. For example, Tensorflow struggles working seamlessly across compute types (CPU, GPU, etc) and sub-architectures (Skylake, V100, AMD).
I feel we need to wait a bit for the world to evolve for this use case to become valid in k8s.
For GPUs from different vendors, I agree their properties can be quite different currently, although I wonder whether the difference is less significant for certain workloads like video decoding. For high-performance NICs, I think the user experience is perhaps less diversified. I also feel promoting portability is always a strong motivation on Kubernetes.
keps/sig-node/00014-resource-api.md
Outdated
**Motivation:** I want minimum guaranteed compute performance<br/>
**Can this be solved without resource classes:**<br/>
- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using label selectors at the user level as with the use of tolerations.
Wouldn't quota take care of access control to an extent?
Can you please elaborate on how quota can be used with labels?
Yes. That is why we would like to introduce ResourceClass that fits naturally with resource quota.
@vishh if I understand the proposed alternative, it's basically treating it as an opaque resource by convention? The user still needs to couple the opaque resource consumption with the device consumption, and that really can't be done until scheduling, no?
@derekwaynecarr if we can assume 1-1 mapping between the opaque resource and actual device, then we don't have to be concerned with scheduling right?
I'm not sure if clobbering resource requests in a webhook is possible though.
keps/sig-node/00014-resource-api.md
Outdated
- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using label selectors at the user level as with the use of tolerations.
- OR, instead of using resource class, provide flexibility to query resource properties directly in pod container resource requests.
Problem: In a large cluster, computing operators like “greater than”, “less than” at pod creation can be a very slow operation and is not scalable.
What is the scale we are targeting? If generic scheduling features don't scale, then it's a problem that needs to be tackled separately.
cc @bsalamat
Not sure I fully understand your comment here. What this paragraph means is that if we extend the container resource request API to directly specify metadata requirements, the scheduler needs to do label-selection matching on all of the compute resources in the cluster. But with ResourceClass, the scheduler can cache the compute resource to ResourceClass matching in its NodeInfo cache, so the current PodFitsNodeResource evaluation will mostly stay the same without introducing new scaling concerns.
Ah got it. So this is not really a point justifying the need of better resource APIs. It is about the internal design of such a new API.
keps/sig-node/00014-resource-api.md
Outdated
**How Resource classes can solve this:**
The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata-aware resource scheduling.
- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run either on on-prem Kubernetes clusters or in the cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.<br/>
This use case seems too vague. TBH, the workloads that consume additional HW are specialized enough that they require developer maintenance, and cluster admins may not be able to homogenize different environments.
I tend to disagree. I think a big value of Kubernetes is to allow separation of concerns: application developers can focus on their own software, with the underlying infrastructure taken care of by cluster admins.
@vishh quoting Henry from eBay in support of workload portability:
Hi Vish, Bobby, it is not exactly this requirement. Currently we ask developers to submit resource specifications for GPU using the name of the cards to our data center:
"accelerator": {
  "type": "gpu",
  "quantity": "1",
  "labels": {
    "product": "nvidia",
    "family": "tesla",
    "model": "m40"
  }
}
But when we go to other cloud such as Google or AWS they may not have the same cards.
So I was wondering if we could offer resources such as CUDA cores and memory as resource specifications rather than the actual name and type of the cards.
However, different cards such as AMD vs NVIDIA was not the goal, because we know program code written against NVIDIA cards will not work well if run with AMD cards.
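To make that ask concrete, a hypothetical ResourceClass that selects GPUs by capability rather than by card model, reusing the selector structure shown elsewhere in this KEP. The apiVersion/kind, class name, and the `cuda-cores`/`memory` attribute keys are assumptions about what a device plugin might advertise, not definitions from the proposal:

```
apiVersion: resource.k8s.io/v1alpha1   # placeholder API group/version
kind: ResourceClass
metadata:
  name: gpu-highmem
spec:
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:
    # Match any GPU with at least 3000 CUDA cores and 16G of device memory,
    # instead of pinning the workload to a specific card model.
    - key: "cuda-cores"
      operator: "GtEq"
      values:
      - "3000"
    - key: "memory"
      operator: "GtEq"
      values:
      - "16G"
```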
I'll wait for Henry to respond to that thread. I think you all should solicit feedback from real users (I'm thinking ML WG, SIG BIG Data, etc.) to figure out if this is really feasible. No user that I have spoken to is ready to consume this level of sophistication today.
I think we are proposing an infrastructure building block here whose target users are infrastructure admins and developers who want to make their systems easier to use by hiding the underlying hardware details from end users.
As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details.
Yes and no. But mostly no. As mentioned by @vishh, this is too vague, and a lot of what this statement covers is outside of K8s' scope.
- It might make sense when your users only request one GPU (but in that case, is ECC on/off a HW config?)
- When a user requests more than one GPU, they should at the very least be able to specify if the GPUs are linked through NVLINK
I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec.
What's blocking this today or what might be blocking this in the future?
When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.
How is this in the scope of Resource Classes?
- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.<br/>
**Motivation:** enables more compute resources and their advanced features on Kubernetes<br/>
**Can this be solved without resource classes:**<br/>
Yes, Using node labels and NodeLabelSelectors.<br/>
Again seems pretty vague. Is this a real need today? The example mentioned below also doesn't seem realistic.
Are there use cases outside of GPUs?
This user story is mostly motivated by some past discussions on device plugin feature requests, e.g., as @kad mentioned in kubernetes/kubernetes#59109 (comment)
I like the explicit API model: once ResourceClass is in place, Kubelet can pass the ResourceClass name to a device plugin, and the device plugin can map that ResourceClass name to the special underlying resource metadata requirements.
Another example could be different types of cpu cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.
For kubernetes/kubernetes#59109 (comment) wouldn't Pod annotations suffice?
Another example could be different types of cpu cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.
I don't think we have decided yet on what level of CPU specifics we want to expose to users.
After cpu manager's static policy, IMO, supporting features like isolated cores is a natural progression.
/cc @jeremyeder @kad
I think this feature is orthogonal to CPUs and confuses the discussion.
authors:
  - "@vikaschoudhary16"
  - "@jiayingz"
owning-sig: sig-node
I'd add sig-scheduling as well since there is quite some impact to scheduling, quota, etc.
Looking at the KEP template and existing ones, I think it expects one owning sig. But agree a big part of this proposal is on scheduler side, and we should add it as participating-sigs.
done
/cc
Workload portability is a feedback-driven, real use-case.
@jiayingz: GitHub didn't allow me to request PR reviews from the following users: hsaputra, bart0sh, fabiand. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
On Mon, Jul 30, 2018 at 3:33 AM Renaud Gaubert wrote:
In keps/sig-node/00014-resource-api.md
<#2265 (comment)>:
> +to reserve expensive compute resources and control their access with resource
+quota, we propose to include a Priority field in ResourceClass API.
+By default, the value of this field is set to zero, but cluster admins can set
+it to a higher value, which would prevent its matching compute resources from
+being matched by lower priority ResourceClasses. i.e.,
+when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority.
+Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints.
+
+Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2.
+
+To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedDeviceIDs map[v1.ResourceName][]types.UID` field in ContainerSpec. Adding this field has been discussed as a possible solution to support other use cases, such as third-party resource monitoring and network device plugins. For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the requested number of devices from the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the AllocatedDeviceIDs field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName.
+
+A main reason we propose to have the scheduler make and record device level
+scheduling decision is so that the scheduler can maintain accurate resource accounting information.
+The matching from a ResourceClass to the underlying compute resources may change
+from two kinds of updates. First, cluster admins may want to add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority.
For ResourceClass add/delete/update handling, as long as scheduler has
already assigned the pod to a node with a ComputeResource, it doesn't
matter whether the old ResourceClass would be valid or not
I pretty much disagree here. If we consider that Resource Class should
represent a billable resource then:
As a user when I submit a deployment using a certain Resource Class I
expect that the underlying pods created by that deployment will always
refer to the same kind of device.
That Resource Class should not suddenly refer to something else or cost
more when I update the number of replicas (or for whatever reason a pod
might be created).
For the billing use case, the billable resource is at the ResourceClass level, not
on the physical device. If the underlying hardware devices
are different enough that we want to charge them differently, we would then
want to create different ResourceClasses for them
and assign different resource quotas.
IMHO, edits should not be possible and Deletes should only be possible if
all references to that Resource Class have been deleted.
We may start by not allowing edits at the beginning, but I think totally
disallowing this would significantly impact user experience.
Consider, e.g., a cluster that has a ResourceClass defined to match different
types of the same hardware: it would be quite inconvenient
if cluster operators have to delete and recreate the ResourceClass to add a
new type of hardware.
For the deletion case, I agree we may consider not allowing deletes unless no
running pods still refer to that ResourceClass.
However, this doesn't seem to affect the logic on scheduler part much given
that we need to support ResourceClass addition anyway,
which requires the scheduler to update its cached info on ComputeResource
to ResourceClass matching.
On Mon, Jul 30, 2018 at 4:02 AM Renaud Gaubert wrote:
In keps/sig-node/00014-resource-api.md
<#2265 (comment)>:
> + - key: "Type"
+ operator: "Eq"
+ values:
+ - "nvidia-tesla-p100"
+ - key: "Nvlink"
+ operator: "Eq"
+ values:
+ - "true"
+```
+
+Now we face the question of whether the scheduler should allow Pods requesting
+"nvidia-p100" to land on a node in these new GPU node groups. So far, we have
+received different feedback on this question. In some use cases, users would
+like to have minimum matching behavior: as long as the underlying hardware
+matches the minimal requirements specified through ResourceClass constraints,
+they want to allow Pods to be scheduled on the hardware. On the other hand, some users desire to reserve expensive hardware resources for users who explicitly request them.
to hide underlying hardware difference to workloads that don't care about
such difference
Please highlight in the document which workloads and which hardware
specifically this requirement is for. Your code is usually already compiled
for specific hardware.
Currently people are doing this through taint and toleration
I agree that Resource Class needs a mechanism to signal that a Compute
Resource should belong to a Resource Class. Priority just doesn't seem to
be the right model.
It's not an intuitive model, it's completely error-prone and dangerous, and
absolutely not at the right granularity level.
I'd much rather we explore different, simpler models:
- Overlapping Resource Classes (multiple RCs map to the same Compute
Resource) with the ability to remove Comp Res from an RC
- Manually assign CRs to an RC when there is overlap
More generally, if we imagine Resource Class as a way to bill my users, I
expect them to be able to handle overlap. If I want to see how many GPUs I
have then I can just list the Compute Resources.
Could you add to the agenda to discuss whether we should omit the priority
field in the initial design in
tomorrow's sig-node meeting? I think we are open to not adding this field
during the initial phase
if people have concerns on its usage and the use case is not considered a
must-have. We can
explore a better way to support that use case after the initial phase. Looks like
you have some models
in mind that you think would work better. Could you write down your design
in a document with
end-to-end support in more detail? Right now, it is hard to tell how your
proposed models would
work end-to-end.
keps/sig-node/00014-resource-api.md
Outdated
scenario as new resource properties are introduced into the system. Therefore we
support this behavior by default. To also provide an easy way for cluster admins
to reserve expensive compute resources and control their access with resource
quota, we propose to include a Priority field in ResourceClass API.
Could we clarify the use cases that require non-overlapping resource classes?
- It seems this effect is achievable anyway if cluster admins design their resource class specs properly.
- With the described priority mechanism, to answer the question "why doesn't resource X on node Y match class Z?" users potentially have to inspect every resource class.
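For concreteness, a sketch of the overlap that the Priority field is meant to resolve, using the class names already discussed in this thread. The apiVersion/kind and the serialized `priority` field name are assumptions; the intended semantics, per the quoted text, are that the higher-priority class claims the ECC devices so the generic class no longer matches them:

```
apiVersion: resource.k8s.io/v1alpha1   # placeholder API group/version
kind: ResourceClass
metadata:
  name: nvidia-p100
spec:
  priority: 0                          # default: matches any qualifying device
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:
    - key: "Type"
      operator: "Eq"
      values:
      - "nvidia-tesla-p100"
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: nvidia-p100-ecc
spec:
  priority: 10                         # reserves ECC-enabled P100s for explicit requests
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:
    - key: "Type"
      operator: "Eq"
      values:
      - "nvidia-tesla-p100"
    - key: "ECC"
      operator: "Eq"
      values:
      - "true"
```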
Possible fields we may consider to add later include:
- `DeviceUnits resource.Quantity`. This field can be used to support fractional
resource or infinite resource. In a more advanced use case, a device plugin may
Infinite resources might come handy in the case when a "default" non-countable resource should be assigned to a container. This could be used for example to make a DP set a node-specific environment variable to a container.
Another use for infinite resources would be for metrics, if it's desired to know how much of a resource is being used but without intent of imposing a limit. For (non-)random example (not specifically applicable to this), using filesystem quotas to measure storage use by setting an effectively infinite quota.
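A hypothetical sketch of the non-countable case described above, assuming the deferred `DeviceUnits` field lands roughly as named in this KEP; the resource name, attribute key, and the idea of using a very large quantity to approximate an infinite resource are all assumptions, not part of the current proposal:

```
apiVersion: resource.k8s.io/v1alpha1   # placeholder API group/version
kind: ResourceClass
metadata:
  name: node-env-feature
spec:
  resourceName: "vendor.example.com/feature"   # placeholder raw resource name
  # Assumed semantics: a huge DeviceUnits value makes the resource effectively
  # non-countable, so any number of containers can request it, e.g. to have a
  # device plugin inject a node-specific environment variable.
  deviceUnits: "1000000000"
  resourceSelector:
  - matchExpressions:
    - key: "feature"
      operator: "Eq"
      values:
      - "node-env"
```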
**How Resource classes can solve this:**<br/>
Vendors can use the DevicePlugin API to propagate new hardware features, and provide best-practice ResourceClass specs to consume their new hardware or new hardware features on Kubernetes. Vendors don't need to worry that supporting this new hardware would break existing use cases on old hardware, because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, and so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated such resources.<br/>
- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in the pod container spec. For example, N GPU units interconnected by NVLink or N cpu cores on the same NUMA node.<br/>
This could very well be the frontend to previously-discussed CPU pool concept. It would just be configuration which would set the property (such as "these cpu cores belong to the AVX pool") and not any physical device property (such as the shared NUMA node). I think we need to keep the primary resources in mind for this proposal too, even if they are not part of the scope yet.
@vikaschoudhary16: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
**Motivation:** Empower enterprise customers to consume and manage non-primary resources easily, similar to how they consume and manage primary resources today.<br/>
**Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to allow users to specify their resource-specific metadata requirements. Workloads would have a different experience consuming non-primary compute resources on k8s. As time goes on and more non-upstream solutions are deployed, the user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node-level properties. They can't support multiple types of compute resources on a single node, and don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.<br/>
**How Resource classes can solve this:** I, the operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Now, since resource classes are quota controlled, end users will be able to consume the requested GPUs only if they have enough quota.<br/>
**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or InfiniBand cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may desire to use different types of high-network-access devices depending on their performance and cost tradeoff.<br/>
The 'devicey-ness' of NICs is usually not the only consideration. Nobody cares about a hardware NIC without caring about what it's connected to and the services being provided by that connection. Or to put it more clearly: the characteristics of the NIC itself are only a small part of the puzzle for network devices. :)
Agreed, and you are allowed to use any attributes to characterize a NIC as you want :).
@vikaschoudhary16 Yep, I like the attributes approach. I'm still trying to understand how attributes of different 'types' are handled. Perhaps a more realistic example for network devices could be added here?
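As a possibly more realistic network-device sketch: a hypothetical ResourceClass that combines a physical attribute (link speed) with a logical one (the attached network service), assuming a NIC device plugin advertises both as attributes. The names, keys, and values below are illustrative, not defined by the KEP:

```
apiVersion: resource.k8s.io/v1alpha1   # placeholder API group/version
kind: ResourceClass
metadata:
  name: fast-radio-nic
spec:
  resourceName: "vendor.example.com/nic"   # placeholder raw resource name from a NIC device plugin
  resourceSelector:
  - matchExpressions:
    - key: "speed"
      operator: "GtEq"
      values:
      - "40G"                          # see the units discussion below
    - key: "networkservice"
      operator: "Eq"
      values:
      - "radio-network"
```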
- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"
Question to test my own comprehension here. Could one reasonably have:
- key: "networkservice"
operator: "Eq"
values:
- "radio-network"
yes, key can be any attribute name advertised by device plugin.
@vikaschoudhary16 OK, so what entity has to understand that "radio-network" is of type string, instead of type network bandwidth, or other type?
This way we could separate the resource phy nic from the logical network "red", correct?
My only concern is that we might have a rather long list of properties for certain virtual/overlay network cases, where there are tens of thousands of networks.
But that might be so special that it's solved differently.
- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"
Also... "40GBPS" is not a simple number, how do you propose handling units wrt comparisons? What entity has to understand the units?
it will be similar to how millicore units and memory units are handled in existing code already.
@vikaschoudhary16 Right... but some entity has to understand the units. I imagine there are many many kinds of units one might support. What entity has to understand these units? Effectively we are implicitly introducing a 'type' here by adding units... where before in the Device Plugin API we had an int. I'm just curious how new 'types' get added... and how we handle collisions of unit abbreviations.
You are right, the scheduler or constraint solver will need to be aware of the units (any unit used) in order to be able to interpret the matchExpression.
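A small sketch of what "handled like millicore and memory units" could look like: attribute values written as Kubernetes resource.Quantity strings, which whatever does the matching would parse and compare. This is an assumption about the eventual convention, and it only solves parsing; the base unit of a key such as `speed` (bits vs. bytes per second) would still need to be a documented convention between the device plugin and the class author:

```
# Hypothetical: "40G" parses as a standard Kubernetes quantity (40 * 10^9)
# of whatever base unit the "speed" attribute is documented to carry.
- matchExpressions:
  - key: "speed"
    operator: "GtEq"
    values:
    - "40G"          # Quantity-style value instead of the ad hoc "40GBPS"
```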
I am trying to understand how the proposal fits scenarios with network resources (NICs), and I have some comments / questions below.
Thanks in advance for answers, and thanks for working on the proposal.
Nice proposal. Just my 2ct added.
spec:
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:
These attributes are the attributes exposed from the DP to kubelet?
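For illustration, a hypothetical sketch of the per-device attributes a device plugin might expose to the kubelet, which the ResourceClass selectors above would then match against. The ComputeResource schema is not quoted in this thread, so every field name below is an assumption; only the attribute keys (`Type`, `ECC`, `Nvlink`) come from examples elsewhere in the KEP:

```
apiVersion: resource.k8s.io/v1alpha1   # placeholder API group/version
kind: ComputeResource
metadata:
  name: node-1-nvidia-gpu-ecc          # illustrative name
spec:
  nodeName: node-1
  resourceName: "nvidia.com/gpu"
  capacity: 4
  # Attributes reported by the device plugin for this group of devices;
  # ResourceClass matchExpressions evaluate against these key/value pairs.
  properties:
    Type: "nvidia-tesla-p100"
    ECC: "true"
    Nvlink: "true"
```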
Possible fields we may consider to add later include:
- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments.
- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
- `ResourceRequestParameters`. This field can be used to indicate special resource request parameters that device plugins may need to perform special configurations on their devices to be consumed by workload pods requesting this resource.
This would be nice. Parameters to requests. But IIUIC it was explicitly excluded from this proposal, correct?
- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
- `ResourceRequestParameters`. This field can be used to indicate special resource request parameters that device plugins may need to perform special configurations on their devices to be consumed by workload pods requesting this resource.
Note we intentionally leave these fields out of the initial design to limit the scope
Ah yes, to answer myself.
- Another option is that Kubelet can evict the pods that are allocated with a non-existing ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
- To support a less disruptive model, upon resource property change, Kubelet can still export capacity at the old ComputeResource name for the devices used by active pods, and export capacity at the new matching ComputeResource name for devices not in use. Only when those pods finish running does that particular node finish its transition. This approach avoids multiple counting of resources and simplifies the scheduler resource accounting. One potential downside is that the transition may be quite a long process if there are long-running pods using the resource on the nodes. In that case, cluster admins can still drain the node at a convenient time to speed up the transition. Note that this approach does add certain code complexity to the Kubelet DeviceManager component.
We propose to start with the first option, i.e., device property change requires
+1
```
On the other hand, a cluster admin may want to allow pods requesting nvidia-p100 to use ecc p100 GPUs if they are idle, but rely on scheduler preemption to re-assign those devices to pods requesting nvidia-p100-ecc and with higher priority. Such use cases require scheduler support for matching a ComputeResource to multiple qualified ResourceClasses.
We feel this model
There are too many line breaks in the next three lines.
Is there also a KEP to track gRPC changes to better support additional device types (i.e. NICs)?
/kind kep
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub.
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I want to know the current status of this proposal; I did not find it in the k/enhancements repo. We have some requirements that are similar to this one. We are concerned about whether Kubernetes has a plan to implement device-based scheduling, instead of having the devicemanager randomly select devices for a Pod. We want to implement requirements such as scheduling based on GPU models.
@angao: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@angao this proposal is currently on hold. I am not aware of any plan to implement device-based scheduling in k8s.
@jiayingz thx. I wonder if we can try to implement device-specific allocation by modifying the
@angao there has been some effort on extending the device plugin API for topology-aware scheduling, but I am not sure whether this is something you are looking for.
@derekwaynecarr @vishh @dchen1107 @jiayingz