# Resource Classes Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Resource Class](#resource-class)
7. [API Changes](#api-changes)
8. [Scheduler Changes](#scheduler-changes)
9. [Kubelet Changes](#kubelet-changes)
10. [Opaque Integer Resources](#opaque-integer-resources)
11. [Future Scope](#future-scope)

_Authors:_

* @vikaschoudhary16 - Vikas Choudhary <vichoudh@redhat.com>
* @aveshagarwal - Avesh Agarwal <avagarwa@redhat.com>

## Abstract
This document describes *resource classes*, a new model for representing
compute resources in Kubernetes. It should be read as a successor to the
[device plugin proposal](https://github.com/kubernetes/community/pull/695/files)
and depends on it.

## Motivation
Compute resources in Kubernetes are represented as a key-value map, where the key
is a string and the value is a 'Quantity', which can optionally be fractional.
This model works well for simple compute resources such as CPU and memory, but it
requires an identity mapping between available resources and requested resources.
Because CPU and memory are available across all Kubernetes deployments, the
current user-facing API (the pod specification) remains portable. However, the
current model cannot support new resources like GPUs, ASICs, NICs, local storage,
etc., which can potentially require a non-identity mapping between available and
requested resources, and which require additional metadata about each resource to
support heterogeneity and management at scale.

_GPU Integration Example:_
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want a compute resource type that can be created with meaningful and
  portable names. Such a compute resource can also hold additional metadata that
  justifies its name, for example:
  * `nvidia.gpu.high.mem` is the name and the metadata is memory greater than 'X' GB.
  * `fast.nic` is the name and the associated metadata is bandwidth greater than
    'B' Gbps.
* If I request the resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
  device with memory greater than or equal to 'X' GB should be able to satisfy
  the request, independent of other device capabilities such as 'version' or
  'nvlink locality'.
* Similarly, if I request the resource `fast.nic`, any NIC device with a speed
  greater than 'B' Gbps should be able to meet the request.
* I want a rich metadata selection interface where operators like 'Less Than',
  'Greater Than' and 'In' are supported on the compute resource metadata.

## Objectives

1. Define and add support in the API for a new type, *Resource Class*.
2. Add support for *Resource Class* in the scheduler.

## Non Objectives
1. Discovery, advertisement, and allocation/deallocation of devices are expected
   to be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).

## Resource Class
*Resource Class* is a new type whose objects provide an abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices through `resourceSelector`, a list of
selector terms, each of which holds `matchExpressions`, a list of (key, operator,
values) requirements. A device is selected if at least one of the selector terms
matches it; within a term, all the `matchExpressions` requirements are ANDed
together.

YAML example 1:
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nvidia-gpu"
    - key: "memory"
      operator: "Gt"
      values:
      - "30G"
```
The resource class above selects all 'nvidia-gpu' devices with memory greater
than 30 GB.

YAML example 2:
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "huge-pages"
    - key: "size"
      operator: "Gt"
      values:
      - "1G"
```
The resource class above selects all 'huge-pages' devices with a size greater
than 1 GB.

YAML example 3:
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nic"
    - key: "speed"
      operator: "In"
      values:
      - "40GBPS"
```
The resource class above selects all 'nic' devices whose speed is 40 Gbps.

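Because `resourceSelector` is a list, a single resource class can also combine
several selector terms, where matching any one term selects the device (as
described above). The example below is an illustrative sketch only: the
'amd-gpu' kind is an assumption introduced here, not taken from the device
plugin proposal. It selects any device that is either an 'nvidia-gpu' or an
'amd-gpu' with more than 30 GB of memory:
```yaml
kind: ResourceClass
metadata:
  name: gpu.high.mem
spec:
  resourceSelector:
  # First term: nvidia GPUs with more than 30 GB of memory.
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nvidia-gpu"
    - key: "memory"
      operator: "Gt"
      values:
      - "30G"
  # Second term (ORed with the first): amd GPUs with more than 30 GB of memory.
  # The 'amd-gpu' kind is hypothetical and used only for illustration.
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "amd-gpu"
    - key: "memory"
      operator: "Gt"
      values:
      - "30G"
```
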
## API Changes
### ResourceClass
Internal representation of *Resource Class*:
```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
    metav1.TypeMeta
    metav1.ObjectMeta
    // Spec defines the resources required
    Spec ResourceClassSpec
    // +optional
    Status ResourceClassStatus
}

// ResourceClassSpec describes how the resource class selects resources
type ResourceClassSpec struct {
    // ResourceSelector selects resources/devices
    ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
    // A list of resource/device selector requirements; the requirements are ANDed
    MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values
type ResourceSelectorRequirement struct {
    // The property key that the selector applies to
    // +patchMergeKey=key
    // +patchStrategy=merge
    Key string
    // +optional
    Values []string
    // Operator relates the key and the values
    Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
    ResourceSelectorOpIn           ResourceSelectorOperator = "In"
    ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
    ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
    ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
    ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
    ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
)
```
### ResourceClassStatus
```golang
type ResourceClassStatus struct {
    // Total amount of this resource class available in the cluster
    Allocatable resource.Quantity
    // Amount of this resource class currently requested by pods
    Request resource.Quantity
}
```
ResourceClass status is updated by the scheduler on:
1. Creation of a new *Resource Class* object.
2. Addition of a node to the cluster.
3. Removal of a node from the cluster.
4. Creation of a pod that requests a resource class.
5. Deletion of a pod that was consuming a resource class.

`ResourceClassStatus` serves the following two purposes:
* The scheduler uses it while evaluating predicates during pod scheduling. For
  details, please refer to the sections below.
* Users can view the current usage and availability of a resource class using
  kubectl, as in the sketch below.

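As an illustration only (the `apiVersion` value and the exact status field
layout are assumptions of this sketch, not defined by this proposal), fetching
a resource class with `kubectl get resourceclass nvidia.high.mem -o yaml` might
return something like:
```yaml
# Hypothetical output sketch; apiVersion and the status layout are assumptions.
apiVersion: v1
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nvidia-gpu"
    - key: "memory"
      operator: "Gt"
      values:
      - "30G"
status:
  allocatable: "8"   # devices in the cluster selectable by this class
  request: "3"       # devices currently requested by pods through this class
```
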
### User story
The cluster administrator deploys device plugins to support the hardware present
in the cluster. The device plugins, running on the nodes, update node status to
indicate the presence of this hardware. To offer this hardware to applications
deployed on Kubernetes in a portable way, the administrator creates a number of
resource classes representing that hardware. These resource classes include
metadata about the devices as selection criteria.

1. A user submits a pod spec requesting 'X' resource classes (see the example
   pod spec below).
2. The scheduler filters out the nodes which do not match the resource requests.
3. The scheduler selects a device for each requested resource class and annotates
   the pod object with the device selection information.
4. Kubelet reads the device request from the pod annotation and calls `Allocate`
   on the matching device plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet reads the pod object annotations for the devices consumed and calls
   `Deallocate` on the matching device plugins.

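For illustration, a minimal pod spec for step 1 might request a resource class
by name, just like any other compute resource. This is a sketch: the proposal
does not pin down the exact request syntax, the class name is taken from
example 1 above, and the image name is only a placeholder.
```yaml
# Sketch only: assumes resource classes are requested by name under the
# container's resource requests/limits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda        # placeholder image
    resources:
      requests:
        nvidia.high.mem: 1    # one device selectable by the 'nvidia.high.mem' class
      limits:
        nvidia.high.mem: 1
```
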
In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.

### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore, to know the availability of a
device, it would have to calculate the current total consumption by iterating
over all the admitted pods running on the node. This is already done today while
running predicates for each new incoming pod at kubelet. Even if we assume that
the scheduler cache and the consumption state created at runtime for each pod are
exactly the same, the current API interfaces do not allow passing the selected
device to the container manager (from where the device plugin will actually be
invoked). This problem occurs because devices are determined indirectly, through
resource classes, while other resource requests can be read from the pod object
directly.
To summarize, device selection at the kubelet could be done in one of the
following two ways:
* Select the device at pod admission while applying predicates, and change all
  the API interfaces required to pass the selected device to the container
  runtime manager.
* Recreate the resource consumption state at the container manager and select
  the device there.

Neither of the above approaches seems cleaner than doing device selection at the
scheduler, which helps retain clean API interfaces between packages.

## Scheduler Changes
The scheduler already listens for changes to node and pod objects and maintains
their state in its cache. We will enhance this logic:
1. To listen for user-created *Resource Class* objects and maintain their state
   in the cache.
2. To look for device-related details in node objects and maintain accounting
   for devices as well.

From the events perspective, handling for the following events will be added or
updated:

### Resource Class Creation
1. Initialize and add the resource class info to the local cache.
2. Iterate over all existing nodes in the cache to figure out whether there are
   devices on these nodes which are selectable by the resource class. If found,
   update the resource class availability status in the local cache.
3. Patch the status of the resource class API object with the availability state
   from the local cache.

### Resource Class Deletion
Delete the resource class info from the cache.

### Node Addition
The scheduler already caches `NodeInfo`. Now it additionally updates device state:
1. Check in the node status whether any devices are present.
2. For each device found, iterate over all existing resource classes in the cache
   to find the resource classes which can select this particular device. For all
   such resource classes, update the availability state in the local cache.
3. Patch the ResourceClass API object's status, `ResourceClassStatus`, with the
   new 'Allocatable' value.

### Node Deletion
If the node has devices which are selectable by existing resource classes:
1. Adjust the resource class state in the local cache.
2. Update the resource class status by patching the API object.

### Pod Creation
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates against the requested quantity and the
   resource class state present in the cache.
3. On the selected node, select a device from the device info stored in the cache
   by matching the key/value requirements of the requested resource class.
4. After device selection, update (increase) 'Request' in the cache for all the
   resource classes which can select this device.
5. Patch the resource class objects with the new 'Request' in `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with a selected-device annotation prefixed with
   'scheduler.alpha.kubernetes.io/resClass' (see the sketch after this list).

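The proposal does not define the exact annotation format. Purely as an
illustrative sketch, the patched pod might carry an annotation like the one
below; only the 'scheduler.alpha.kubernetes.io/resClass' prefix comes from this
proposal, while the key suffix (the class name) and the value (a device
identifier on the selected node) are assumptions.
```yaml
# Hypothetical annotation sketch; key suffix and value format are assumptions.
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/resClass.nvidia.high.mem: "nvidia-gpu-0"
```
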
### Pod Delete
1. Iterate over all the devices on the node the pod was scheduled to and find the
   devices being used by the pod.
2. For each device consumed by the pod, update in the cache the availability
   state of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.

## Kubelet Changes
Update the logic in the container runtime manager to look for device annotations
prefixed with 'scheduler.alpha.kubernetes.io/resClass' and to call the matching
device plugins.

## Opaque Integer Resources
This API will supersede [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIR will need to be adapted to update node status with devices
instead. See the comparison sketch below.

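For comparison, here is an illustrative sketch of the same request expressed
with the documented OIR naming scheme and with a resource class name. The
resource class request syntax is an assumption carried over from the pod
example earlier in this document.
```yaml
resources:
  requests:
    # OIR style (current): an opaque name carrying the special prefix.
    pod.alpha.kubernetes.io/opaque-int-resource-nvidia-gpu-high-mem: 1
    # Resource class style (proposed; request syntax assumed as in the pod sketch above).
    nvidia.high.mem: 1
```
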
## Future Scope
* RBAC: It can further be explored how to tie resource classes into RBAC like
  any other existing API resource object.
* Nested Resource Classes: In the future, device plugins and resource classes
  can be extended to support nested resource classes, where one resource class
  is composed of a group of sub-resource classes. For example, a 'numa-node'
  resource class could be composed of 'single-core' sub-resource classes.