# Resource Classes Proposal

1. [Abstract](#abstract)
2. [Motivation](#motivation)
3. [Use Cases](#use-cases)
4. [Objectives](#objectives)
5. [Non Objectives](#non-objectives)
6. [Resource Class](#resource-class)
7. [API Changes](#api-changes)
8. [Scheduler Changes](#scheduler-changes)
9. [Kubelet Changes](#kubelet-changes)
10. [Opaque Integer Resources](#opaque-integer-resources)
11. [Future Scope](#future-scope)

_Authors:_

* @vikaschoudhary16 - Vikas Choudhary <vichoudh@redhat.com>
* @aveshagarwal - Avesh Agarwal <avagarwa@redhat.com>

## Abstract
This document describes *resource classes*, a new model for representing
compute resources in Kubernetes. It should be seen as a successor to the
[device plugin proposal](https://github.com/kubernetes/community/pull/695/files)
and depends on it.

## Motivation
Compute resources in Kubernetes are represented as a key-value map, with the key
being a string and the value being a `Quantity`, which can (optionally) be
fractional. This model works well for simple compute resources like CPU and
memory, but it requires an identity mapping between available resources and
requested resources. Since CPU and memory are available across all Kubernetes
deployments, the current user-facing API (the pod specification) remains
portable. However, the current model cannot support newer resources such as
GPUs, ASICs, NICs, and local storage, which can require a non-identity mapping
between available and requested resources, as well as additional metadata about
each resource to support heterogeneity and management at scale.

_GPU Integration Example:_
 * [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
 * [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
 * [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
 * [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
 * [Extensible support for hardware devices in Kubernetes (join kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

 * I want a compute resource type which can be created with meaningful and
   portable names. This compute resource can also hold additional metadata that
   justifies its name, for example:
   * `nvidia.gpu.high.mem` is the name, and the metadata is memory greater than 'X' GB.
   * `fast.nic` is the name, and the associated metadata is bandwidth greater
     than 'B' Gbps.
 * If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
   device with memory greater than or equal to 'X' GB should be able to satisfy
   the request, independent of other device capabilities such as 'version' or
   'nvlink locality'.
 * Similarly, if I request a resource `fast.nic`, any NIC with speed greater
   than 'B' Gbps should be able to meet the request.
 * I want a rich metadata selection interface where operators such as 'Less Than',
   'Greater Than', and 'In' are supported on the compute resource metadata.

## Objectives

1. Define and add support in the API for a new type, *Resource Class*.
2. Add support for *Resource Class* in the scheduler.

## Non Objectives
1. Discovery, advertisement, and allocation/deallocation of devices are expected
   to be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).

## Resource Class
*Resource Class* is a new type whose objects provide an abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(key, operator, values) requirements. A *Resource Class* object selects a device
if at least one of its selector terms matches the device details. Within a
single `matchExpressions` list, all the (key, operator, values) requirements are
ANDed together to evaluate the result.

YAML example 1:
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nvidia-gpu"
    - key: "memory"
      operator: "Gt"
      values:
      - "30G"
```
The above resource class selects all nvidia-gpu devices with memory greater
than 30 GB.

YAML example 2:
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "huge-pages"
    - key: "size"
      operator: "Gt"
      values:
      - "1G"
```
The above resource class selects all huge-pages devices with size greater than
1 GB.

YAML example 3:
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
  - matchExpressions:
    - key: "Kind"
      operator: "In"
      values:
      - "nic"
    - key: "speed"
      operator: "In"
      values:
      - "40GBPS"
```
The above resource class selects all NICs whose advertised speed is "40GBPS"
(the `In` operator matches exact values; `Gt` would select anything faster).

## API Changes
### ResourceClass

Internal representation of *Resource Class*:

```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
    metav1.TypeMeta
    metav1.ObjectMeta
    // Spec defines the devices this class selects
    Spec ResourceClassSpec
    // +optional
    Status ResourceClassStatus
}

// ResourceClassSpec defines the devices this class selects
type ResourceClassSpec struct {
    // ResourceSelector is a list of selector terms; a device is selected
    // if at least one term matches it
    ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
    // A list of resource/device selector requirements; the requirements are ANDed
    MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains values, a key,
// and an operator that relates the key and values
type ResourceSelectorRequirement struct {
    // The label key that the selector applies to
    // +patchMergeKey=key
    // +patchStrategy=merge
    Key string
    // +optional
    Values []string
    // Represents the key's relationship to the set of values
    Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
    ResourceSelectorOpIn           ResourceSelectorOperator = "In"
    ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
    ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
    ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
    ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
    ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
)
```
### ResourceClassStatus
```golang
type ResourceClassStatus struct {
    // Total quantity of matching devices in the cluster
    Allocatable resource.Quantity
    // Quantity currently requested by pods
    Request resource.Quantity
}
```
The resource class status is updated by the scheduler on:
1. Creation of a new *Resource Class* object.
2. Addition of a node to the cluster.
3. Removal of a node from the cluster.
4. Creation of a pod that requests a resource class.
5. Deletion of a pod that was consuming a resource class.

`ResourceClassStatus` serves the following two purposes:
* Scheduler predicate evaluation during pod creation. For details, see the
  following sections.
* Users can view the current usage/availability details of a resource class
  using kubectl.

### User story
The administrator has deployed device plugins to support the hardware present in
the cluster. Device plugins, running on nodes, update the node status to
indicate the presence of this hardware. To offer this hardware to applications
deployed on Kubernetes in a portable way, the administrator creates a number of
resource classes to represent that hardware. These resource classes include
metadata about the devices as selection criteria.

1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters out the nodes which do not match the resource requests.
3. The scheduler selects a device for each resource class requested and
   annotates the pod object with the device selection info.
4. Kubelet reads the device request from the pod annotation and calls `Allocate`
   on the matching device plugins.
5. The user deletes the pod, or the pod terminates.
6. Kubelet reads the pod object annotation for devices consumed and calls
   `Deallocate` on the matching device plugins.

In addition to node selection, the scheduler is also responsible for selecting a
device that matches the resource class requested by the user.
| 241 | + |
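As an illustration of step 1, a pod requesting one device from the `nvidia.high.mem` class defined earlier might look like the following. The use of the existing resource requests field to carry resource class names is an assumption of this sketch; the proposal does not finalize the pod spec syntax:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
spec:
  containers:
  - name: cuda
    image: nvidia/cuda
    resources:
      requests:
        nvidia.high.mem: 1
```
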
### Reason for not preferring device selection at kubelet
Kubelet does not maintain any cache. Therefore, to know the availability of a
device, it would have to calculate the current total consumption by iterating
over all the admitted pods running on the node. This is already done today while
running predicates for each new incoming pod at kubelet. Even if we assume that
the scheduler cache and the consumption state created at runtime for each pod
are exactly the same, the current API interfaces do not allow passing the
selected device to the container manager (from where the device plugin will
actually be invoked). This problem occurs because devices are determined
internally from resource classes, while other resource requests can be
determined from the pod object directly.
To summarize, device selection at the kubelet could be done in one of the
following two ways:
* Select the device at pod admission while applying predicates, and change all
  API interfaces required to pass the selected device to the container runtime
  manager.
* Recreate the resource consumption state at the container manager and select
  the device there.

Neither approach is cleaner than doing device selection at the scheduler, which
helps retain clean API interfaces between packages.

## Scheduler Changes
The scheduler already listens for changes in node and pod objects and maintains
state in its cache. We will enhance this logic:
1. To listen for user-created *Resource Class* objects and maintain their state
   in the cache.
2. To look for device-related details in node objects and maintain accounting
   for devices as well.

From the events perspective, handling for the following events will be added or
updated:

### Resource Class Creation
1. Initialize and add the resource class info into the local cache.
2. Iterate over all existing nodes in the cache to find devices that are
   selectable by the resource class. If any are found, update the resource
   class availability status in the local cache.
3. Patch the status of the resource class API object with the availability
   state in the local cache.

### Resource Class Deletion
Delete the resource class info from the cache.

### Node Addition
The scheduler already caches `NodeInfo`. Now it additionally updates device state:
1. Check the node status for any devices present.
2. For each device found, iterate over all existing resource classes in the
   cache to find the resource classes which can select this particular device.
   For all such resource classes, update the availability state in the local
   cache.
3. The ResourceClass API object's status, `ResourceClassStatus`, will be patched
   with the new "allocatable" value.

### Node Deletion
If the node has devices which are selectable by existing resource classes:
1. Adjust the resource class state in the local cache.
2. Update the resource class status by patching the API object.

### Pod Creation
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates according to the requested quantity and
   the resource class state present in the cache.
3. On the selected node, select a device from the device info stored in the
   cache by matching it against the requested resource class's requirements.
4. After device selection, update (increase) 'Request' in the cache for all the
   resource classes which could select this device.
5. Patch the resource class objects with the new 'Request' in the `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with a selected-device annotation with the prefix
   'scheduler.alpha.kubernetes.io/resClass'.

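Step 7 might produce an annotation of the following shape. The key suffix and the device identifier shown here are hypothetical; only the 'scheduler.alpha.kubernetes.io/resClass' prefix comes from the proposal:

```yaml
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/resClass-nvidia.high.mem: "nvidia-gpu-0"
```
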
### Pod Delete
1. Iterate over all the devices on the node to which the pod was scheduled and
   find the devices being used by the pod.
2. For each device consumed by the pod, update in the cache the availability
   state of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.

## Kubelet Changes
Update the logic at the container runtime manager to look for device annotations
prefixed by 'scheduler.alpha.kubernetes.io/resClass' and call the matching
device plugins.

## Opaque Integer Resources
This API will supersede [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIR can adapt to update the node status with devices instead.

## Future Scope
* RBAC: How to tie resource classes into RBAC, like other existing API resource
  objects, can be explored further.
* Nested Resource Classes: In the future, device plugins and resource classes
  can be extended to support nested resource classes, where one resource class
  is composed of a group of sub-resource classes. For example, a 'numa-node'
  resource class could be composed of 'single-core' sub-resource classes.