Commit a3d30c3

Add Resource Class proposal
Signed-off-by: vikaschoudhary16 <vichoudh@redhat.com>
1 parent 3ef47e3 commit a3d30c3

1 file changed

+333
-0
lines changed

@@ -0,0 +1,333 @@
1+
# Resource Classes Proposal
2+
3+
1. [Abstract](#abstract)
4+
2. [Motivation](#motivation)
5+
3. [Use Cases](#use-cases)
6+
4. [Objectives](#objectives)
7+
5. [Non Objectives](#non-objectives)
8+
6. [Resource Class](#resource-class)
9+
7. [API Changes](#api-changes)
10+
8. [Scheduler Changes](#scheduler-changes)
11+
9. [Kubelet Changes](#kubelet-changes)
12+
10. [Opaque Integer Resources](#opaque-integer-resources)
13+
11. [Future Scope](#future-scope)
14+
15+
_Authors:_
16+
17+
* @vikaschoudhary16 - Vikas Choudhary &lt;vichoudh@redhat.com&gt;
18+
* @aveshagarwal - Avesh Agarwal &lt;avagarwa@redhat.com&gt;
19+
20+
## Abstract
21+
This document describes *resource classes*, a new model for representing
compute resources in Kubernetes. It should be read as a successor to the
[device plugin proposal](https://github.com/kubernetes/community/pull/695/files)
and depends on it.
25+
26+
## Motivation
27+
Compute resources in Kubernetes are represented as a key-value map, with the key
being a string and the value being a 'Quantity' which can (optionally) be
fractional. The current model works well for simple compute resources like CPU
or memory, but it requires an identity mapping between available resources and
requested resources. Since 'CPU' and 'Memory' are resources that are available
across all Kubernetes deployments, the current user-facing API (Pod
Specification) remains portable. However, the current model cannot support new
resources like GPUs, ASICs, NICs, local storage, etc., which can potentially
require a non-identity mapping between available and requested resources, and
which require additional metadata about each resource to support heterogeneity
and management at scale.
38+
39+
_GPU Integration Example:_
40+
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
41+
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)
42+
43+
_Kubernetes Meeting Notes On This:_
44+
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
45+
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
46+
* [Extensible support for hardware devices in Kubernetes (join kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)
47+
48+
## Use Cases
49+
50+
* I want a compute resource type that can be created with a meaningful and
portable name. This compute resource can also hold additional metadata that
justifies its name, for example:
  * `nvidia.gpu.high.mem` is the name and the metadata is memory greater than 'X' GB.
  * `fast.nic` is the name and the associated metadata is bandwidth greater than
  'B' Gbps.
* If I request a resource `nvidia.gpu.high.mem` for my pod, any 'nvidia-gpu'
type device which has memory greater than or equal to 'X' GB should be able
to satisfy this request, independent of other device capabilities such as
'version' or 'nvlink locality'.
* Similarly, if I request a resource `fast.nic`, any NIC device with speed
greater than 'B' Gbps should be able to meet the request.
* I want a rich metadata selection interface where operators like ‘Less Than’,
‘Greater Than’ and ‘In’ are supported on the compute resource metadata.
64+
65+
## Objectives
66+
67+
1. Define and add support in the API for a new type, *Resource Class*.
68+
2. Add support for *Resource Class* in the scheduler.
69+
70+
## Non Objectives
71+
1. Discovery, advertisement, and allocation/deallocation of devices are expected
to be addressed by the [device plugin proposal](https://github.com/kubernetes/community/pull/695/files).
73+
74+
## Resource Class
75+
*Resource Class* is a new type whose objects provide an abstraction over
[Devices](https://github.com/RenaudWasTaken/community/blob/a7762d8fa80b9a805dbaa7deb510e95128905148/contributors/design-proposals/device-plugin.md#resourcetype).
A *Resource Class* object selects devices using `matchExpressions`, a list of
(operator, key, value) requirements. A *Resource Class* object selects a device
if at least one of its `matchExpressions` lists matches the device details.
Within a single `matchExpressions` list, all the (operator, key, value)
requirements are ANDed together to evaluate the result.
81+
82+
YAML example 1:
83+
```yaml
kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nvidia-gpu"
        - key: "memory"
          operator: "Gt"
          values:
            - "30G"
```
101+
The resource class above selects all nvidia-gpu devices that have more than
30 GB of memory.
103+
104+
YAML example 2:
105+
```yaml
kind: ResourceClass
metadata:
  name: hugepages-1gig
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "huge-pages"
        - key: "size"
          operator: "Gt"
          values:
            - "1G"
```
123+
The resource class above selects all the hugepages with size greater than
1 GB.
125+
126+
YAML example 3:
127+
```yaml
kind: ResourceClass
metadata:
  name: fast.nic
spec:
  resourceSelector:
    - matchExpressions:
        - key: "Kind"
          operator: "In"
          values:
            - "nic"
        - key: "speed"
          operator: "In"
          values:
            - "40GBPS"
```
145+
The resource class above selects all the NICs with a speed of 40 GBPS.
147+
148+
149+
## API Changes
150+
### ResourceClass
151+
152+
Internal representation of *Resource Class*:
153+
154+
```golang
// +nonNamespaced=true
// +genclient=true

type ResourceClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// Spec defines resources required
	Spec ResourceClassSpec
	// +optional
	Status ResourceClassStatus
}

// Spec defines resources required
type ResourceClassSpec struct {
	// Resource Selector selects resources
	ResourceSelector []ResourcePropertySelector
}

// A null or empty selector matches no resources
type ResourcePropertySelector struct {
	// A list of resource/device selector requirements. The requirements are ANDed.
	MatchExpressions []ResourceSelectorRequirement
}

// A resource selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values
type ResourceSelectorRequirement struct {
	// The label key that the selector applies to
	// +patchMergeKey=key
	// +patchStrategy=merge
	Key string
	// +optional
	Values []string
	// Operator relates the key to the values
	Operator ResourceSelectorOperator
}

type ResourceSelectorOperator string

const (
	ResourceSelectorOpIn           ResourceSelectorOperator = "In"
	ResourceSelectorOpNotIn        ResourceSelectorOperator = "NotIn"
	ResourceSelectorOpExists       ResourceSelectorOperator = "Exists"
	ResourceSelectorOpDoesNotExist ResourceSelectorOperator = "DoesNotExist"
	ResourceSelectorOpGt           ResourceSelectorOperator = "Gt"
	ResourceSelectorOpLt           ResourceSelectorOperator = "Lt"
)
```
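
The selection rule described earlier (a device is selected if any one `matchExpressions` list matches, and every requirement inside a list must hold) can be sketched as follows. This is a minimal illustration, not the proposed implementation: `Requirement`, `Device`, `parseGiga` and `Matches` are hypothetical names, devices are assumed to be flat string maps, and `Gt`/`Lt` are assumed to compare whole numbers with a `G` suffix. A real implementation would more likely parse values as Kubernetes quantities.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Operator names mirror the ResourceSelectorOperator constants.
const (
	OpIn           = "In"
	OpNotIn        = "NotIn"
	OpExists       = "Exists"
	OpDoesNotExist = "DoesNotExist"
	OpGt           = "Gt"
	OpLt           = "Lt"
)

// Requirement is a simplified stand-in for ResourceSelectorRequirement.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// Device is a hypothetical device record: attribute name to value.
type Device map[string]string

// parseGiga is a hypothetical helper that reads "30G" as 30.
func parseGiga(s string) (int64, bool) {
	n, err := strconv.ParseInt(strings.TrimSuffix(s, "G"), 10, 64)
	return n, err == nil
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// matchRequirement evaluates one (operator, key, value) requirement.
func matchRequirement(d Device, r Requirement) bool {
	got, present := d[r.Key]
	switch r.Operator {
	case OpExists:
		return present
	case OpDoesNotExist:
		return !present
	case OpIn:
		return present && contains(r.Values, got)
	case OpNotIn:
		// Assumption: a missing key satisfies NotIn, as with label selectors.
		return !present || !contains(r.Values, got)
	case OpGt, OpLt:
		if !present || len(r.Values) != 1 {
			return false
		}
		have, ok1 := parseGiga(got)
		want, ok2 := parseGiga(r.Values[0])
		if !ok1 || !ok2 {
			return false
		}
		if r.Operator == OpGt {
			return have > want
		}
		return have < want
	}
	return false
}

// Matches ORs the selector lists and ANDs the requirements inside each list.
// An empty list matches nothing.
func Matches(d Device, selectors [][]Requirement) bool {
	for _, reqs := range selectors {
		if len(reqs) == 0 {
			continue
		}
		all := true
		for _, r := range reqs {
			if !matchRequirement(d, r) {
				all = false
				break
			}
		}
		if all {
			return true
		}
	}
	return false
}

func main() {
	highMem := [][]Requirement{{
		{Key: "Kind", Operator: OpIn, Values: []string{"nvidia-gpu"}},
		{Key: "memory", Operator: OpGt, Values: []string{"30G"}},
	}}
	fmt.Println(Matches(Device{"Kind": "nvidia-gpu", "memory": "32G"}, highMem)) // prints: true
	fmt.Println(Matches(Device{"Kind": "nvidia-gpu", "memory": "16G"}, highMem)) // prints: false
}
```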
201+
### ResourceClassStatus
202+
```golang
type ResourceClassStatus struct {
	Allocatable resource.Quantity
	Request     resource.Quantity
}
```
208+
`ResourceClassStatus` is updated by the scheduler on:
1. New *Resource Class* object creation.
2. Node addition to the cluster.
3. Node removal from the cluster.
4. Pod creation, if the pod requests a resource class.
5. Pod deletion, if the pod was consuming a resource class.

`ResourceClassStatus` serves the following two purposes:
* Evaluation of scheduler predicates during pod creation. For details, refer to
the Scheduler Changes section.
* Users can view the current usage/availability details of the resource class
using kubectl.
220+
221+
### User story
222+
The administrator has deployed device plugins to support hardware present in the
223+
cluster. Device plugins, running on nodes, will update node status indicating
224+
the presence of this hardware. To offer this hardware to applications deployed
225+
on kubernetes in a portable way, the administrator creates a number of resource
226+
classes to represent that hardware. These resource classes will include metadata
227+
about the devices as selection criteria.
228+
229+
1. A user submits a pod spec requesting 'X' resource classes.
2. The scheduler filters out the nodes which do not match the resource requests.
3. The scheduler selects a device for each resource class requested and annotates
the pod object with the device selection info.
4. Kubelet reads the device request from the pod annotation and calls `Allocate` on
the matching Device Plugins.
5. The user deletes the pod or the pod terminates.
6. Kubelet reads the pod object annotation for the devices consumed and calls
`Deallocate` on the matching Device Plugins.
238+
239+
In addition to node selection, the scheduler is also responsible for selecting a
240+
device that matches the resource class requested by the user.
241+
242+
### Reason for not preferring device selection at kubelet
243+
Kubelet does not maintain any cache. Therefore, to know the availability of a
device, it would have to calculate the current total consumption by iterating
over all the admitted pods running on the node. This is already done today when
running predicates for each new incoming pod at the kubelet. Even if we assume
that the scheduler cache and the consumption state created at runtime for each
pod are exactly the same, the current API interfaces do not allow passing the
selected device to the container manager (from where the device plugin will
actually be invoked). This problem occurs because devices are determined
internally from resource classes, while other resource requests can be
determined directly from the pod object.

To summarize, device selection at the kubelet can be done in one of the
following two ways:
* Select the device at pod admission while applying predicates, and change all
the API interfaces required to pass the selected device to the container runtime
manager.
* Create the resource consumption state again at the container manager and
select the device there.

Neither of the above approaches seems cleaner than doing device selection at the
scheduler, which helps retain cleaner API interfaces between packages.
260+
261+
## Scheduler Changes
262+
The scheduler already watches node and pod objects and maintains their state in
its cache. We will enhance this logic:
1. To watch user-created *Resource Class* objects and maintain their state in the cache.
2. To look for device-related details in node objects and maintain accounting for
devices as well.
267+
268+
From the events perspective, handling for the following events will be added/updated:
269+
270+
### Resource Class Creation
271+
1. Initialize the resource class info and add it to the local cache.
2. Iterate over all existing nodes in the cache to figure out if there are
devices on these nodes which are selectable by the resource class. If found,
update the resource class availability status in the local cache.
3. Patch the status of the resource class API object with the availability state
in the local cache.
277+
278+
### Resource Class Deletion
279+
Delete the resource class info from the cache.
280+
281+
### Node Addition
282+
The scheduler already caches `NodeInfo`. Now it additionally updates device state:
1. Check in the node status if any devices are present.
2. For each device found, iterate over all existing resource classes in the cache
to find resource classes which can select this particular device. For all
such resource classes, update the availability state in the local cache.
3. The ResourceClass API object's status, `ResourceClassStatus`, will be patched
with the new “allocatable” value.
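
The node addition steps above can be sketched roughly as follows. `Device`, `ClassCache` and `AddNode` are hypothetical names used only for illustration, each resource class is reduced to a matcher function, and patching the API object is left to the caller, which receives the names of the classes whose allocatable count changed.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is a hypothetical, simplified device record read from node status.
type Device struct {
	ID   string
	Kind string
}

// ClassCache is a hypothetical stand-in for the scheduler's local cache of
// resource classes: one matcher and one allocatable count per class name.
type ClassCache struct {
	Selects     map[string]func(Device) bool
	Allocatable map[string]int
}

// AddNode accounts for the devices of a newly added node. It returns the
// sorted names of the classes whose allocatable count changed, which are the
// classes whose status would need to be patched.
func (c *ClassCache) AddNode(devices []Device) []string {
	changed := map[string]bool{}
	for _, d := range devices {
		for name, selects := range c.Selects {
			if selects(d) {
				c.Allocatable[name]++
				changed[name] = true
			}
		}
	}
	names := make([]string, 0, len(changed))
	for name := range changed {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	cache := &ClassCache{
		Selects: map[string]func(Device) bool{
			"nvidia.gpu": func(d Device) bool { return d.Kind == "nvidia-gpu" },
			"fast.nic":   func(d Device) bool { return d.Kind == "nic" },
		},
		Allocatable: map[string]int{},
	}
	changed := cache.AddNode([]Device{
		{ID: "gpu0", Kind: "nvidia-gpu"},
		{ID: "gpu1", Kind: "nvidia-gpu"},
	})
	fmt.Println(changed, cache.Allocatable["nvidia.gpu"]) // prints: [nvidia.gpu] 2
}
```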
289+
290+
### Node Deletion
291+
If node has devices which are selectable by existing resource classes:
292+
1. Adjust resource class state in local cache.
293+
2. Update resource class status by patching api object.
294+
295+
### Pod Creation
296+
1. Get the requested resource class name and quantity from the pod spec.
2. Select nodes by applying predicates according to the requested quantity and
the resource class's state present in the cache.
3. On the selected node, select a device from the device info stored in the cache
after matching key and value from the requested resource class.
4. After device selection, update (decrease) 'Requested' in the cache for all the
resource classes which could select this device.
5. Patch the resource class objects with the new 'Requested' in the `ResourceClassStatus`.
6. Add the pod reference to the local DeviceToPod mapping structure in the cache.
7. Patch the pod object with the selected device annotation, with prefix 'scheduler.alpha.kubernetes.io/resClass'.
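
Steps 3 and 6 above, picking a free device on the chosen node and recording it in the DeviceToPod mapping, might look something like the sketch below. `NodeDevices` and `Assign` are hypothetical names, the resource class is reduced to a matcher function, and the "first free match" policy is an assumption. The proposal does not specify how to choose among several matching devices.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, simplified device record.
type Device struct {
	ID   string
	Kind string
}

// NodeDevices is a hypothetical per-node slice of the scheduler cache.
type NodeDevices struct {
	Devices     []Device
	DeviceToPod map[string]string // device ID -> name of the pod using it
}

// ErrNoDevice is returned when no free device matches the resource class.
var ErrNoDevice = errors.New("no free matching device")

// Assign picks the first free device accepted by selects, records the pod in
// the DeviceToPod mapping, and returns the chosen device ID.
func (n *NodeDevices) Assign(pod string, selects func(Device) bool) (string, error) {
	for _, d := range n.Devices {
		if _, busy := n.DeviceToPod[d.ID]; busy {
			continue
		}
		if selects(d) {
			n.DeviceToPod[d.ID] = pod
			return d.ID, nil
		}
	}
	return "", ErrNoDevice
}

func main() {
	node := &NodeDevices{
		Devices:     []Device{{ID: "gpu0", Kind: "nvidia-gpu"}, {ID: "nic0", Kind: "nic"}},
		DeviceToPod: map[string]string{},
	}
	isGPU := func(d Device) bool { return d.Kind == "nvidia-gpu" }

	id, err := node.Assign("pod-a", isGPU)
	fmt.Println(id, err) // prints: gpu0 <nil>

	_, err = node.Assign("pod-b", isGPU)
	fmt.Println(err) // prints: no free matching device
}
```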
306+
307+
### Pod Delete
308+
1. Iterate over all the devices on the node to which the pod was scheduled and
find the devices being used by the pod.
2. For each device consumed by the pod, update in the cache the availability
state of the resource classes which can select this device.
3. Patch `ResourceClassStatus` with the new availability state.
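
As a rough illustration of step 1, the devices held by a deleted pod can be found and freed by walking a device-to-pod mapping. `ReleasePod` is a hypothetical helper and the mapping is assumed to be a plain map from device ID to pod name. Updating the resource class availability for each freed device is left to the caller.

```go
package main

import (
	"fmt"
	"sort"
)

// ReleasePod removes every device-to-pod entry owned by pod and returns the
// freed device IDs in sorted order.
func ReleasePod(deviceToPod map[string]string, pod string) []string {
	freed := []string{}
	for id, owner := range deviceToPod {
		if owner == pod {
			// Deleting the current key while ranging over a map is safe in Go.
			delete(deviceToPod, id)
			freed = append(freed, id)
		}
	}
	sort.Strings(freed)
	return freed
}

func main() {
	m := map[string]string{"gpu0": "pod-a", "gpu1": "pod-b", "nic0": "pod-a"}
	fmt.Println(ReleasePod(m, "pod-a"), len(m)) // prints: [gpu0 nic0] 1
}
```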
313+
314+
## Kubelet Changes
315+
Update the logic in the container runtime manager to look for device annotations
prefixed by 'scheduler.alpha.kubernetes.io/resClass' and call the matching device
plugins.
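
A minimal sketch of the annotation lookup is shown below. Only the prefix comes from this proposal. The key layout (prefix, a `-`, then the resource class name) and the value (a single device ID) are assumptions made for illustration, and `DevicesFromAnnotations` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// annotationPrefix is the prefix named in this proposal.
const annotationPrefix = "scheduler.alpha.kubernetes.io/resClass"

// DevicesFromAnnotations returns resource class name -> device ID for every
// annotation carrying the prefix. Assumed key layout: "<prefix>-<class name>".
func DevicesFromAnnotations(annotations map[string]string) map[string]string {
	out := map[string]string{}
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		class := strings.TrimPrefix(strings.TrimPrefix(key, annotationPrefix), "-")
		out[class] = value
	}
	return out
}

func main() {
	devices := DevicesFromAnnotations(map[string]string{
		"scheduler.alpha.kubernetes.io/resClass-fast.nic": "nic0",
		"example.com/unrelated":                           "x",
	})
	fmt.Println(devices) // prints: map[fast.nic:nic0]
}
```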
318+
319+
## Opaque Integer Resources
320+
This API will supersede [Opaque Integer Resources](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature)
(OIR). External agents can continue to attach additional 'opaque' resources to
nodes, but the special naming scheme that is part of the current OIR approach
will no longer be necessary. Any existing resource discovery tool which updates
node objects with OIR will adapt to update node status with devices instead.
325+
326+
327+
## Future Scope
328+
* RBAC: How to tie resource classes to RBAC, like any other existing API
resource object, can be explored further.
* Nested Resource Classes: In the future, device plugins and resource classes can
be extended to support nested resource classes, where one resource class is
composed of a group of sub-resource classes. For example, a 'numa-node'
resource class could be composed of 'single-core' sub-resource classes.
