Partitionable model with a common shared shape #31
Conversation
//
// +listType=atomic
// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`
So these become shared attributes that all partitions share? There's no ability to define, e.g., a common set of attributes just for the 1g.5gb devices that exist at different memory slices?
yeah, those still get layered in on a per-partition basis
Sure, but I want to be able to define, e.g.:

deviceAttributeGroups:
- name: common-attributes
  items: ...
- name: common-mig-1g.5gb-attributes
  items: ...
- name: common-mig-2g.10gb-attributes
  items: ...

And then layer them in as appropriate.
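For illustration, the layering might then be wired up by reference, roughly like the sketch below (attributeGroupRefs is a purely hypothetical field name, not part of the proposal):

partitions:
- name: mig-1g.5gb-0
  attributeGroupRefs:   # hypothetical: pull in named groups by reference
  - common-attributes
  - common-mig-1g.5gb-attributes
- name: mig-2g.10gb-0
  attributeGroupRefs:
  - common-attributes
  - common-mig-2g.10gb-attributes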
In other words, partitions individually have attributes. So, in the "device" that is presented to the claim side of the model, you have shape.Attributes + partition.Attributes + device.Attributes
available to CEL.
I do NOT go so far as to define a "PartitionShape" which common-izes attributes from a particular type of Partition. That is possible but there are diminishing returns. Option 5 does effectively do that (and more).
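For illustration, a single claim expression could then reference attributes that originated at any of those levels, since they all appear as plain device attributes after merging. A sketch in the style of the claim example later in this thread (the attribute names here are hypothetical):

requests:
- name: one-mig-slice
  deviceClassName: gpu.nvidia.com
  # 'product' would come from the shape, 'profile' from the partition,
  # 'uuid' from the device; CEL sees them all in one flat namespace.
  expression: "device['product'] == 'A100' && device['profile'] == '1g.5gb'"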
This is a byproduct of nesting. If everything is top-level and referenceable by a name then it "just works" and is fully flexible.
The goal of the "shape" is to capture the shared capacity / shared-capacity-consumed model, as well as the common attributes, once, and re-use them across all devices. That's where the big size win comes in, not just with attributes but with the partition model. I am not clear on how you plan to capture both of those; it would be useful to see.
I tried top-level shared capacity groups in Option 2, and it actually made things LARGER. Maybe I did it wrong? I don't think I did. I think it's because the groups are per-GPU, so we list them over and over. Whereas with the nesting, we don't need to list them, because they are implicit in the structure.
Maybe we can do that implicit duplication with shared groups at the top level? I am not sure how we would know which groups to duplicate that way and which to treat as independent. Maybe a flag? We also need to capture the repeated "consumption" models, rather than repeating them on a per-device basis, to get the full effect.
I think that's much harder to understand, frankly, and the only thing I see us gaining is a theoretical ability to have an "independent" shared group that spans the slice rather than being per-device. That is something we could always add later at the slice level if we needed it. We have no need for it at this point, that I know of.
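To sketch the size argument (field names loosely invented here, not the actual Option 2 spelling): with top-level groups, each of the M GPUs still has to repeat the reference and the consumption wiring, so the slice grows with M, whereas nesting makes the group implicit in the shape and never repeats it.

sharedCapacityGroups:
- name: a100-memory-slices
  capacity: 8
devices:
- name: gpu-0
  sharedCapacityGroupRefs: [a100-memory-slices]   # repeated for every GPU
- name: gpu-1
  sharedCapacityGroupRefs: [a100-memory-slices]   # repeated for every GPU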
This is somewhat off topic, but something you said here made me think of this....
I haven't discussed it much, but MIG actually supports a second level of nesting. I think it could still be modeled with this Option 4 (the second level of nesting still pulls from the same pool of resources).
However, we want to limit selection of devices from this second level of nesting to a single claim (because they all share access to the same physical memory). Meaning if someone requests e.g. a 1c.3g.20gb device in two different claims, they MUST come from different parent MIGs.
i.e. the following, but with a guarantee that these devices are pulled from a MIG device that has not had other 2nd-level devices pulled from it yet.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: mig-devices-and-nics
spec:
  requests:
  - name: mig-0
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '1c.3g.20gb'"
  - name: mig-1
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '2c.3g.20gb'"
  requirements:
  - matchAttribute: parentMIG
We currently support this in our "classic" DRA driver (because we control the API through our CRD), but I don't see an easy way to support this restriction with structured parameters.
Any thoughts on this?
So, do these nested partitions really need to be visible in K8s? Are they offered to different containers? Could this be part of the config?
Since they are not consumable across claims, I don't see how K8s even knowing about them matters; the scheduler wouldn't have any flexibility but to give a whole MIG anyway. The only flexibility would be association with different containers. But maybe the driver can handle that for these (don't associate the MIG with any specific container).
//
// +listType=atomic
// +optional
SharedCapacityConsumed []SharedCapacity `json:"sharedCapacityConsumed,omitempty"`
Am I limited to the SharedCapacity from the DeviceShape where this partition is embedded? So, for example, if there were some "uncore" resources shared by all GPUs, we couldn't consume that now?
In the canonical example, the DeviceShape is a physical GPU, with a map of how it is partitionable. The Devices then just list the physical GPUs. I am not clear on what shared resources might live outside the GPU. Are you saying you need to capture shared resources that are not part of the GPU and somehow span GPUs?
Yes, that is what I'm saying. I don't necessarily see it for GPUs, but I'm sure there are other devices that will want the ability to do this.
Then we can add something later for that...
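For concreteness, such a later extension might hoist a capacity pool to the slice level, roughly like this sketch (all field names here are invented, not part of the proposal):

sliceSharedCapacity:
- name: uncore-bandwidth
  capacity: 100
devices:
- name: gpu-0
  sharedCapacityConsumed:
  - name: uncore-bandwidth   # draws from the slice-level pool, not the shape's
    capacity: 25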
const ResourcePoolMaxDevices = 128

type DeviceShape struct {
I'm still struggling to see how I share attributes across top level GPUs. Is that not supported in this model? Or do I need to put all of my top-level GPUs into one DeviceShape to take advantage of this?
There is exactly one shape for any slice. Check out the example YAML.
Cardinality:
- One slice has one shape. The slice effectively encapsulates a model of device. If you have a node with several models of GPU, you would have a slice for each model.
- One shape has many partitions. This is the "map" of how this shape device can be partitioned.
- One slice has many devices, all of the same, single shape. Thus, the only thing you need in device entries are the device names and any device-specific attributes.
What you cannot do, as this exists here, is have per-device, per-partition attributes. But otherwise it is as expressive as the fully denormalized model.
If we want to pack things in tighter, we could allow multiple shapes per slice, but I don't see any value in that.
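Putting the cardinality above together, a slice under this model might look roughly like the following sketch (field spellings follow this thread; the exact shapes and values are assumptions):

deviceShape:
  attributes:              # class attributes, shared by every device in the slice
  - name: product
    string: A100-SXM4-40GB
  sharedCapacity:
  - name: memory-slices
    capacity: 8
  partitions:              # the "map" of how each device can be carved up
  - name: mig-1g.5gb-0
    sharedCapacityConsumed:
    - name: memory-slices
      capacity: 1
devices:                   # instances: just names plus device-specific attributes
- name: gpu-0
  attributes:
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06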
I need each MIG device to have a reference to its parent GPU's UUID (so that they can be aligned on that). How does one do that with this model? Likewise for PCIe root, etc.
The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.
Now, if we need to have a separate UUID for the MIG device and for the parent device, it is a bit more tricky, because we may have name conflicts.
So, we may need a little more attribute manipulation magic here. And we may need a standard attribute to mean "partition". I prefer to see if we can get by with published conventions rather than manipulations. Another alternative is qualifying attributes by shape, partition, and device, but that seems very hard on the user.
The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.
The idea is that the scheduler sees an unrolled / flattened / projected version of this into a flat list of devices. So from the claim standpoint, it is equivalent to the current model.
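For example, with a shape that defines a "whole" partition plus one MIG partition, and two devices in the slice, the flattened view would present four allocatable entries, each carrying the merged shape + partition + device attributes (the names here assume the device-name-plus-partition-name concatenation discussed later in this thread):

# Flattened view as the scheduler would see it (illustrative only).
- name: gpu-0-whole
- name: gpu-0-mig-1g.5gb-0
- name: gpu-1-whole
- name: gpu-1-mig-1g.5gb-0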
//
// +listType=atomic
// +optional
Partitions []DevicePartition `json:"partitions,omitempty"`
In the "simple" case, when I just have a bunch of devices that are not partitionable, what does my resourceSlice look like? Do I need to still fill this in with just 1 device (which is actually not partitionable)?
Edit: Nevermind, I see now -- this is not a replacement for Device; it is something that defines how every device in this slice "looks", modulo its name and any custom attributes.
Exactly
- name: uuid
  string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
- bool: true
  name: mig-capable
Does this get inherited by the partitions in the shape? IOW, do the MIG partitions indicate "mig-capable"?
Do we need some attributes to be heritable and some not?
Yeah. That's an issue. Similar to the issue Kevin raised. We might need some attribute "inheritance" decorations which I don't love.
This is what I meant by "class" vs. "instance". I'll try to sketch out what I mean.
Here, I hacked up just a couple examples. The device shape is the class, and all attributes defined there are class attributes - they apply to ALL instances of that shape/partition. Attributes defined under devices are instance attributes. If we find that there are a ton of instance attributes, this whole model breaks down, but I think it's a safe bet that MOST attributes are class.
Doing this makes me ask - doesn't each partition of each device need a unique name, which is used from claim status? Or is that just the device name + the partition name? E.g. gpu-0/whole or gpu-7/mig-1g.5gb-0?
--- /home/thockin/Downloads/orig.yaml 2024-06-15 14:26:21.270853559 -0700
+++ /home/thockin/Downloads/th-example.yaml 2024-06-15 14:44:01.254635103 -0700
@@ -495,19 +495,77 @@
   - attributes:
     - name: uuid
       string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
-    - bool: true
-      name: mig-capable
-    - bool: false
-      name: mig-enabled
     name: gpu-0
+    partitions: ## NOTE: instantiates the partitions defined in the shape
+    - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+    ## NOTE: maybe just elide these if they are actually empty, or maybe
+    ## they all will actually have instancy things. Do they each need a
+    ## name, which is referenced from claim status?
+    - mig-1g.5gb-0: {}
+    - mig-1g.5gb-1: {}
+    - mig-1g.5gb-2: {}
+    - mig-1g.5gb-3: {}
+    - mig-1g.5gb-4: {}
+    - mig-1g.5gb-5: {}
+    - mig-1g.5gb-6: {}
+    - mig-2g.10gb-0: {}
+    - mig-2g.10gb-2: {}
+    - mig-2g.10gb-4: {}
+    - mig-3g.20gb-0: {}
+    - mig-3g.20gb-4: {}
+    - mig-4g.20gb-0: {}
+    - mig-7g.40gb-0: {}
+    - mig-1g.5gb-me-0: {}
+    - mig-1g.5gb-me-1: {}
+    - mig-1g.5gb-me-2: {}
+    - mig-1g.5gb-me-3: {}
+    - mig-1g.5gb-me-4: {}
+    - mig-1g.5gb-me-5: {}
+    - mig-1g.5gb-me-6: {}
+    - mig-1g.10gb-0: {}
+    - mig-1g.10gb-2: {}
+    - mig-1g.10gb-4: {}
+    - mig-1g.10gb-6: {}
   - attributes:
     - name: uuid
       string: GPU-f87f3708-c8cf-4418-933b-2cb12f1f7d7c
-    - bool: true
-      name: mig-capable
-    - bool: false
-      name: mig-enabled
     name: gpu-1
+    partitions: ## NOTE: instantiates the partitions defined in the shape
+    - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+    ## NOTE: maybe just elide these if they are actually empty, or maybe
+    ## they all will actually have instancy things. Do they each need a
+    ## name, which is referenced from claim status?
+    - mig-1g.5gb-0: {}
+    - mig-1g.5gb-1: {}
+    - mig-1g.5gb-2: {}
+    - mig-1g.5gb-3: {}
+    - mig-1g.5gb-4: {}
+    - mig-1g.5gb-5: {}
+    - mig-1g.5gb-6: {}
+    - mig-2g.10gb-0: {}
+    - mig-2g.10gb-2: {}
+    - mig-2g.10gb-4: {}
+    - mig-3g.20gb-0: {}
+    - mig-3g.20gb-4: {}
+    - mig-4g.20gb-0: {}
+    - mig-7g.40gb-0: {}
+    - mig-1g.5gb-me-0: {}
+    - mig-1g.5gb-me-1: {}
+    - mig-1g.5gb-me-2: {}
+    - mig-1g.5gb-me-3: {}
+    - mig-1g.5gb-me-4: {}
+    - mig-1g.5gb-me-5: {}
+    - mig-1g.5gb-me-6: {}
+    - mig-1g.10gb-0: {}
+    - mig-1g.10gb-2: {}
+    - mig-1g.10gb-4: {}
+    - mig-1g.10gb-6: {}
   - attributes:
     - name: uuid
       string: GPU-d197c034-8516-4dda-a2cd-67a46457a454
Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.
If there are N partitions in the shape and M devices in the slice, the scheduler will be presented with N * M "devices" for satisfying claims from the slice.
This API is just what we need to store.
Unless we need to set values on a per device, per partition basis, these instances are redundant.
Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.
As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?
Unless we need to set values on a per device, per partition basis, these instances are redundant
But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.
As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?
It would be gpu-0-whole. We could do "" maybe. The weird thing is that it is still a "partition". I wonder if we can just make "whole device is allocatable" a flag and just use the device name? A sort of "pseudo" partition? Not sure.
But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.
Yes, that could work. We could also include the "whole device" as something special (as alluded to above). Or do both of those things, they are not mutually exclusive.
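Combining the two ideas, the per-device section could shrink to a last-level override that lists only the non-empty instances, something like this sketch (partitionOverrides is a hypothetical field name for the elision idea discussed above):

devices:
- name: gpu-0
  attributes:
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
  partitionOverrides:      # hypothetical: only instances with instance attributes
  - name: whole
    attributes:
    - bool: true
      name: mig-capable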
I've pushed a rough outline of my thoughts on (yet another) way to do all of this and will start to refine it in the morning. I wanted to get it out before I go to bed though, in case any initial comments come in overnight that might influence how I push forward on it: #35
I'll try to take a look later tonight
I don't hate this (certainly smaller!) but I think it sort of conflates "classes" (in the OO sense) and instances. I wonder if there's a more explicit model that is 1 step less optimized but more specific...
I think DeviceShape is basically an OO class. I don't think we really conflate them. There is the shape (class) and the instances (devices). We do flatten that view for the scheduler. Though we don't have to - but I think it is easier for the scheduler and user if we do.
All of these conceptually come down to "what do we normalize so we don't repeat ourselves" and "how do we generate the flat view from the normalized view". There may be a few more variants on that theme.
name: mig-enabled
name: gpu-0
sharedCapacityConsumed:
deviceShape:
This imposes a subtle new requirement, right? All devices in a slice MUST be the same shape.
Yes it does. I made a similar comment somewhere as well.
Yes. Of course we could allow multiple shapes and reference them by name in the device list, but I don't see any real value in doing that over just having another slice.
Reworks #27 to instead have a common shape that contains the partitioning scheme, and then a list of devices. Common attributes can all be stored in the common shape, so this is much more compact.
This is Option 4 in #20.
Explaining a bit more here.
The claim side of things (the scheduler, for example) will see "devices" - options for allocation - with device names consisting of slice.devices X slice.shape.partitions. In other words, for each slice.devices[*], you have slice.shape.partitions available for allocation. This more-or-less models the real world. The slice.devices just contains the actual physical GPUs; the slice.shape.partitions contains the ways those partitionable GPUs can be consumed.
In the "degenerate" case of no partitions, your "shape" amounts to a pool of attributes that are common across all the devices.
You can set attributes at each of the levels: shape, partition, and device. We need to decide on the order of precedence in the event that the same attribute is defined at more than one level. I would suggest the reverse of the order I just listed (device attributes overwrite partition attributes, which overwrite shape attributes).
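A sketch of that precedence, assuming the merge happens when the flat view is generated (the mig-capable attribute is taken from this thread; the layout is illustrative, not the final API):

deviceShape:
  attributes:
  - bool: true             # shape default: every instance would be mig-capable...
    name: mig-capable
  partitions:
  - name: mig-1g.5gb-0
    attributes:
    - bool: false          # ...but the partition overrides it, and a device-level
      name: mig-capable    # attribute of the same name would override both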