
Partitionable model with a common shared shape #31

Closed

Conversation

@johnbelamaric (Contributor) commented Jun 14, 2024

Reworks #27 to instead have a common shape that contains the partitioning scheme, and then a list of devices. Common attributes can all be stored in the common shape, so this is much more compact.

This is option 4 in #20

Explaining a bit more here.

The claim side of things (the scheduler, for example) will see "devices" as options for allocation, with device names formed from the cross product slice.devices X slice.shape.partitions. In other words, for each slice.devices[*], you have slice.shape.partitions available for allocation. This more-or-less models the real world. The slice.devices list just contains the actual physical GPUs; slice.shape.partitions contains the ways those partitionable GPUs can be consumed.

In the "degenerate" case of no partitions, your "shape" amounts to a pool of common attributes that are common across all the devices.

You can set attributes at each of the levels: shape, partition, device. We need to decide on the order of precedence in the event that the same attribute is defined at more than one level. I would suggest the reverse of the order I just listed (device attributes override partition attributes, which override shape attributes); the sketch below illustrates this layering.
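
To make the layering concrete, here is a rough hypothetical sketch (field names are illustrative, not the final API). Shape attributes apply to every device, partition attributes refine them, and device attributes override both:

deviceShape:
  attributes:              # lowest precedence: shared by all devices in the slice
  - name: vendor
    string: example.com
  partitions:              # the ways each device can be consumed
  - name: whole
  - name: part-0
    attributes:            # middle precedence: per-partition
    - name: memory
      string: 20Gi
devices:                   # the actual physical devices
- name: gpu-0
  attributes:              # highest precedence: per-device
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06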

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly June 14, 2024 17:12
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 14, 2024
//
// +listType=atomic
// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`
Contributor

So these become shared attributes that all partitions share? There's no ability to define, e.g., a common set of attributes just for the 1g.5gb devices that exist at different memory slices.

Contributor Author

yeah, those still get layered in on a per-partition basis

Contributor

Sure, but I want to be able to define e.g.

deviceAttributeGroups:
  - name: common-attributes
    items: ...
  - name: common-mig-1g.5gb-attributes
    items: ...
  - name: common-mig-2g.10gb-attributes
    items: ...

And then layer them in as appropriate.

Contributor Author

In other words, partitions individually have attributes. So, in the "device" that is presented to the claim side of the model, you have shape.Attributes + partition.Attributes + device.Attributes available to CEL.

I do NOT go so far as to define a "PartitionShape" which common-izes attributes from a particular type of Partition. That is possible but there are diminishing returns. Option 5 does effectively do that (and more).
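
For example (attribute names hypothetical), a claim expression can then select on any of the three layers at once, since they are merged into a single device view:

expression: "device['mig-capable'] == true && device['profile'] == '1g.5gb'"

Here 'mig-capable' might come from the device or shape level and 'profile' from the partition level; the claim author doesn't need to know which.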

Contributor

This is a byproduct of nesting. If everything is top-level and referenceable by a name then it "just works" and is fully flexible.

Contributor Author

The goal of the "shape" is to capture the shared capacity / shared-capacity-consumed model, as well as the common attributes, once, and re-use them across all devices. That's where the big size win comes in: not just with attributes but with the partition model. I am not clear on how you plan to capture both of those; it would be useful to see.

Contributor Author

I tried top-level shared capacity groups in Option 2, it actually made things LARGER. Maybe I did it wrong? I don't think I did. I think because they are groups per-GPU, we list them over and over. Whereas with the nesting, we don't need to list them, because they are implicit in the structure.

Maybe we can do that implicit duplication with shared groups at the top level? I am not sure how we would know which groups to duplicate implicitly and which to treat as independent. Maybe a flag? We also need to capture the repeated "consumption" models, rather than repeating them on a per-device basis, to get the full effect.

I think that's much harder to understand, frankly, and the only thing I see us gaining is a theoretical ability to have an "independent" shared group that spans the slice rather than being per-device. That is something we could always add later at the slice level, if we needed it. We have no need for it at this point, that I know of.

@klueska (Contributor) commented Jun 15, 2024

This is somewhat off topic, but something you said here made me think of this....

I haven't discussed it much, but MIG actually supports a second level of nesting. I think it could still be modeled with this Option 4 (the second level of nesting still pulls from the same pool of resources).

However, we want to limit selection of devices from this second level of nesting to a single claim (because they all share access to the same physical memory). Meaning if someone requests e.g. a 1c.3g.20gb device in two different claims, they MUST come from different parent MIGs.

i.e. the following, but guarantee that these devices are pulled from a MIG device that has not had other 2nd-level devices pulled from it yet.

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: mig-devices-and-nics
spec:
  requests:
  - name: mig-0
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '1c.3g.20gb'"
    
  - name: mig-1
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '2c.3g.20gb'"
    
  requirements:
  - matchAttribute: parentMIG

We currently support this in our "classic" DRA driver (because we control the API through our CRD), but I don't see an easy way to support this restriction with structured parameters.

Contributor

Any thoughts on this?

@johnbelamaric (Contributor Author) commented Jun 15, 2024

So, do these nested partitions really need to be visible in K8s? Are they offered to different containers? Could this be part of the config?

Since they are not consumable across claims, I don't see how K8s even knowing about them matters; the scheduler wouldn't have any flexibility but to give a whole MIG anyway. The only flexibility would be association with different containers. But maybe the driver can handle that for these (don't associate the MIG with any specific container).

//
// +listType=atomic
// +optional
SharedCapacityConsumed []SharedCapacity `json:"sharedCapacityConsumed,omitempty"`
Contributor

Am I limited to the SharedCapacity from the DeviceShape where this partition is embedded? So, for example, if there were some "uncore" resources shared by all GPUs we couldn't consume that now?

Contributor Author

In the canonical example, the DeviceShape is a physical GPU, with a map of how it is partitionable. The Devices then just list the physical GPUs. I am not clear on what shared resources might live outside the GPU. Are you saying you need to capture shared resources that are not part of the GPU and somehow span GPUs?

Contributor

Yes, that is what I'm saying. I don't necessarily see it for GPUs, but I'm sure there are other devices that will want the ability to do this.

Contributor Author

Then we can add something later for that...

const ResourcePoolMaxDevices = 128

type DeviceShape struct {
@klueska (Contributor) commented Jun 15, 2024

I'm still struggling to see how I share attributes across top level GPUs. Is that not supported in this model? Or do I need to put all of my top-level GPUs into one DeviceShape to take advantage of this?

@johnbelamaric (Contributor Author) commented Jun 15, 2024

There is exactly one shape for any slice. Check out the example YAML.

Cardinality:

  • One slice has one shape. The slice effectively encapsulates a model of device. If you have a node with several models of GPU, you would have a slice for each model.
  • One shape has many partitions. This is the "map" of how this shape device can be partitioned.
  • One slice has many devices, all of the same, single shape. Thus, the only thing you need in device entries are the device names and any device-specific attributes.

What you cannot do, as this exists here, is have per-device, per-partition attributes. But otherwise it is as expressive as the fully denormalized model.

If we want to pack things in tighter, we could allow multiple shapes per slice, but I don't see any value in that.

Contributor

I need each MIG device to have a reference to its parent GPU's UUID (so that they can be aligned on that). How does one do that with this model? Likewise for the PCIe root, etc.

Contributor Author

The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

Now, if we need to have a separate UUID for the MIG device and for the parent device, it is a bit more tricky, because we may have name conflicts.

So, we may need a little more attribute manipulation magic here. And we may need a standard attribute to mean "partition". I prefer to see if we can get by with published conventions rather than manipulations. Another alternative is qualifying attributes by shape, partition, and device, but that seems very hard on the user.

Contributor Author

> The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

The idea is that the scheduler sees an unrolled / flattened / projected version of this into a flat list of devices. So from the claim standpoint, it is equivalent to the current model.
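
As a hypothetical illustration of that projection (nothing here is stored anywhere; names follow the device-name plus partition-name convention discussed further down), the scheduler-facing view might look like:

devices:
- name: gpu-0-whole              # device "gpu-0" x partition "whole"
  attributes:                    # shape + partition + device attributes, merged
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
  - name: mig-capable
    bool: true
- name: gpu-0-mig-1g.5gb-0       # device "gpu-0" x partition "mig-1g.5gb-0"
  attributes:
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06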

//
// +listType=atomic
// +optional
Partitions []DevicePartition `json:"partitions,omitempty"`
Contributor

In the "simple" case, when I just have a bunch of devices that are not partitionable, what does my resourceSlice look like? Do I need to still fill this in with just 1 device (which is actually not partitionable)?

Contributor

Edit: Nevermind, I see now -- this is not a replacement for Device, it is something that defines how every device in this slice "looks", modulo its name and any custom attributes.

Contributor Author

Exactly
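
So in the simple case the slice might reduce to something like this (hypothetical sketch; whether the whole device needs an explicit partition entry is discussed further down):

deviceShape:
  attributes:          # the pool of attributes common to every device
  - name: vendor
    string: example.com
devices:
- name: dev-0
- name: dev-1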

- name: uuid
string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
- bool: true
name: mig-capable
@thockin commented Jun 15, 2024

Does this get inherited by the partitions in the shape? IOW, do the MIG partitions indicate "mig-capable"?

Do we need some attributes to be heritable and some not?

Contributor Author

Yeah. That's an issue. Similar to the issue Kevin raised. We might need some attribute "inheritance" decorations which I don't love.

@thockin

This is what I meant by "class" vs. "instance". I'll try to sketch out what I mean.

@thockin commented Jun 15, 2024

Here, I hacked up just a couple examples. The device shape is the class, and all attributes defined there are class attributes - they apply to ALL instances of that shape/partition. Attributes defined under devices are instance attributes. If we find that there are a ton of instance attributes, this whole model breaks down, but I think it's a safe bet that MOST attributes are class.

Doing this makes me ask - doesn't each partition of each device need a unique name, which is used from claim status? Or is that just the device name + the partition name? E.g. gpu-0/whole or gpu-7/mig-1g.5gb-0

--- /home/thockin/Downloads/orig.yaml	2024-06-15 14:26:21.270853559 -0700
+++ /home/thockin/Downloads/th-example.yaml	2024-06-15 14:44:01.254635103 -0700
@@ -495,19 +495,77 @@
     - attributes:
       - name: uuid
         string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
-      - bool: true
-        name: mig-capable
-      - bool: false
-        name: mig-enabled
       name: gpu-0
+      partitions: ## NOTE: instantiates the partitions defined in the shape
+      - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+      ## NOTE: maybe just elide these if they are actually empty, or maybe
+      ## they all will actually have instancy things.  Do they each need a
+      ## name, which is referenced from claim status?
+      - mig-1g.5gb-0: {}
+      - mig-1g.5gb-1: {}
+      - mig-1g.5gb-2: {}
+      - mig-1g.5gb-3: {}
+      - mig-1g.5gb-4: {}
+      - mig-1g.5gb-5: {}
+      - mig-1g.5gb-6: {}
+      - mig-2g.10gb-0: {}
+      - mig-2g.10gb-2: {}
+      - mig-2g.10gb-4: {}
+      - mig-3g.20gb-0: {}
+      - mig-3g.20gb-4: {}
+      - mig-4g.20gb-0: {}
+      - mig-7g.40gb-0: {}
+      - mig-1g.5gb-me-0: {}
+      - mig-1g.5gb-me-1: {}
+      - mig-1g.5gb-me-2: {}
+      - mig-1g.5gb-me-3: {}
+      - mig-1g.5gb-me-4: {}
+      - mig-1g.5gb-me-5: {}
+      - mig-1g.5gb-me-6: {}
+      - mig-1g.10gb-0: {}
+      - mig-1g.10gb-2: {}
+      - mig-1g.10gb-4: {}
+      - mig-1g.10gb-6: {}
     - attributes:
       - name: uuid
         string: GPU-f87f3708-c8cf-4418-933b-2cb12f1f7d7c
-      - bool: true
-        name: mig-capable
-      - bool: false
-        name: mig-enabled
       name: gpu-1
+      partitions: ## NOTE: instantiates the partitions defined in the shape
+      - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+      ## NOTE: maybe just elide these if they are actually empty, or maybe
+      ## they all will actually have instancy things.  Do they each need a
+      ## name, which is referenced from claim status?
+      - mig-1g.5gb-0: {}
+      - mig-1g.5gb-1: {}
+      - mig-1g.5gb-2: {}
+      - mig-1g.5gb-3: {}
+      - mig-1g.5gb-4: {}
+      - mig-1g.5gb-5: {}
+      - mig-1g.5gb-6: {}
+      - mig-2g.10gb-0: {}
+      - mig-2g.10gb-2: {}
+      - mig-2g.10gb-4: {}
+      - mig-3g.20gb-0: {}
+      - mig-3g.20gb-4: {}
+      - mig-4g.20gb-0: {}
+      - mig-7g.40gb-0: {}
+      - mig-1g.5gb-me-0: {}
+      - mig-1g.5gb-me-1: {}
+      - mig-1g.5gb-me-2: {}
+      - mig-1g.5gb-me-3: {}
+      - mig-1g.5gb-me-4: {}
+      - mig-1g.5gb-me-5: {}
+      - mig-1g.5gb-me-6: {}
+      - mig-1g.10gb-0: {}
+      - mig-1g.10gb-2: {}
+      - mig-1g.10gb-4: {}
+      - mig-1g.10gb-6: {}
     - attributes:
       - name: uuid
         string: GPU-d197c034-8516-4dda-a2cd-67a46457a454

Contributor Author

Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.

If there are N partitions in shape, and M devices in the slice, the scheduler will be presented with N * M "devices" for satisfying claims from the slice.

This API is just what we need to store.
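
For example (counts hypothetical): if the shape defines 26 partitions, as in the diff above, and a slice lists 8 devices, the scheduler is presented with 26 × 8 = 208 allocatable "devices", while the slice itself stores only 26 partition entries plus 8 device entries.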

Contributor Author

Unless we need to set values on a per device, per partition basis, these instances are redundant.

@thockin

> Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.

As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?

> Unless we need to set values on a per device, per partition basis, these instances are redundant

But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.

Contributor Author

> As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?

See https://github.com/kubernetes-sigs/wg-device-management/pull/31/files#diff-a53461b1c5a20f7a5b62cecae4b45445b5c874d8d192093ade65eb9cc06e3ec1R112

It would be gpu-0-whole. We could do "" maybe. The weird thing is that it is still a "partition". I wonder if we can just make "whole device is allocatable" a flag and just use the device name? A sort of "pseudo" partition? Not sure.

> But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.

Yes, that could work. We could also include the "whole device" as something special (as alluded to above). Or do both of those things, they are not mutually exclusive.
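
A rough sketch of how those two ideas might combine (fields are purely hypothetical, not part of this PR): the shape flags the whole device as allocatable under the device's own name, and device entries carry attributes or partition overrides only when an instance actually needs them:

deviceShape:
  wholeDeviceAllocatable: true   # hypothetical flag: "gpu-0" itself is a valid allocation
  partitions:
  - name: mig-1g.5gb-0
devices:
- name: gpu-0
  attributes:                    # instance attributes for the whole device only
  - name: mig-capable
    bool: true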

Contributor

I've pushed a rough outline of my thoughts on (yet another) way to do all of this and will start to refine it in the morning. I wanted to get it out before I go to bed though, in case any initial comments come in overnight that might influence how I push forward on it: #35

Contributor Author

I'll try to take a look later tonight

@thockin left a comment

I don't hate this (certainly smaller!) but I think it sort of conflates "classes" (in the OO sense) and instances. I wonder if there's a more explicit model that is 1 step less optimized but more specific...

@johnbelamaric (Contributor Author)

I think DeviceShape is basically an OO class. I don't think we really conflate them. There is the shape (class) and the instances (devices). We do flatten that view for the scheduler. Though we don't have to - but I think it is easier for the scheduler and user if we do.

@johnbelamaric (Contributor Author)

All of these conceptually come down to "what do we normalize so we don't repeat ourselves" and "how do we generate the flat view from the normalized view". There may be a few more variants on that theme.

name: mig-enabled
name: gpu-0
sharedCapacityConsumed:
deviceShape:
@thockin

This imposes a subtle new requirement, right? All devices in a slice MUST be the same shape.

Contributor

Yes it does. I made a similar comment somewhere as well.

Contributor Author

Yes. Of course we could allow multiple shapes and reference them by name in the device list, but I don't see any real value in doing that over just having another slice.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
>
> You can:
>
>   • Reopen this PR with /reopen
>   • Mark this PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
