
Partitionable model with a common shared shape #31

Closed

Conversation

@johnbelamaric (Contributor) commented Jun 14, 2024

Reworks #27 to instead have a common shape that contains the partitioning scheme, and then a list of devices. Common attributes can all be stored in the common shape, so this is much more compact.

This is option 4 in #20

Explaining a bit more here.

The claim side of things (the scheduler, for example) will see "devices" as options for allocation, with device names formed from the cross product slice.devices X slice.shape.partitions. In other words, for each slice.devices[*], you have slice.shape.partitions available for allocation. This more-or-less models the real world. The slice.devices list just contains the actual physical GPUs; slice.shape.partitions contains the ways those partitionable GPUs can be consumed.

In the "degenerate" case of no partitions, your "shape" amounts to a pool of common attributes that are common across all the devices.

You can set attributes at each of the levels: shape, partition, device. We need to decide on the order of precedence in the event that the same attribute is defined at more than one level. I would suggest the reverse of the order I just listed (device attributes override partition attributes, which override shape attributes); the sketch below illustrates this layering.
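
To make the layering concrete, here is a rough hypothetical sketch (field names are illustrative, not the final API). Shape attributes apply to every device, partition attributes refine them, and device attributes override both:

deviceShape:
  attributes:              # lowest precedence: shared by all devices in the slice
  - name: vendor
    string: example.com
  partitions:              # the ways each device can be consumed
  - name: whole
  - name: part-0
    attributes:            # middle precedence: per-partition
    - name: memory
      string: 20Gi
devices:                   # the actual physical devices
- name: gpu-0
  attributes:              # highest precedence: per-device
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06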

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly June 14, 2024 17:12
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 14, 2024
//
// +listType=atomic
// +optional
Attributes []DeviceAttribute `json:"attributes,omitempty" protobuf:"bytes,3,opt,name=attributes"`
Contributor

So these become shared attributes that all partitions share? There's no ability to define, e.g., a common set of attributes just for the 1g.5gb devices that exist at different memory slices.

Contributor Author

yeah, those still get layered in on a per-partition basis

Contributor

Sure, but I want to be able to define e.g.

deviceAttributeGroups:
  - name: common-attributes
    items: ...
  - name: common-mig-1g.5gb-attributes
    items: ...
  - name: common-mig-2g.10gb-attributes
    items: ...

And then layer them in as appropriate.

Contributor Author

In other words, partitions individually have attributes. So, in the "device" that is presented to the claim side of the model, you have shape.Attributes + partition.Attributes + device.Attributes available to CEL.

I do NOT go so far as to define a "PartitionShape" which common-izes attributes from a particular type of Partition. That is possible but there are diminishing returns. Option 5 does effectively do that (and more).
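
For example (attribute names hypothetical), a claim expression can then select on any of the three layers at once, since they are merged into a single device view:

expression: "device['mig-capable'] == true && device['profile'] == '1g.5gb'"

Here 'mig-capable' might come from the device or shape level and 'profile' from the partition level; the claim author doesn't need to know which.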

Contributor

This is a byproduct of nesting. If everything is top-level and referenceable by a name then it "just works" and is fully flexible.

Contributor Author

The goal of the "shape" is to capture the shared capacity / shared-capacity-consumed model, as well as the common attributes, once, and re-use them across all devices. That's where the big size win comes in: not just with attributes but with the partition model. I am not clear on how you plan to capture both of those; it would be useful to see.

Contributor Author

I tried top-level shared capacity groups in Option 2, it actually made things LARGER. Maybe I did it wrong? I don't think I did. I think because they are groups per-GPU, we list them over and over. Whereas with the nesting, we don't need to list them, because they are implicit in the structure.

Maybe we can do that implicit duplication with shared groups at the top level? I am not sure how we would know which groups to duplicate implicitly and which to treat as independent. Maybe a flag? We also need to capture the repeated "consumption" models, rather than repeating them on a per-device basis, to get the full effect.

I think that's much harder to understand, frankly, and the only thing I see us gaining is a theoretical ability to have an "independent" shared group that spans the slice rather than being per-device. That is something we could always add later at the slice level, if we needed it. We have no need for it at this point, that I know of.

@klueska (Contributor) commented Jun 15, 2024

This is somewhat off topic, but something you said here made me think of this....

I haven't discussed it much, but MIG actually supports a second level of nesting. I think it could still be modeled with this Option 4 (the second level of nesting still pulls from the same pool of resources).

However, we want to limit selection of devices from this second level of nesting to a single claim (because they all share access to the same physical memory). Meaning if someone requests e.g. a 1c.3g.20gb device in two different claims, they MUST come from different parent MIGs.

i.e. the following, but guarantee that these devices are pulled from a MIG device that has not had other 2nd-level devices pulled from it yet.

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: mig-devices-and-nics
spec:
  requests:
  - name: mig-0
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '1c.3g.20gb'"
    
  - name: mig-1
    deviceClassName: gpu.nvidia.com
    expression: "device['type'] == 'MIG' && device['profile'] == '2c.3g.20gb'"
    
  requirements:
  - matchAttribute: parentMIG

We currently support this in our "classic" DRA driver (because we control the API through our CRD), but I don't see an easy way to support this restriction with structured parameters.

Contributor

Any thoughts on this?

@johnbelamaric (Contributor Author) commented Jun 15, 2024

So, do these nested partitions really need to be visible in K8s? Are they offered to different containers? Could this be part of the config?

Since they are not consumable across claims, I don't see how K8s even knowing about them matters; the scheduler wouldn't have any flexibility but to give a whole MIG anyway. The only flexibility would be association with different containers. But maybe the driver can handle that for these (don't associate the MIG with any specific container).

//
// +listType=atomic
// +optional
SharedCapacityConsumed []SharedCapacity `json:"sharedCapacityConsumed,omitempty"`
Contributor

Am I limited to the SharedCapacity from the DeviceShape where this partition is embedded? So, for example, if there were some "uncore" resources shared by all GPUs we couldn't consume that now?

Contributor Author

In the canonical example, the DeviceShape is a physical GPU, with a map of how it is partitionable. The Devices then just list the physical GPUs. I am not clear on what shared resources might live outside the GPU. Are you saying you need to capture shared resources that are not part of the GPU and somehow span GPUs?

Contributor

Yes, that is what I'm saying. I don't necessarily see it for GPUs, but I'm sure there are other devices that will want the ability to do this.

Contributor Author

Then we can add something later for that...

const ResourcePoolMaxDevices = 128

type DeviceShape struct {
@klueska (Contributor) commented Jun 15, 2024

I'm still struggling to see how I share attributes across top level GPUs. Is that not supported in this model? Or do I need to put all of my top-level GPUs into one DeviceShape to take advantage of this?

@johnbelamaric (Contributor Author) commented Jun 15, 2024

There is exactly one shape for any slice. Check out the example YAML.

Cardinality:

  • One slice has one shape. The slice effectively encapsulates a model of device. If you have a node with several models of GPU, you would have a slice for each model.
  • One shape has many partitions. This is the "map" of how this shape device can be partitioned.
  • One slice has many devices, all of the same, single shape. Thus, the only thing you need in device entries are the device names and any device-specific attributes.

What you cannot do, as this exists here, is have per-device, per-partition attributes. But otherwise it is as expressive as the fully denormalized model.

If we want to pack things in tighter, we could allow multiple shapes per slice, but I don't see any value in that.

Contributor

I need each MIG device to have a reference to its parent GPU's UUID (so that they can be aligned on that). How does one do that with this model? Likewise for the PCIe root, etc.

Contributor Author

The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

Now, if we need to have a separate UUID for the MIG device and for the parent device, it is a bit more tricky, because we may have name conflicts.

So, we may need a little more attribute manipulation magic here. And we may need a standard attribute to mean "partition". I prefer to see if we can get by with published conventions rather than manipulations. Another alternative is qualifying attributes by shape, partition, and device, but that seems very hard on the user.

Contributor Author

> The device as seen by the scheduler will have all the attributes of the shape, the partition, and the device. So the device UUID will be there for every MIG.

The idea is that the scheduler sees an unrolled / flattened / projected version of this into a flat list of devices. So from the claim standpoint, it is equivalent to the current model.
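
As a hypothetical illustration of that projection (nothing here is stored anywhere; names follow the device-name plus partition-name convention discussed further down), the scheduler-facing view might look like:

devices:
- name: gpu-0-whole              # device "gpu-0" x partition "whole"
  attributes:                    # shape + partition + device attributes, merged
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
  - name: mig-capable
    bool: true
- name: gpu-0-mig-1g.5gb-0       # device "gpu-0" x partition "mig-1g.5gb-0"
  attributes:
  - name: uuid
    string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06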

//
// +listType=atomic
// +optional
Partitions []DevicePartition `json:"partitions,omitempty"`
Contributor

In the "simple" case, when I just have a bunch of devices that are not partitionable, what does my resourceSlice look like? Do I need to still fill this in with just 1 device (which is actually not partitionable)?

Contributor

Edit: Nevermind, I see now -- this is not a replacement for Device, it is something that defines how every device in this slice "looks", modulo its name and any custom attributes.

Contributor Author

Exactly
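
So in the simple case the slice might reduce to something like this (hypothetical sketch; whether the whole device needs an explicit partition entry is discussed further down):

deviceShape:
  attributes:          # the pool of attributes common to every device
  - name: vendor
    string: example.com
devices:
- name: dev-0
- name: dev-1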

- name: uuid
string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
- bool: true
name: mig-capable
@thockin commented Jun 15, 2024

Does this get inherited by the partitions in the shape? IOW, do the MIG partitions indicate "mig-capable"?

Do we need some attributes to be heritable and some not?

Contributor Author

Yeah. That's an issue. Similar to the issue Kevin raised. We might need some attribute "inheritance" decorations which I don't love.

@thockin

This is what I meant by "class" vs. "instance". I'll try to sketch out what I mean.

@thockin commented Jun 15, 2024

Here, I hacked up just a couple examples. The device shape is the class, and all attributes defined there are class attributes - they apply to ALL instances of that shape/partition. Attributes defined under devices are instance attributes. If we find that there are a ton of instance attributes, this whole model breaks down, but I think it's a safe bet that MOST attributes are class.

Doing this makes me ask - doesn't each partition of each device need a unique name, which is used from claim status? Or is that just the device name + the partition name? E.g. gpu-0/whole or gpu-7/mig-1g.5gb-0

--- /home/thockin/Downloads/orig.yaml	2024-06-15 14:26:21.270853559 -0700
+++ /home/thockin/Downloads/th-example.yaml	2024-06-15 14:44:01.254635103 -0700
@@ -495,19 +495,77 @@
     - attributes:
       - name: uuid
         string: GPU-e859d6ce-7f9c-4f46-b122-f3df7ebf9f06
-      - bool: true
-        name: mig-capable
-      - bool: false
-        name: mig-enabled
       name: gpu-0
+      partitions: ## NOTE: instantiates the partitions defined in the shape
+      - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+      ## NOTE: maybe just elide these if they are actually empty, or maybe
+      ## they all will actually have instancy things.  Do they each need a
+      ## name, which is referenced from claim status?
+      - mig-1g.5gb-0: {}
+      - mig-1g.5gb-1: {}
+      - mig-1g.5gb-2: {}
+      - mig-1g.5gb-3: {}
+      - mig-1g.5gb-4: {}
+      - mig-1g.5gb-5: {}
+      - mig-1g.5gb-6: {}
+      - mig-2g.10gb-0: {}
+      - mig-2g.10gb-2: {}
+      - mig-2g.10gb-4: {}
+      - mig-3g.20gb-0: {}
+      - mig-3g.20gb-4: {}
+      - mig-4g.20gb-0: {}
+      - mig-7g.40gb-0: {}
+      - mig-1g.5gb-me-0: {}
+      - mig-1g.5gb-me-1: {}
+      - mig-1g.5gb-me-2: {}
+      - mig-1g.5gb-me-3: {}
+      - mig-1g.5gb-me-4: {}
+      - mig-1g.5gb-me-5: {}
+      - mig-1g.5gb-me-6: {}
+      - mig-1g.10gb-0: {}
+      - mig-1g.10gb-2: {}
+      - mig-1g.10gb-4: {}
+      - mig-1g.10gb-6: {}
     - attributes:
       - name: uuid
         string: GPU-f87f3708-c8cf-4418-933b-2cb12f1f7d7c
-      - bool: true
-        name: mig-capable
-      - bool: false
-        name: mig-enabled
       name: gpu-1
+      partitions: ## NOTE: instantiates the partitions defined in the shape
+      - whole:
+        attributes:
+        - bool: true
+          name: mig-capable
+      ## NOTE: maybe just elide these if they are actually empty, or maybe
+      ## they all will actually have instancy things.  Do they each need a
+      ## name, which is referenced from claim status?
+      - mig-1g.5gb-0: {}
+      - mig-1g.5gb-1: {}
+      - mig-1g.5gb-2: {}
+      - mig-1g.5gb-3: {}
+      - mig-1g.5gb-4: {}
+      - mig-1g.5gb-5: {}
+      - mig-1g.5gb-6: {}
+      - mig-2g.10gb-0: {}
+      - mig-2g.10gb-2: {}
+      - mig-2g.10gb-4: {}
+      - mig-3g.20gb-0: {}
+      - mig-3g.20gb-4: {}
+      - mig-4g.20gb-0: {}
+      - mig-7g.40gb-0: {}
+      - mig-1g.5gb-me-0: {}
+      - mig-1g.5gb-me-1: {}
+      - mig-1g.5gb-me-2: {}
+      - mig-1g.5gb-me-3: {}
+      - mig-1g.5gb-me-4: {}
+      - mig-1g.5gb-me-5: {}
+      - mig-1g.5gb-me-6: {}
+      - mig-1g.10gb-0: {}
+      - mig-1g.10gb-2: {}
+      - mig-1g.10gb-4: {}
+      - mig-1g.10gb-6: {}
     - attributes:
       - name: uuid
         string: GPU-d197c034-8516-4dda-a2cd-67a46457a454

Contributor Author

Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.

If there are N partitions in shape, and M devices in the slice, the scheduler will be presented with N * M "devices" for satisfying claims from the slice.

This API is just what we need to store.
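
For example (counts hypothetical): if the shape defines 26 partitions, as in the diff above, and a slice lists 8 devices, the scheduler is presented with 26 × 8 = 208 allocatable "devices", while the slice itself stores only 26 partition entries plus 8 device entries.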

Contributor Author

Unless we need to set values on a per device, per partition basis, these instances are redundant.

@thockin

> Yes, but they don't need to be stored. They are just the concatenation of the device name with the partition name.

As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?

> Unless we need to set values on a per device, per partition basis, these instances are redundant

But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.

Contributor Author

> As long as we are clear in spec, I am OK with that. I didn't see it written. Does that mean "gpu-0/whole" is the name? Or should the "whole" partition be specially named as ""?

See https://github.com/kubernetes-sigs/wg-device-management/pull/31/files#diff-a53461b1c5a20f7a5b62cecae4b45445b5c874d8d192093ade65eb9cc06e3ec1R112

It would be gpu-0-whole. We could do "" maybe. The weird thing is that it is still a "partition". I wonder if we can just make "whole device is allocatable" a flag and just use the device name? A sort of "pseudo" partition? Not sure.

> But we DO need to. At least one of them: "mig-capable" applies to the "whole" partition and none of the others. I bet there are more. I think I would be fine to elide empty instances here, and treat this like a "last level override" for cases where the instance needs instance attribs.

Yes, that could work. We could also include the "whole device" as something special (as alluded to above). Or do both of those things, they are not mutually exclusive.
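
A rough sketch of how those two ideas might combine (fields are purely hypothetical, not part of this PR): the shape flags the whole device as allocatable under the device's own name, and device entries carry attributes or partition overrides only when an instance actually needs them:

deviceShape:
  wholeDeviceAllocatable: true   # hypothetical flag: "gpu-0" itself is a valid allocation
  partitions:
  - name: mig-1g.5gb-0
devices:
- name: gpu-0
  attributes:                    # instance attributes for the whole device only
  - name: mig-capable
    bool: true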

Contributor

I've pushed a rough outline of my thoughts on (yet another) way to do all of this and will start to refine it in the morning. I wanted to get it out before I go to bed though, in case any initial comments come in overnight that might influence how I push forward on it: #35

Contributor Author

I'll try to take a look later tonight

@thockin left a comment

I don't hate this (certainly smaller!) but I think it sort of conflates "classes" (in the OO sense) and instances. I wonder if there's a more explicit model that is 1 step less optimized but more specific...

@johnbelamaric (Contributor Author)

I think DeviceShape is basically an OO class. I don't think we really conflate them. There is the shape (class) and the instances (devices). We do flatten that view for the scheduler. Though we don't have to - but I think it is easier for the scheduler and user if we do.

@johnbelamaric (Contributor Author)

All of these conceptually come down to "what do we normalize so we don't repeat ourselves" and "how do we generate the flat view from the normalized view". There may be a few more variants on that theme.

name: mig-enabled
name: gpu-0
sharedCapacityConsumed:
deviceShape:
@thockin

This imposes a subtle new requirement, right? All devices in a slice MUST be the same shape.

Contributor

Yes it does. I made a similar comment somewhere as well.

Contributor Author

Yes. Of course we could allow multiple shapes and reference them by name in the device list, but I don't see any real value in doing that over just having another slice.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed
>
> You can:
>
>   • Reopen this PR with /reopen
>   • Mark this PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
