[WIP] PoC ResourceClass #3

Open

wants to merge 3 commits into master
Conversation

vikaschoudhary16
Owner

No description provided.

@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 2 times, most recently from 90af11e to 99325c2 Compare May 30, 2017 05:56
@@ -0,0 +1,14 @@
---
kind: ResourceClass
metadata:
Owner Author


[root@dell-r620-01 kubernetes]# kubectl get resourceclass my.rc2 -o yaml
apiVersion: v1
kind: ResourceClass
metadata:
  creationTimestamp: 2017-05-30T05:27:48Z
  name: my.rc2
  resourceVersion: "15222"
  selfLink: /api/v1/namespaces/default/resourceclasses/my.rc2
  uid: b820fe43-44f8-11e7-94f7-bc305bf4e400
spec:
  resourceSelector:
  - matchExpressions:
    - key: Type
      operator: In
      values:
      - nvidia-gpu
status: {}
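For readers skimming the selector above: each matchExpressions entry acts like a label-selector requirement, so this class matches any device whose Type label value is in the listed set. A minimal standalone sketch of the In operator (function and variable names here are illustrative, not the PR's actual code):

```go
package main

import "fmt"

// matchesIn reports whether the device's label value for key is one of values.
// It mirrors the matchExpressions "In" operator from the ResourceClass
// selector above; this is a sketch, not the real selector implementation.
func matchesIn(deviceLabels map[string]string, key string, values []string) bool {
	got, ok := deviceLabels[key]
	if !ok {
		return false
	}
	for _, v := range values {
		if v == got {
			return true
		}
	}
	return false
}

func main() {
	device := map[string]string{"Type": "nvidia-gpu", "model": "k80"}
	fmt.Println(matchesIn(device, "Type", []string{"nvidia-gpu"})) // true
}
```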

metadata:
  name: nvidia-tesla-gpu
  labels:
    type: nvidia-gpu
Owner Author


[root@dell-r620-01 kubernetes]# kubectl get node mynode -o yaml

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: 2017-05-24T16:32:26Z
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
  uid: 927af813-409e-11e7-80f5-bc305bf4e400
spec:
  externalID: dell-r620-01.perf.lab.eng.rdu.redhat.com
status:
  addresses:
  - address: 10.12.20.41
    type: InternalIP
  - address: dell-r620-01.perf.lab.eng.rdu.redhat.com
    type: Hostname
  - address: 10.12.20.41
    type: LegacyHostIP
  allocatable:
    cpu: "16"
    memory: 131714184Ki
    pods: "110"
  capacity:
    cpu: "16"
    memory: 131816584Ki
    pods: "110"
  conditions:
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  deviceAllocatable:
  - metadata:
      creationTimestamp: null
      labels:
        compute-ability: "3.7"
        ecc: "true"
        family: tesla
        memory: 10Gi
        model: k80
        nvlink: "true"
        quantity: "8"
        type: nvidia-gpu
      name: nvidia-tesla-gpu
  deviceCapacity:
  - metadata:
      creationTimestamp: null
      labels:
        compute-ability: "3.7"
        ecc: "true"
        family: tesla
        memory: 10Gi
        model: k80
        nvlink: "true"
        quantity: "8"
        type: nvidia-gpu
      name: nvidia-tesla-gpu
  images:
  - names:
    - kube-build:build-748641720b-5-v1.8.1-2
    sizeBytes: 3158843585
  - names:
    - "@"
    - ":"
    sizeBytes: 3158843585
  - names:
    - gcr.io/google_containers/kube-cross@sha256:ac1327a97b1b3e01d2ded9a781bb5675cd657af3c9d1ff56b97fcc3f2f2cd59c
    - gcr.io/google_containers/kube-cross:v1.8.1-2
    sizeBytes: 2357991939
  - names:
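Note that in the node status above the GPU count travels as a string-valued quantity label inside deviceAllocatable/deviceCapacity, rather than as a typed field. A quick sketch of how a consumer might read it (helper name and error handling are illustrative assumptions, not the PR's code):

```go
package main

import (
	"fmt"
	"strconv"
)

// deviceQuantity pulls the numeric count out of the string-valued
// "quantity" label seen in deviceAllocatable/deviceCapacity above.
func deviceQuantity(labels map[string]string) (int64, error) {
	raw, ok := labels["quantity"]
	if !ok {
		return 0, fmt.Errorf("no quantity label")
	}
	return strconv.ParseInt(raw, 10, 64)
}

func main() {
	labels := map[string]string{"type": "nvidia-gpu", "quantity": "8"}
	n, err := deviceQuantity(labels)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 8
}
```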

@aveshagarwal

@vikaschoudhary16 I will review it today

@aveshagarwal

In the first commit, the idea we had discussed was to read from one JSON file that has a list of devices. So let's discuss how to simplify the changes in pkg/util/node/node.go. I am going through the other commits.
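The single-JSON-file approach being suggested could look something like this (file format, struct fields, and function names are assumptions for illustration, not whatever the PR or review settles on):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Device is a hypothetical on-disk record; the real field names would come
// from whatever format the kubelet change adopts.
type Device struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
}

// parseDeviceList decodes a single JSON document holding a list of devices,
// as suggested above, instead of one file per device.
func parseDeviceList(data []byte) ([]Device, error) {
	var devices []Device
	if err := json.Unmarshal(data, &devices); err != nil {
		return nil, err
	}
	return devices, nil
}

func main() {
	raw := []byte(`[{"name": "nvidia-tesla-gpu", "labels": {"type": "nvidia-gpu", "quantity": "8"}}]`)
	devices, err := parseDeviceList(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(devices[0].Name, devices[0].Labels["type"])
}
```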

Basic testing of resource class creation and node status updates done.
Templates in this commit are used for testing resourceclass creation and
device updates in node status. Adding fixes for the issues encountered
during this testing.
@vikaschoudhary16 vikaschoudhary16 force-pushed the resource_class branch 3 times, most recently from b491d69 to 1cf0ec5 Compare June 12, 2017 10:23
Very minimal testing has been done so far.
Testing of overlapping resource classes is in progress.

Design doc will be shared soon.
Early reviews on design/approach are welcome.

To be implemented:
- Resource class delete and node delete handling

@jaypipes left a comment


Hi Vikas! A few questions inline... but mostly, from reading through this, I get the feeling there's a bit too much coupling between the device plugin subsystem and resource classes (as well as a concern that there is similar coupling between node and resource class).

Best,
-jay

@@ -327,12 +328,15 @@ func ClusterRoles() []rbac.ClusterRole {
rbac.NewRule(Read...).Groups(legacyGroup).Resources("nodes", "pods").RuleOrDie(),
rbac.NewRule("create").Groups(legacyGroup).Resources("pods/binding", "bindings").RuleOrDie(),
rbac.NewRule("update").Groups(legacyGroup).Resources("pods/status").RuleOrDie(),
rbac.NewRule("patch").Groups(legacyGroup).Resources("pods").RuleOrDie(),


Is this related to your patch? Maybe copy/pasted by accident?

@@ -45,10 +45,11 @@ func (pfactory *PredicateMetadataFactory) GetMetadata(pod *v1.Pod, nodeNameToInf
if err != nil {
return nil
}
resReq, _ := GetResourceRequest(pod)


This and line 52 seem to be unrelated changes?

fmt.Printf("\n%s c.List %p \n", file_line(), c.List)
allResClasses, err := c.List(labels.Everything())
if err != nil {
return nil, fmt.Errorf("error retrieving rcList from cache: %v", err)


probably worth spelling out resourceclassList instead of rcList

// C2:
// CPU: 1
// Memory: 1G
//
// Result: CPU: 3, Memory: 3G
func GetResourceRequest(pod *v1.Pod) *schedulercache.Resource {
// Result: CPU: 3, Memory: 3G, ['nvidia-gpu': 2, 'solarflare-40gig': 1]


This seems specific to your environment. What's the plan for making this more generic?

// Result: CPU: 3, Memory: 3G
func GetResourceRequest(pod *v1.Pod) *schedulercache.Resource {
// Result: CPU: 3, Memory: 3G, ['nvidia-gpu': 2, 'solarflare-40gig': 1]
func GetResourceRequest(pod *v1.Pod) (*schedulercache.Resource, *map[string]int32) {
result := schedulercache.Resource{}


Just curious on this... did you consider modifying the schedulercache.Resource struct to include information about the new resource classes? Or is this just proof-of-concept stuff? I'm wondering because this kind of gets at my comment on the spec PR about the resource classes being special snowflakes compared to other resources.
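The alternative being raised here could be sketched roughly as follows: fold the class quantities into the resource struct itself rather than returning them in a second map. The struct and field names below are toy stand-ins, not the real schedulercache.Resource:

```go
package main

import "fmt"

// resource is a toy stand-in for schedulercache.Resource; the extra
// ResourceClasses field shows the shape of the suggestion, i.e. class
// requests travel with the struct instead of in a second return value.
type resource struct {
	MilliCPU        int64
	Memory          int64
	ResourceClasses map[string]int32
}

// add accumulates another container's request, class quantities included,
// so callers like GetResourceRequest would need only one return value.
func (r *resource) add(other resource) {
	r.MilliCPU += other.MilliCPU
	r.Memory += other.Memory
	if r.ResourceClasses == nil {
		r.ResourceClasses = map[string]int32{}
	}
	for class, n := range other.ResourceClasses {
		r.ResourceClasses[class] += n
	}
}

func main() {
	var total resource
	total.add(resource{MilliCPU: 1000, Memory: 1 << 30, ResourceClasses: map[string]int32{"nvidia-gpu": 2}})
	total.add(resource{MilliCPU: 2000, Memory: 2 << 30, ResourceClasses: map[string]int32{"solarflare-40gig": 1}})
	fmt.Println(total.MilliCPU, total.ResourceClasses["nvidia-gpu"]) // 3000 2
}
```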

rc.resClass = rClass
for _, info := range cache.nodes {
fmt.Printf("\n%s cache.nodes %+v\n", file_line(), cache.nodes)
rcPerNodeInfo, err := info.AddResourceClass(rClass, info.node)


This smells a little too much like hard coupling between nodes and resource classes. Or am I missing something non-obvious here?
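One way to loosen that coupling, sketched here with made-up types rather than the scheduler cache's real ones, is to keep class-to-device matching a pure function of the selector and the device labels, so the cache only iterates nodes and records results instead of owning the matching logic:

```go
package main

import "fmt"

// requirement and device are simplified stand-ins for the ResourceClass
// selector requirements and node device entries; the real types live in
// the PR's scheduler cache and node status.
type requirement struct {
	Key      string
	Operator string // "In" or "Exists" in this sketch
	Values   []string
}

type device struct {
	Name   string
	Labels map[string]string
}

// classMatchesDevice is a pure function: no node or cache state involved,
// which keeps resource-class logic testable independently of nodes.
func classMatchesDevice(reqs []requirement, d device) bool {
	for _, r := range reqs {
		v, ok := d.Labels[r.Key]
		switch r.Operator {
		case "Exists":
			if !ok {
				return false
			}
		case "In":
			if !ok {
				return false
			}
			found := false
			for _, want := range r.Values {
				if want == v {
					found = true
				}
			}
			if !found {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	gpu := device{Name: "nvidia-tesla-gpu", Labels: map[string]string{"Type": "nvidia-gpu"}}
	reqs := []requirement{{Key: "Type", Operator: "In", Values: []string{"nvidia-gpu"}}}
	fmt.Println(classMatchesDevice(reqs, gpu)) // true
}
```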
