fix: lower gpu node cpu-request to allow scheduling #57

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

blairdrummond
Copy link

@blairdrummond blairdrummond commented Feb 12, 2021

My colleague's GPU image could not be scheduled, apparently because of the CPU limits. After a few kubectl edits on his image, adjusting the CPU and memory, I found that reducing the CPU request without touching the memory was enough to get the image scheduled.

Closes #58
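
For context, the change is of this shape (the field layout is standard for a notebook pod spec, but the concrete values here are illustrative, not the actual numbers in this PR's diff): the CPU request is lowered so the pod fits within the GPU nodes' allocatable CPU, while the memory request is left unchanged.

```yaml
# Illustrative only — the actual values changed in this PR's diff may differ.
spec:
  containers:
    - name: notebook
      resources:
        requests:
          cpu: "2"       # lowered so the pod fits on the GPU nodes
          memory: 48Gi   # unchanged
        limits:
          nvidia.com/gpu: "1"
```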

@blairdrummond
Copy link
Author

This is an out-of-the-box GPU image

blair@system76-pc ~/jupyter-apis (lower-gpu-cpu-request)$ k describe pod -n blair-drummond x-0 | tail
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   <unknown>             default-scheduler   0/41 nodes are available: 2 Insufficient pods, 35 Insufficient cpu, 4 Insufficient memory.
  Warning  FailedScheduling   <unknown>             default-scheduler   0/41 nodes are available: 2 Insufficient pods, 35 Insufficient cpu, 4 Insufficient memory.
  Normal   NotTriggerScaleUp  12m (x3 over 14m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient nvidia.com/gpu, 1 Insufficient cpu, 1 Insufficient memory
  Normal   NotTriggerScaleUp  10m (x5 over 14m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory, 1 Insufficient nvidia.com/gpu, 1 Insufficient cpu
  Normal   NotTriggerScaleUp  4m33s (x44 over 14m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 1 Insufficient nvidia.com/gpu
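
The fit check behind those "Insufficient cpu / memory / nvidia.com/gpu" messages is essentially: a pod fits a node only if, for every resource, the node's allocatable capacity minus what is already requested covers the pod's request. A minimal sketch of that logic (resource names and quantities are illustrative, not taken from this cluster):

```python
def fits(node_allocatable, node_requested, pod_requests):
    """Return the set of resources for which the pod does NOT fit.

    An empty set means the pod is schedulable on this node.
    """
    return {
        res for res, req in pod_requests.items()
        if req > node_allocatable.get(res, 0) - node_requested.get(res, 0)
    }

# Hypothetical GPU node: 16 CPUs allocatable, 14 already requested by other pods.
node_alloc = {"cpu": 16, "memory_gib": 64, "nvidia.com/gpu": 1}
node_used  = {"cpu": 14, "memory_gib": 16, "nvidia.com/gpu": 0}

# A pod requesting 4 CPUs fails with "Insufficient cpu" ...
print(fits(node_alloc, node_used, {"cpu": 4, "memory_gib": 32, "nvidia.com/gpu": 1}))
# ... but lowering only the CPU request, as this PR does, makes it fit.
print(fits(node_alloc, node_used, {"cpu": 2, "memory_gib": 32, "nvidia.com/gpu": 1}))
```

This is why reducing the CPU request alone was enough: memory and GPU already fit, and CPU was the only resource pushing the pod over the nodes' remaining capacity.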

@blairdrummond
Copy link
Author

@brendangadd any objections to merging?

@blairdrummond
Copy link
Author

@brendangadd I think this is good, but I don't know anything about how this gets deployed.

@ca-scribner
Copy link
Contributor

Per @wg102, this gets deployed via the manifests, so merging this in is safe.

@ca-scribner ca-scribner merged commit 0b55d4d into master Feb 16, 2021
@ca-scribner
Copy link
Contributor

PR to deploy here: StatCan/aaw-kubeflow-manifests#86

@sylus sylus deleted the lower-gpu-cpu-request branch February 8, 2022 20:28
@wg102 wg102 mentioned this pull request Sep 12, 2022
Successfully merging this pull request may close these issues.

GPU image fails to start with current defaults