fix: lower gpu node cpu-request to allow scheduling #57

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

blairdrummond
Copy link

@blairdrummond blairdrummond commented Feb 12, 2021

My colleague's GPU image could not be scheduled, apparently because of the CPU limits. After a few kubectl edits on his image, adjusting the CPU and memory, I found that reducing the CPU request without touching the memory was enough to get the image scheduled.

Closes #58
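
For context, the change is of this shape (the field layout is standard for a notebook pod spec, but the concrete values here are illustrative, not the actual numbers in this PR's diff): the CPU request is lowered so the pod fits within the GPU nodes' allocatable CPU, while the memory request is left unchanged.

```yaml
# Illustrative only — the actual values changed in this PR's diff may differ.
spec:
  containers:
    - name: notebook
      resources:
        requests:
          cpu: "2"       # lowered so the pod fits on the GPU nodes
          memory: 48Gi   # unchanged
        limits:
          nvidia.com/gpu: "1"
```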

@blairdrummond
Copy link
Author

This is an out-of-the-box GPU image

blair@system76-pc ~/jupyter-apis (lower-gpu-cpu-request)$ k describe pod -n blair-drummond x-0 | tail
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   <unknown>             default-scheduler   0/41 nodes are available: 2 Insufficient pods, 35 Insufficient cpu, 4 Insufficient memory.
  Warning  FailedScheduling   <unknown>             default-scheduler   0/41 nodes are available: 2 Insufficient pods, 35 Insufficient cpu, 4 Insufficient memory.
  Normal   NotTriggerScaleUp  12m (x3 over 14m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient nvidia.com/gpu, 1 Insufficient cpu, 1 Insufficient memory
  Normal   NotTriggerScaleUp  10m (x5 over 14m)     cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory, 1 Insufficient nvidia.com/gpu, 1 Insufficient cpu
  Normal   NotTriggerScaleUp  4m33s (x44 over 14m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 1 Insufficient nvidia.com/gpu
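
The fit check behind those "Insufficient cpu / memory / nvidia.com/gpu" messages is essentially: a pod fits a node only if, for every resource, the node's allocatable capacity minus what is already requested covers the pod's request. A minimal sketch of that logic (resource names and quantities are illustrative, not taken from this cluster):

```python
def fits(node_allocatable, node_requested, pod_requests):
    """Return the set of resources for which the pod does NOT fit.

    An empty set means the pod is schedulable on this node.
    """
    return {
        res for res, req in pod_requests.items()
        if req > node_allocatable.get(res, 0) - node_requested.get(res, 0)
    }

# Hypothetical GPU node: 16 CPUs allocatable, 14 already requested by other pods.
node_alloc = {"cpu": 16, "memory_gib": 64, "nvidia.com/gpu": 1}
node_used  = {"cpu": 14, "memory_gib": 16, "nvidia.com/gpu": 0}

# A pod requesting 4 CPUs fails with "Insufficient cpu" ...
print(fits(node_alloc, node_used, {"cpu": 4, "memory_gib": 32, "nvidia.com/gpu": 1}))
# ... but lowering only the CPU request, as this PR does, makes it fit.
print(fits(node_alloc, node_used, {"cpu": 2, "memory_gib": 32, "nvidia.com/gpu": 1}))
```

This is why reducing the CPU request alone was enough: memory and GPU already fit, and CPU was the only resource pushing the pod over the nodes' remaining capacity.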

@blairdrummond
Copy link
Author

@brendangadd any objections to merging?

@blairdrummond
Copy link
Author

@brendangadd I think this is good, but I don't know anything about how this gets deployed.

@ca-scribner
Copy link
Contributor

Per @wg102, this gets deployed via the manifests, so merging this in is safe.

@ca-scribner ca-scribner merged commit 0b55d4d into master Feb 16, 2021
@ca-scribner
Copy link
Contributor

PR to deploy here: StatCan/aaw-kubeflow-manifests#86

@sylus sylus deleted the lower-gpu-cpu-request branch February 8, 2022 20:28
@wg102 wg102 mentioned this pull request Sep 12, 2022
Successfully merging this pull request may close these issues.

GPU image fails to start with current defaults