
Re-enable GPU profiles for GCP/AWS #1219

Merged
merged 11 commits into main from fix-1209-gpu on May 9, 2022

Conversation

@viniciusdc (Contributor) commented Mar 31, 2022

Fixes | Closes | Resolves #1209

Changes introduced in this PR:

  • Created a new file in the 03-kubernetes-initialize stage for the NVIDIA driver daemonset
  • Introduced a new input variable, gpu_node_group_names, used to gate the count of the daemonset resource (see the sketch after this list)
  • Added schema validation for guest_accelerators
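
As a rough illustration only (not necessarily the exact resource added in this PR), the daemonset can be gated on that variable so it is installed only when at least one GPU node group exists; the resource and file names below are hypothetical:

# Hypothetical sketch: install the NVIDIA driver daemonset only when at least
# one GPU node group is defined. Resource name and file path are illustrative.
resource "kubernetes_manifest" "nvidia_driver_daemonset" {
  count = length(var.gpu_node_group_names) > 0 ? 1 : 0

  manifest = yamldecode(file("${path.module}/files/nvidia-driver-daemonset.yaml"))
}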

Types of changes

What types of changes does your PR introduce?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features to not work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

Requires testing

  • Yes
  • No

In case you checked yes, did you write tests?

  • Yes
  • No

Further comments (optional)

Tested that the GPU was available in the profile by executing nvidia-smi:

[screenshot: nvidia-smi output]

@viniciusdc (Contributor, Author) commented:

This re-enables profiles with GPU support. I think we should also include the JupyterHub version bump in this PR, as the GPU global-settings issue #1201 might show up under unexpected circumstances:

Normal   NotTriggerScaleUp  3m51s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity/selector, 1 max node group size reached
Warning  FailedScheduling   39s (x5 over 3m53s)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate.

Those messages originated from the conda-store worker pod after it was evicted.
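
Pods that are meant to land on GPU nodes need to tolerate that taint. As a hedged illustration only (where it belongs, the daemonset vs. user pods, depends on how #1201 is resolved), a toleration in the hashicorp/kubernetes provider's pod-spec syntax would look roughly like:

# Hypothetical fragment of a pod spec in the Terraform kubernetes provider;
# it allows scheduling onto nodes tainted with nvidia.com/gpu=present.
toleration {
  key      = "nvidia.com/gpu"
  operator = "Equal"
  value    = "present"
  effect   = "NoSchedule"
}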

@viniciusdc requested a review from costrouc March 31, 2022 20:56
@viniciusdc (Contributor, Author) commented:

The reason I opted for

variable "gpu_node_group_names" {
  description = "Names of node groups with GPU"
  default = []
}

instead of a simple boolean to conditionally enable the daemonset was to stay consistent with what was done for the AWS provider. We can change that later if needed.
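
For context, a hypothetical way the variable could be populated from the node-group definitions (module path, variable names, and structure here are illustrative, not taken from this PR):

# Hypothetical sketch: pass only the node groups that request guest
# accelerators (GCP) into the stage that installs the driver daemonset.
module "kubernetes_initialize" {
  source = "./modules/kubernetes/initialize"   # illustrative path

  gpu_node_group_names = [
    for name, ng in var.node_groups : name
    if length(try(ng.guest_accelerators, [])) > 0
  ]
}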

@viniciusdc added the needs: review 👀 and DO-NOT-MERGE labels Mar 31, 2022
@viniciusdc (Contributor, Author) commented Mar 31, 2022

This should not be merged until:

  • The post-release is completed
  • The taint issue above, nvidia.com/gpu: present, is fixed within this PR (or a linked PR is submitted)

@viniciusdc (Contributor, Author) commented:

On hold for #1227

@magsol added this to the Release v0.4.1 milestone Apr 26, 2022
@viniciusdc changed the title from "Re-enable GPU profiles for GCP" to "Re-enable GPU profiles for GCP/AWS" Apr 26, 2022
@viniciusdc requested a review from costrouc April 26, 2022 19:17
@viniciusdc (Contributor, Author) commented:

Just found a bug; pushing the fix right away.

@iameskild mentioned this pull request May 5, 2022
@viniciusdc (Contributor, Author) commented:

On hold for a quota increase to test on AWS.

@viniciusdc added the status: blocked ⛔️ label May 6, 2022
@viniciusdc added the provider: AWS label and removed the status: blocked ⛔️ label May 9, 2022
@iameskild merged commit 4c864db into main May 9, 2022
@iameskild deleted the fix-1209-gpu branch May 9, 2022 18:20
Successfully merging this pull request may close these issues: [BUG] - GCP GPU support (#1209).
4 participants