
Re-enable GPU profiles for GCP/AWS #1219

Merged
merged 11 commits into main from fix-1209-gpu on May 9, 2022

Conversation

@viniciusdc (Contributor) commented Mar 31, 2022

Fixes | Closes | Resolves #1209

Changes introduced in this PR:

  • Created a new file in the 03-kubernetes-initialize stage for the NVIDIA driver daemonset
  • Introduced a new input variable, gpu_node_group_names, used to gate the count of the daemonset resource (see the sketch after this list)
  • Added schema validation for guest_accelerators
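
As a rough illustration only (not necessarily the exact resource added in this PR), the daemonset can be gated on that variable so it is installed only when at least one GPU node group exists; the resource and file names below are hypothetical:

# Hypothetical sketch: install the NVIDIA driver daemonset only when at least
# one GPU node group is defined. Resource name and file path are illustrative.
resource "kubernetes_manifest" "nvidia_driver_daemonset" {
  count = length(var.gpu_node_group_names) > 0 ? 1 : 0

  manifest = yamldecode(file("${path.module}/files/nvidia-driver-daemonset.yaml"))
}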

Types of changes

What types of changes does your PR introduce?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features to not work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

Requires testing

  • Yes
  • No

In case you checked yes, did you write tests?

  • Yes
  • No

Further comments (optional)

Tested that the GPU was available in the profile by executing nvidia-smi:

[screenshot: nvidia-smi output]

@viniciusdc (Contributor, Author) commented:

This re-enables profiles with GPU support. I think we should also include the JupyterHub version bump in this PR, as the GPU global-settings issue #1201 might show up under unexpected circumstances:

Normal   NotTriggerScaleUp  3m51s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity/selector, 1 max node group size reached
Warning  FailedScheduling   39s (x5 over 3m53s)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate.

Those messages originated from the conda-store worker pod after it was evicted.
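
Pods that are meant to land on GPU nodes need to tolerate that taint. As a hedged illustration only (where it belongs, the daemonset vs. user pods, depends on how #1201 is resolved), a toleration in the hashicorp/kubernetes provider's pod-spec syntax would look roughly like:

# Hypothetical fragment of a pod spec in the Terraform kubernetes provider;
# it allows scheduling onto nodes tainted with nvidia.com/gpu=present.
toleration {
  key      = "nvidia.com/gpu"
  operator = "Equal"
  value    = "present"
  effect   = "NoSchedule"
}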

@viniciusdc requested a review from costrouc March 31, 2022 20:56
@viniciusdc (Contributor, Author) commented:

The reason I opted for

variable "gpu_node_group_names" {
  description = "Names of node groups with GPU"
  default = []
}

instead of a simple boolean to conditionally enable the daemonset was to stay consistent with what was done for the AWS provider. We can change that later if needed.
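
For context, a hypothetical way the variable could be populated from the node-group definitions (module path, variable names, and structure here are illustrative, not taken from this PR):

# Hypothetical sketch: pass only the node groups that request guest
# accelerators (GCP) into the stage that installs the driver daemonset.
module "kubernetes_initialize" {
  source = "./modules/kubernetes/initialize"   # illustrative path

  gpu_node_group_names = [
    for name, ng in var.node_groups : name
    if length(try(ng.guest_accelerators, [])) > 0
  ]
}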

@viniciusdc added the needs: review 👀 and DO-NOT-MERGE labels Mar 31, 2022
@viniciusdc (Contributor, Author) commented Mar 31, 2022

This should not be merged until:

  • The post-release is completed
  • The taint issue above, nvidia.com/gpu: present, is fixed within this PR (or a linked PR is submitted)

@viniciusdc (Contributor, Author) commented:

On hold for #1227

@magsol added this to the Release v0.4.1 milestone Apr 26, 2022
@viniciusdc changed the title from "Re-enable GPU profiles for GCP" to "Re-enable GPU profiles for GCP/AWS" Apr 26, 2022
@viniciusdc requested a review from costrouc April 26, 2022 19:17
@viniciusdc (Contributor, Author) commented:

Just found a bug; pushing the fix right away.

@iameskild mentioned this pull request May 5, 2022
@viniciusdc (Contributor, Author) commented:

On hold for a quota increase to test on AWS.

@viniciusdc added the status: blocked ⛔️ label May 6, 2022
@viniciusdc added the provider: AWS label and removed the status: blocked ⛔️ label May 9, 2022
@iameskild merged commit 4c864db into main May 9, 2022
@iameskild deleted the fix-1209-gpu branch May 9, 2022 18:20
Successfully merging this pull request may close these issues: [BUG] - GCP GPU support (#1209).
4 participants