
Kubespray v2.18.0 and containerd runtime #1043

Merged: 19 commits merged into NVIDIA:master from the switch-container-runtime branch on Mar 24, 2022

Conversation

@dholt (Contributor) commented Oct 6, 2021:

Upgrades Kubespray to v2.18.0 and sets containerd as the default k8s runtime instead of docker.

Using containerd also means we're now using the GPU Operator by default.
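
For readers following along, the switch boils down to a couple of group_vars settings. A minimal sketch, assuming the DeepOps config lives in config/group_vars/k8s-cluster.yml and that the GPU Operator toggle is named deepops_gpu_operator_enabled (verify both names against your checkout; only container_manager is a documented Kubespray variable):

```yaml
# config/group_vars/k8s-cluster.yml (sketch; not the literal diff in this PR)

# Kubespray's runtime selector: docker, containerd, or crio.
# This PR flips the default from docker to containerd.
container_manager: containerd

# With containerd there is no nvidia-docker shim, so GPU support comes from
# the NVIDIA GPU Operator instead of the standalone device plugin.
# (Variable name assumed for illustration.)
deepops_gpu_operator_enabled: true
```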

@dholt marked this pull request as ready for review on October 8, 2021
@supertetelman (Collaborator) commented:

Before merging this and making GPU Operator the default, we should really fix monitoring: #890

I have an open PR to bump the version and fix DGX support: #1069

@supertetelman (Collaborator) commented Dec 10, 2021:

Before merging this through and essentially making GPU Operator the default path, I propose we get GPU Operator / DGX support working (#1074) and get DCGM/Prometheus support integrated (#890).

Until then, could we check in this PR but keep the default as Docker?

Also, we should add a nightly test that does Docker installs.
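
For illustration, a per-site override that keeps the pre-PR behavior would look roughly like this (again assuming Kubespray's container_manager variable and a deepops_gpu_operator_enabled toggle; the toggle name may differ in the tree):

```yaml
# config/group_vars/k8s-cluster.yml — site-level override (sketch)

# Stay on the Docker runtime...
container_manager: docker

# ...and keep the classic GPU path (host nvidia-driver + NVIDIA container
# runtime + k8s-device-plugin) instead of the GPU Operator.
# (Toggle name assumed for illustration.)
deepops_gpu_operator_enabled: false
```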

@supertetelman (Collaborator) commented:

DGX support with GPU Operator is complete. GPU Operator integration into our monitoring stack is complete. We just need to merge #1074 with the content of this PR.

@ajdecon (Collaborator) commented Feb 14, 2022:

DeepOps is currently pinned to Kubespray v2.17.1, which is newer than the version this PR is using.

Additionally, we're seeing issues in DGX OS 5.1.x (#1110) up to Kubespray v2.17.1.

So when we're ready for this PR to move forward, we should bump the Kubespray version to v2.18.0 (or whatever is latest).

* master: (230 commits)
  fix ansible install in lint action
  Pin versions for ansible-lint to 5.4.0
  add linting exception for two shell statements where we aren't sure it will work
  add header comment to ansible-lint-roles.sh
  ansible-lint: remove file permission warnings for ood-wrapper
  shellcheck ansible-lint-roles.sh
  add notes on linting and run dos2unix to fix line endings
  print list of roles excluded when running ansible-lint
  move metadata warnings to lint skip_list
  ansible-lint role netapp-trident
  add mechanism to exclude known-problematic-roles
  show a summary at end of linting of failed roles
  ansible-lint role nfs
  ansible-lint role slurm
  ansible-lint role singularity_wrapper
  ansible-lint role roce_backend
  ansible-lint role pyxis
  ansible-lint role prometheus-slurm-exporter
  ansible-lint role prometheus-node-exporter
  ansible-lint role prometheus
  ...
@dholt changed the title from "Kubespray v2.17.0 and containerd runtime" to "Kubespray v2.18.0 and containerd runtime" on Mar 23, 2022
@dholt force-pushed the switch-container-runtime branch from f545fe1 to 0d67d3d on March 23, 2022
@dholt requested a review from supertetelman on March 23, 2022
@supertetelman (Collaborator) left a comment:

LGTM.

I'd like to add some additional documentation here eventually, and some additional testing, to outline the different K8s configuration options and validate them.

I think we have

  • Device Plugin + Docker (with nvidia-driver)
  • Device Plugin + containerd (with nvidia-driver)
  • GPU Operator + Docker (with driver container or with nvidia-driver)
  • GPU Operator + containerd (with driver container or with nvidia-driver)

Is Device Plugin with Docker a supported path now? Do we have 6 total core configurations here?
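
To make the matrix above concrete, here is roughly how the combinations map onto configuration knobs. Only container_manager is Kubespray's documented variable; the GPU Operator toggle and any passthrough of the operator's Helm value driver.enabled are assumptions about how DeepOps wires this up:

```yaml
# Sketch of the knobs behind the configurations listed above (names partly assumed)

# 1) Runtime: docker or containerd (Kubespray)
container_manager: containerd

# 2) GPU stack: GPU Operator vs. standalone device plugin
deepops_gpu_operator_enabled: true   # false => k8s-device-plugin + host nvidia-driver

# 3) With the GPU Operator enabled, choose driver container vs. host driver.
#    The operator's Helm chart exposes this as driver.enabled; the variable
#    below is hypothetical, standing in for however DeepOps surfaces it.
gpu_operator_driver_enabled: false   # false => use a preinstalled host nvidia-driver
```

Two runtimes times three GPU paths gives the six core configurations the question above counts; which of them get nightly coverage is presumably what the extra testing mentioned earlier would pin down.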

@dholt merged commit cfb9fd8 into NVIDIA:master on Mar 24, 2022
@dholt deleted the switch-container-runtime branch on March 24, 2022
@ajdecon mentioned this pull request on Apr 26, 2022