Kubespray v2.18.0 and containerd runtime #1043
Before merging this and essentially making GPU Operator the default path, I propose we first get GPU Operator/DGX support working (#1074) and get DCGM/Prometheus support integrated (#890). Until then, could we just check in this PR but keep the default as Docker? We should also introduce another nightly test that does Docker installs.
DGX support with GPU Operator is complete. GPU Operator integration into our monitoring stack is complete. We just need to merge #1074 with the content of this PR.
DeepOps is currently pinned to Kubespray v2.17.1, which is newer than the version this PR uses. Additionally, we're seeing issues in DGX OS 5.1.x (#1110) up to Kubespray v2.17.1. So when we're ready for this PR to move forward, we should bump the Kubespray version to v2.18.0 (or whatever is latest).
* master: (230 commits)
  fix ansible install in lint action
  Pin versions for ansible-lint to 5.4.0
  add linting exception for two shell statements where we aren't sure it will work
  add header comment to ansible-lint-roles.sh
  ansible-lint: remove file permission warnings for ood-wrapper
  shellcheck ansible-lint-roles.sh
  add notes on linting and run dos2unix to fix line endings
  print list of roles excluded when running ansible-lint
  move metadata warnings to lint skip_list
  ansible-lint role netapp-trident
  add mechanism to exclude known-problematic-roles
  show a summary at end of linting of failed roles
  ansible-lint role nfs
  ansible-lint role slurm
  ansible-lint role singularity_wrapper
  ansible-lint role roce_backend
  ansible-lint role pyxis
  ansible-lint role prometheus-slurm-exporter
  ansible-lint role prometheus-node-exporter
  ansible-lint role prometheus
  ...
…ontainer runtime via kubespray
LGTM.
I'd like to eventually add documentation here that outlines the different K8s configuration options, plus additional testing to validate them.
I think we have:
- Device Plugin + Docker (with nvidia-driver)
- Device Plugin + containerd (with nvidia-driver)
- GPU Operator + Docker (with driver container or with nvidia-driver)
- GPU Operator + containerd (with driver container or with nvidia-driver)
Is Device Plugin with Docker a supported path now? Do we have 6 total core configurations here?
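To make the count in the question above concrete, here is a minimal sketch that enumerates the combinations implied by the list. It assumes, reading the list literally, that the Device Plugin path only pairs with a host-installed nvidia-driver while the GPU Operator path allows either a driver container or nvidia-driver. The names `DRIVER_OPTIONS`, `RUNTIMES`, and `core_configurations` are hypothetical and not part of DeepOps or Kubespray.

```python
from itertools import product

# Assumption: driver options per GPU stack, taken literally from the list above.
DRIVER_OPTIONS = {
    "Device Plugin": ["nvidia-driver"],
    "GPU Operator": ["driver container", "nvidia-driver"],
}

# The two container runtimes discussed in this PR.
RUNTIMES = ["Docker", "containerd"]


def core_configurations():
    """Return every (stack, runtime, driver) combination as a list of tuples."""
    configs = []
    for stack, runtime in product(DRIVER_OPTIONS, RUNTIMES):
        for driver in DRIVER_OPTIONS[stack]:
            configs.append((stack, runtime, driver))
    return configs


if __name__ == "__main__":
    configs = core_configurations()
    for stack, runtime, driver in configs:
        print(f"{stack} + {runtime} (with {driver})")
    print(f"total: {len(configs)}")
```

Under those assumptions the enumeration yields two Device Plugin configurations and four GPU Operator configurations, six in total. A list like this could seed a nightly test matrix, though which of the six are actually supported is the open question here.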
Upgrades Kubespray to v2.18.0 and sets containerd as the default k8s runtime instead of Docker.
Using containerd also means we're now using the GPU Operator by default.