
Nvidia GPU Operator fixes to get it working properly #27

Merged · 8 commits · Jan 28, 2022

Conversation

compendius

Added

  • ability to select custom containerd runtime config file which is copied to workers and masters
  • example of working Nvidia GPU operator deployment with verification test (the old examples do not work)
  • ability to select a volume type
  • ability to select boot volume size and type for both workers and masters
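The options above can be imagined as module inputs. A hypothetical usage sketch follows; every variable name and value here is illustrative only, not taken from the actual PR diff:

```hcl
# Hypothetical usage sketch -- variable names below are illustrative
# guesses at what this PR adds, not the actual names in the module.
module "rke" {
  source = "./terraform-openstack-rke" # illustrative path

  # Custom containerd runtime config file, copied to workers and masters
  containerd_config_file = "files/containerd-config.toml.tmpl"

  # Volume type selection
  volume_type = "ssd"

  # Boot volume size and type for both masters and workers
  master_boot_volume_size = 40
  master_boot_volume_type = "ssd"
  worker_boot_volume_size = 80
  worker_boot_volume_type = "ssd"
}
```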

@remche
Owner

remche commented Jan 27, 2022

@compendius thanks a lot for your contribution 🎉
A few things before this PR can move forward:

  • could you please rebase onto the current master;
  • could you please clean up the git history, or create two different PRs (one for the module, one for the example) that I would squash.

I'm also curious what was not working in the jupyterhub-gpu example, because it has been successfully tested on quite a few platforms.

Thanks !

@compendius
Author

@remche thanks for considering the PR. I will try to rebase etc. In the meantime, the reason I had to alter the GPU Operator example is that, once deployed, it does not work properly despite appearing to run fine. For example, the recommended vectoradd test fails: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#cuda-vectoradd

$ kubectl logs cuda-vectoradd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

But when you add the custom containerd runtime config and the extra environment variables, it works:

$ kubectl logs cuda-vectoradd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Similar to this approach https://thenewstack.io/install-a-nvidia-gpu-operator-on-rke2-kubernetes-cluster/
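For context, the RKE2 article linked above involves making the NVIDIA runtime the default in containerd's CRI plugin. A hedged sketch of that kind of containerd config fragment follows; the BinaryName path is an assumption that depends on where the NVIDIA container toolkit installs the runtime on your nodes:

```toml
# Sketch of a containerd config fragment setting "nvidia" as the
# default CRI runtime. Section names follow containerd's CRI plugin
# layout; the BinaryName path below is an assumed install location.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime" # assumed path
```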

@remche
Owner

remche commented Jan 27, 2022

Just checked again and you're right that the jupyterhub-gpu example does not work out of the box; we were using an air-gapped image 🤣

Could you please update the jupyterhub-gpu example at the same time? 😬

thanks again !

@compendius
Author

I have attempted to rebase and pushed the changes. I need to clean up the history next.

@remche
Owner

remche commented Jan 27, 2022

Thanks for rebasing/cleaning. Could you please run terraform fmt -recursive?

A few things left:

  • the boot_* module agent variables should not be required (set defaults);
  • the example should work out of the box. We need a variables.tf and versions.tf.
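A minimal versions.tf for the example could look like the following sketch, assuming the community OpenStack provider; the version constraints here are illustrative, not the ones actually used in this PR:

```hcl
# Illustrative versions.tf sketch for the example.
# Pin versions to match the module's real requirements.
terraform {
  required_version = ">= 0.14"

  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = ">= 1.42.0" # assumed constraint
    }
  }
}
```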

Thanks again for making this module better !

variables.tf Outdated Show resolved Hide resolved
@remche remche merged commit 6935bbf into remche:master Jan 28, 2022
@remche
Owner

remche commented Jan 28, 2022

thanks a lot !!

@compendius
Author

No problem, glad to be of help. We are using this GPU stack for production workloads, so we wanted to share how we did it.

@compendius
Author

Just realised that the source in my Nvidia GPU example main.tf needs updating away from my fork. Do you want another PR?
