
Nvidia GPU Operator fixes to get it working properly #27

Merged · 8 commits · Jan 28, 2022

Conversation

compendius

Added

  • ability to select custom containerd runtime config file which is copied to workers and masters
  • example of working Nvidia GPU operator deployment with verification test (the old examples do not work)
  • ability to select a volume type
  • ability to select boot volume size and type for both workers and masters
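The options above can be imagined as module inputs. A hypothetical usage sketch follows; every variable name and value here is illustrative only, not taken from the actual PR diff:

```hcl
# Hypothetical usage sketch -- variable names below are illustrative
# guesses at what this PR adds, not the actual names in the module.
module "rke" {
  source = "./terraform-openstack-rke" # illustrative path

  # Custom containerd runtime config file, copied to workers and masters
  containerd_config_file = "files/containerd-config.toml.tmpl"

  # Volume type selection
  volume_type = "ssd"

  # Boot volume size and type for both masters and workers
  master_boot_volume_size = 40
  master_boot_volume_type = "ssd"
  worker_boot_volume_size = 80
  worker_boot_volume_type = "ssd"
}
```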

@remche
Owner

remche commented Jan 27, 2022

@compendius thanks a lot for your contribution 🎉
A few things before this PR can move forward:

  • could you please rebase onto the current master;
  • could you please clean up the git history, or create two different PRs (one for the module, one for the example) that I would squash.

I'm also curious what was not working in the jupyterhub-gpu example, because it has been successfully tested on quite a few platforms.

Thanks !

@compendius
Author

@remche thanks for considering the PR. I will try to rebase etc. In the meantime, the reason I had to alter the GPU Operator example is that, once deployed, it does not work properly despite appearing to run fine. For example, the recommended vectoradd test fails: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#cuda-vectoradd

$ kubectl logs cuda-vectoradd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

But when you add the custom containerd runtime config and the extra environment variables, it works:

$ kubectl logs cuda-vectoradd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Similar to this approach https://thenewstack.io/install-a-nvidia-gpu-operator-on-rke2-kubernetes-cluster/
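For context, the RKE2 article linked above involves making the NVIDIA runtime the default in containerd's CRI plugin. A hedged sketch of that kind of containerd config fragment follows; the BinaryName path is an assumption that depends on where the NVIDIA container toolkit installs the runtime on your nodes:

```toml
# Sketch of a containerd config fragment setting "nvidia" as the
# default CRI runtime. Section names follow containerd's CRI plugin
# layout; the BinaryName path below is an assumed install location.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime" # assumed path
```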

@remche
Owner

remche commented Jan 27, 2022

Just checked again and you're right that the jupyterhub-gpu example does not work out of the box; we were using an air-gapped image 🤣

Could you please update the jupyterhub-gpu example at the same time? 😬

thanks again !

@compendius
Author

I have attempted to rebase and pushed the changes. I need to clean up the history next.

@remche
Owner

remche commented Jan 27, 2022

Thanks for rebasing/cleaning. Could you please run terraform fmt -recursive?

A few things left:

  • the boot_* module agent variables should not be required (set defaults);
  • the example should work out of the box. We need a variables.tf and versions.tf.
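A minimal versions.tf for the example could look like the following sketch, assuming the community OpenStack provider; the version constraints here are illustrative, not the ones actually used in this PR:

```hcl
# Illustrative versions.tf sketch for the example.
# Pin versions to match the module's real requirements.
terraform {
  required_version = ">= 0.14"

  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = ">= 1.42.0" # assumed constraint
    }
  }
}
```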

Thanks again for making this module better !

variables.tf Outdated Show resolved Hide resolved
@remche remche merged commit 6935bbf into remche:master Jan 28, 2022
@remche
Owner

remche commented Jan 28, 2022

thanks a lot !!

@compendius
Author

No problem, glad to be of help. We are using this GPU stack for production workloads, so we wanted to share how we did it.

@compendius
Author

Just realised that the source in my Nvidia GPU example main.tf needs updating away from my fork. Do you want another PR?
