1. Issue or Feature description

Is it necessary to install and run the gpu-operator for time-slicing to work, or is the device plugin enough? Or is there something else wrong with my setup?

I have configured the nvidia-device-plugin with a config that enables time-slicing on my 4 RTX 3090 Ti cards, and the GPU nodes expose the correct number of time-sliced GPUs: I have 4 hardware GPUs, and with a time-slicing replica count of 4 I see 16 GPU resources on the node.

I can start 10 concurrent pods, each receiving a GPU. However, when I run TensorFlow analyses in the pods, only as many concurrent pods as I have hardware GPUs can run successfully; if I run more, the analyses in the extra pods crash. If I run the analyses sequentially, all 10 pods complete, so each of them does have access to a GPU.

I have not installed the gpu-operator. Is that necessary for time-slicing to work, or is there something else wrong with my setup?

My plugin config:
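(The exact config file was not preserved in this thread; a minimal device-plugin config implementing the 4-way time-slicing described above, following the plugin's documented `v1` config format, would look like this:)

```yaml
# Sketch of a time-slicing config for the nvidia-device-plugin.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu
# resources, so 4 physical GPUs appear as 16 on the node.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

These are the pods running in our cluster: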
It is not required to use the GPU Operator for time-slicing. One thing to note is that time-slicing does not prevent the processes sharing a GPU from using all of that GPU's memory. Could it be that the first pod scheduled to each GPU is consuming all of the GPU's memory?
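If that is the cause, one common mitigation (a sketch assuming TensorFlow 2.x inside the pods) is to stop TensorFlow from reserving essentially all GPU memory at startup, so that several time-sliced replicas can coexist on one card:

```python
import tensorflow as tf

# By default TensorFlow maps nearly all of the GPU's memory at startup,
# so the first pod scheduled onto a physical GPU can starve the other
# time-sliced replicas. Memory growth makes it allocate on demand instead.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternatively, give each process a fixed slice of the card; 5120 MiB is
# an arbitrary example value for a 24 GiB RTX 3090 Ti shared 4 ways.
# tf.config.set_logical_device_configuration(
#     tf.config.list_physical_devices("GPU")[0],
#     [tf.config.LogicalDeviceConfiguration(memory_limit=5120)],
# )
```

Note that this must run before any other TensorFlow call initializes the GPU.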
I've tested time-slicing with RHEL 9.3, MicroShift 4.15.z, a Standard NC4as T4 v3 instance (4 vCPUs, 28 GiB memory), and an NVIDIA TU104GL [Tesla T4] GPU, and it works as expected. I just created PR #702 with a sample manifest.

@elezar does it make sense to have this PR accepted?
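For anyone looking for the general shape of such a manifest (a sketch, not necessarily what #702 contains): the time-slicing config is typically wrapped in a ConfigMap that the device plugin is pointed at, e.g. via the Helm chart's `config.name` value:

```yaml
# Hypothetical ConfigMap carrying the time-slicing config; the data key
# ("default" here) is the config name the plugin selects at startup.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: nvidia-device-plugin
data:
  default: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```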