
Is it necessary to install and run the gpu-operator for time-slicing to work? #461

Open
andersla opened this issue Nov 30, 2023 · 3 comments
Labels
question Categorizes issue or PR as a support question.

Comments


andersla commented Nov 30, 2023

1. Issue or Feature description

Is it necessary to install and run the gpu-operator for time-slicing to work, or is the device-plugin enough? Or is there something else wrong with my setup?

I have configured the nvidia-device-plugin with a config that implements time-slicing on my 4 RTX 3090 Ti cards, and the GPU nodes expose the correct number of time-sliced GPUs (I have 4 hardware GPUs, and with a time-slicing replication of 4 I see 16 GPU resources on the node).

I can start 10 concurrent pods, all receiving a GPU.

When I run TensorFlow analyses in the pods, I can only run analyses in as many concurrent pods as I have hardware GPUs. If I run more, the analyses in the pods crash. I can run analyses sequentially across all 10 pods, so they all do have access to a GPU.

I have not installed the gpu-operator; is that necessary for time-slicing to work? Or is there something else wrong with my setup?

my plugin config:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
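A quick way to sanity-check a config like the one above is to compare the slice count the node advertises against physical GPUs times replicas. The sketch below (node name gpu-node-1 is hypothetical; the kubectl command is shown in a comment) just does the arithmetic:

```python
# Hedged sketch: expected nvidia.com/gpu count under time-slicing.
# Assumes 4 physical GPUs and "replicas: 4" as in the config above.

def expected_slices(physical_gpus: int, replicas: int) -> int:
    """Each physical GPU is advertised `replicas` times by the device plugin."""
    return physical_gpus * replicas

# Compare against what the scheduler actually sees, e.g.:
#   kubectl get node gpu-node-1 \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
print(expected_slices(4, 4))  # prints 16, matching the 16 resources seen on the node
```

Note that this count only confirms the plugin registered the replicas; it says nothing about how the processes share GPU memory at runtime.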

These are the pods running in our cluster:

kubectl get po -n nvidia-device-plugin 
NAME                                                  READY   STATUS    RESTARTS        AGE
nvdp-gpu-feature-discovery-jvsr9                      2/2     Running   0               2d11h
nvdp-gpu-feature-discovery-x4n4w                      2/2     Running   0               2d11h
nvdp-node-feature-discovery-master-6954c9cd9c-76f4v   1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-6db58              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-6xr9b              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-97x7n              1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-nvzpj              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-qmrd4              1/1     Running   1 (2d11h ago)   2d11h
nvdp-nvidia-device-plugin-5lpjk                       2/2     Running   0               2d11h
nvdp-nvidia-device-plugin-jbjcx                       2/2     Running   0               2d11h
@klueska klueska added the question Categorizes issue or PR as a support question. label Jan 25, 2024
@shahaf600

I am wondering as well.

elezar (Member) commented Feb 6, 2024

It is not required to use the GPU Operator for time-slicing. One thing to note is that time-slicing does not prevent the different processes sharing a GPU from using all of the GPU's memory. Could it be that the first pod scheduled to a GPU is consuming all of the GPU memory?
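TensorFlow allocates nearly all GPU memory by default, which would explain the behavior described above. A minimal sketch of capping each pod's process to a fraction of the card (the 24576 MiB total and the split by 4 replicas are assumptions matching a 24 GiB RTX 3090 Ti and replicas: 4; TensorFlow 2.x API):

```python
# Hedged sketch: limit each TensorFlow process to one time-slice's worth of
# GPU memory so replicas sharing a card do not OOM each other.

def per_slice_mib(total_mib: int, replicas: int) -> int:
    """Naive even split of GPU memory across time-slice replicas (assumption)."""
    return total_mib // replicas

def limit_gpu_memory(total_mib: int = 24576, replicas: int = 4) -> None:
    """Cap this process's GPU memory via a logical-device configuration."""
    import tensorflow as tf  # imported lazily; only needed inside the pod
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(
                memory_limit=per_slice_mib(total_mib, replicas))],
        )
```

Calling limit_gpu_memory() before building any model would leave roughly 6144 MiB per replica; an alternative is tf.config.experimental.set_memory_growth(gpu, True), which allocates on demand instead of up front.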


arthur-r-oliveira commented May 10, 2024

I've tested time-slicing with RHEL 9.3, MicroShift 4.15.z, a Standard NC4as T4 v3 VM (4 vCPUs, 28 GiB memory), and an NVIDIA TU104GL [Tesla T4] GPU, and it works as expected. I've just created PR #702 with a sample manifest.

@elezar does it make sense to have this PR accepted?
