
Is it necessary to install and run the gpu-operator for time-slicing to work? #461

Open
andersla opened this issue Nov 30, 2023 · 3 comments
Labels
question Categorizes issue or PR as a support question.

Comments


andersla commented Nov 30, 2023

1. Issue or Feature description

Is it necessary to install and run the gpu-operator for time-slicing to work, or is the device-plugin enough? Or is there something else wrong with my setup?

I have configured the nvidia-device-plugin with a config that implements time-slicing on my 4 RTX 3090 Ti cards, and the GPU nodes expose the correct number of time-sliced GPUs (I have 4 hardware GPUs, and with a time-slicing replication of 4 I see 16 GPU resources on the node).

I can start 10 concurrent pods, all receiving a GPU.

When I run TensorFlow analyses in the pods, I can only run analyses in as many concurrent pods as I have hardware GPUs. If I run more, the analyses in the pods crash. I can run analyses sequentially across all 10 pods, so they all do have access to a GPU.

I have not installed the gpu-operator; is that necessary for time-slicing to work? Or is there something else wrong with my setup?

my plugin config:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
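A quick way to sanity-check a config like the one above is to compare the slice count the node advertises against physical GPUs times replicas. The sketch below (node name gpu-node-1 is hypothetical; the kubectl command is shown in a comment) just does the arithmetic:

```python
# Hedged sketch: expected nvidia.com/gpu count under time-slicing.
# Assumes 4 physical GPUs and "replicas: 4" as in the config above.

def expected_slices(physical_gpus: int, replicas: int) -> int:
    """Each physical GPU is advertised `replicas` times by the device plugin."""
    return physical_gpus * replicas

# Compare against what the scheduler actually sees, e.g.:
#   kubectl get node gpu-node-1 \
#     -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
print(expected_slices(4, 4))  # prints 16, matching the 16 resources seen on the node
```

Note that this count only confirms the plugin registered the replicas; it says nothing about how the processes share GPU memory at runtime.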

These are the pods running in our cluster:

kubectl get po -n nvidia-device-plugin 
NAME                                                  READY   STATUS    RESTARTS        AGE
nvdp-gpu-feature-discovery-jvsr9                      2/2     Running   0               2d11h
nvdp-gpu-feature-discovery-x4n4w                      2/2     Running   0               2d11h
nvdp-node-feature-discovery-master-6954c9cd9c-76f4v   1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-6db58              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-6xr9b              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-97x7n              1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-nvzpj              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-qmrd4              1/1     Running   1 (2d11h ago)   2d11h
nvdp-nvidia-device-plugin-5lpjk                       2/2     Running   0               2d11h
nvdp-nvidia-device-plugin-jbjcx                       2/2     Running   0               2d11h
@klueska klueska added the question Categorizes issue or PR as a support question. label Jan 25, 2024
@shahaf600

I am wondering as well.

elezar (Member) commented Feb 6, 2024

It is not required to use the GPU Operator for time-slicing. One thing to note is that time-slicing does not prevent the different processes sharing a GPU from using all of the GPU's memory. Could it be that the first pod scheduled to a GPU is consuming all of the GPU memory?
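TensorFlow allocates nearly all GPU memory by default, which would explain the behavior described above. A minimal sketch of capping each pod's process to a fraction of the card (the 24576 MiB total and the split by 4 replicas are assumptions matching a 24 GiB RTX 3090 Ti and replicas: 4; TensorFlow 2.x API):

```python
# Hedged sketch: limit each TensorFlow process to one time-slice's worth of
# GPU memory so replicas sharing a card do not OOM each other.

def per_slice_mib(total_mib: int, replicas: int) -> int:
    """Naive even split of GPU memory across time-slice replicas (assumption)."""
    return total_mib // replicas

def limit_gpu_memory(total_mib: int = 24576, replicas: int = 4) -> None:
    """Cap this process's GPU memory via a logical-device configuration."""
    import tensorflow as tf  # imported lazily; only needed inside the pod
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(
                memory_limit=per_slice_mib(total_mib, replicas))],
        )
```

Calling limit_gpu_memory() before building any model would leave roughly 6144 MiB per replica; an alternative is tf.config.experimental.set_memory_growth(gpu, True), which allocates on demand instead of up front.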


arthur-r-oliveira commented May 10, 2024

I've tested time-slicing with RHEL 9.3, MicroShift 4.15.z, a Standard NC4as T4 v3 VM (4 vCPUs, 28 GiB memory), and an NVIDIA TU104GL [Tesla T4] GPU, and it works as expected. I've just created PR #702 with a sample manifest.

@elezar does it make sense to have this PR accepted?
