MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! #647
@elezar Can you help me look into this issue?
Could you try to update your workload to use the following container instead:
Also, is the …?
I use the following containerd config:
$ cat /etc/containerd/config.toml
disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2
[cgroup]
path = ""
[debug]
address = ""
format = ""
gid = 0
level = ""
uid = 0
[grpc]
address = "/run/containerd/containerd.sock"
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
tcp_address = ""
tcp_tls_ca = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
[metrics]
address = ""
grpc_histogram = false
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
deletion_threshold = 0
mutation_threshold = 100
pause_threshold = 0.02
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
device_ownership_from_security_context = false
disable_apparmor = false
disable_cgroup = false
disable_hugetlb_controller = true
disable_proc_mount = false
disable_tcp_service = true
enable_cdi = false
enable_selinux = false
enable_tls_streaming = false
enable_unprivileged_icmp = false
enable_unprivileged_ports = false
ignore_image_defined_volumes = false
max_concurrent_downloads = 3
max_container_log_line_size = 16384
netns_mounts_under_state_dir = false
restrict_oom_score_adj = false
sandbox_image = "easzlab.io.local:5000/easzlab/pause:3.9"
selinux_category_range = 1024
stats_collect_period = 10
stream_idle_timeout = "4h0m0s"
stream_server_address = "127.0.0.1"
stream_server_port = "0"
systemd_cgroup = false
tolerate_missing_hugetlb_controller = true
unset_seccomp_profile = ""
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
conf_template = "/etc/cni/net.d/10-default.conf"
max_conf_num = 1
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
ignore_rdt_not_enabled_errors = false
no_pivot = false
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = ""
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = "node"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.auths]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.headers]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker.nju.edu.cn/", "https://kuamavit.mirror.aliyuncs.com"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
endpoint = ["http://easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
endpoint = ["https://gcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
endpoint = ["https://ghcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.easzlab.io.local:8443"]
endpoint = ["https://harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
endpoint = ["https://gcr.nju.edu.cn/google-containers/"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
endpoint = ["https://ngc.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
endpoint = ["https://quay.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins1"
plugin_registration_timeout = "5s"
plugin_request_timeout = "2s"
socket_path = "/var/run/nri/nri.sock"
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "runc"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.aufs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.btrfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.devmapper"]
async_remove = false
base_image_size = ""
pool_name = ""
root_path = ""
[plugins."io.containerd.snapshotter.v1.native"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.overlayfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.zfs"]
root_path = ""
[proxy_plugins]
[stream_processors]
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar"
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar+gzip"
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[ttrpc]
address = ""
gid = 0
uid = 0
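The parts of this config that matter for the NVIDIA stack are default_runtime_name = "nvidia" and the nvidia* runtime entries pointing at /usr/local/nvidia/toolkit/. A quick, hedged way to confirm containerd actually resolved that default (requires crictl pointed at the containerd socket; field names can vary slightly between containerd versions):

$ ls -l /usr/local/nvidia/toolkit/nvidia-container-runtime
$ sudo crictl info | grep -i defaultRuntimeName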
I found that it is possible to run the MPS program directly on the host, but in the container it fails with the error above.
Could you provide more information on how you achieved this? Note that one of the key communication mechanisms between the MPS processes is the /dev/shm that we create for the containerized daemon. How are you injecting this into the container?
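For context: with MPS sharing enabled, the device plugin is expected to inject the daemon's shared memory and pipe directory into workload containers automatically. If it were wired up by hand, a sketch would look roughly like the fragment below; the host path /run/nvidia/mps is an assumption based on the plugin's defaults and may differ in your deployment.

# Hypothetical fragment of a workload pod spec; paths are assumptions.
spec:
  containers:
  - name: cuda-container
    volumeMounts:
    - name: mps-shm
      mountPath: /dev/shm            # shared memory segment used by MPS clients
    - name: mps-pipe
      mountPath: /run/nvidia/mps     # CUDA_MPS_PIPE_DIRECTORY points here
  volumes:
  - name: mps-shm
    hostPath:
      path: /run/nvidia/mps/shm      # assumed location of the daemon's shm
  - name: mps-pipe
    hostPath:
      path: /run/nvidia/mps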
First, thanks for the quick answer. Here are the steps I use to run MPS from a container:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

Regarding your tip that the MPS processes communicate over /dev/shm: see the check sketched right after this.
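A minimal way to inspect this from inside a workload container is to keep it alive and dump the MPS-related environment and /dev/shm. The override below is only a sketch; the image is reused from the spec above and the pod name mps-inspect is made up:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mps-inspect
spec:
  restartPolicy: Never
  containers:
  - name: inspect
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    # Override the entrypoint so the container stays up long enough to look around.
    command: ["sh", "-c", "env | grep -i MPS; ls -l /dev/shm; sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs mps-inspect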
@elezar Do you need any more information?
Sorry for the delay, @lengrongfu. Since you're using the GPU Operator to install the other components of the NVIDIA container stack, can you confirm that it isn't managing the device plugin? Which pods are running in the GPU Operator namespace?

Also, to rule out any issues in the …, it would be good to confirm that the workload container can properly access the MPS control daemon with the correct settings. Here, I would recommend updating the command to …

This should give us more to go on.
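To answer the first question, something like this would show what the operator is managing (the namespace name gpu-operator is an assumption; some installs use gpu-operator-resources):

$ kubectl get pods -n gpu-operator
$ kubectl get ds -n gpu-operator | grep -i device-plugin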
Could you run:
…
in a workload container? For example, the following one:
…
Sorry, it should be …
Just as a sanity check, could you confirm that running … works as expected?

Looking through the configs again: since the GPU Operator is being used to configure the toolkit and the driver, I would expect the nvidia runtime to be the default, as is shown in your config. Could you update the device plugin deployment with …?
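For reference, the containerd config above registers an nvidia runtime handler, so a matching Kubernetes RuntimeClass and a hedged way to point the device-plugin chart at it might look like this (the release name, namespace, and the assumption that the chart exposes a runtimeClassName value are all mine):

$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the containerd runtimes.nvidia entry above
EOF
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --reuse-values \
    --set runtimeClassName=nvidia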
Maybe it has something to do with …
I'm running into the same issue on a GTX 1070 with the same driver version as you. I wonder if a driver update would help.
I don't know, but I use …
Maybe the Pascal architecture has problems using MPS. Is there a correct client pod YAML for using MPS on Pascal-architecture GPUs?
There were significant improvements made to MPS with the release of Volta. It could be that our current implementation does not support pre-Volta devices. At present we have only qualified Volta devices.
We should check the device architecture at startup.
@elezar Could you please help here? I am not able to configure the MPS sharing option. Here is the output:
I0517 04:07:02.596318 1 main.go:107] Starting OS watcher.
Could anyone please give some advice here?
I met a similar problem. MPS works well on an A100-PCIE-40GB GPU, but it does not work on a TITAN X (Pascal). The vectoradd pod runs the CUDA vectorAdd sample image.
$ k get pod
NAME READY STATUS RESTARTS AGE
vectoradd 0/1 Error 2 (26s ago) 30s
$ k logs vectoradd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
How many cards does nvidia-smi show for your TITAN X GPU? If your pod is using one GPU card, the mps-control-daemon needs to be configured to use that one GPU via NVIDIA_VISIBLE_DEVICES.
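The card count being asked about can be listed directly on the node:

$ nvidia-smi -L   # prints one line per GPU the driver can see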
As @elezar mentioned above, the support we added for MPS implicitly only works for Volta+ GPUs. On pre-Volta GPUs there was no ability to limit the memory of each MPS client, and our code assumes this functionality is available. We should probably make this assumption explicit rather than implicit (or otherwise relax this constraint for pre-Volta GPUs with a warning printed in the log).
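Pre-Volta here means a compute capability below 7.0. On reasonably recent drivers this can be checked directly (the compute_cap query field is assumed to be available; older drivers may not have it):

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# A TITAN X (Pascal) reports 6.1, i.e. below the 7.0 (Volta) threshold,
# while an A100 reports 8.0.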
Are there any plans to support MPS for pre-Volta GPUs?
If I want to use time-slicing in Kubernetes, do I need to enable MPS on the node hosts?
No, you don't need to. Please read these docs: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#with-cuda-time-slicing
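For completeness, the linked section describes a sharing config of roughly this shape (the replica count is only an illustration):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4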
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
1. Quick Debug Information
2. Issue or feature description
I use Helm to deploy k8s-device-plugin and configure MPS, but deploying a workload results in an error. The mps-control-daemon pod is running.
3. Information to attach (optional if deemed irrelevant)
I use gpu-operator to install the GPU driver (Helm chart version v23.9.1); the driver and toolkit installed successfully. I then used the following Helm command to install k8s-device-plugin. The nvidia-plugin-configs config content is:
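The actual config content is not preserved in this thread; for reference, an MPS sharing config of the documented shape looks roughly like this (the replica count is only an illustration):

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10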
The command to deploy the workload pod is:
The pod then has status Error, and the error log is:
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! [Vector addition of 50000 elements]
device-plugin pod log:
mps-control-daemon pod log:
GPU info: