MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! #647
@elezar Can you help me look into this issue?
Could you try to update your workload to use the following container instead:
Also, is the …?
I use the following containerd config:
$ cat /etc/containerd/config.toml
disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2
[cgroup]
path = ""
[debug]
address = ""
format = ""
gid = 0
level = ""
uid = 0
[grpc]
address = "/run/containerd/containerd.sock"
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
tcp_address = ""
tcp_tls_ca = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0
[metrics]
address = ""
grpc_histogram = false
[plugins]
[plugins."io.containerd.gc.v1.scheduler"]
deletion_threshold = 0
mutation_threshold = 100
pause_threshold = 0.02
schedule_delay = "0s"
startup_delay = "100ms"
[plugins."io.containerd.grpc.v1.cri"]
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
device_ownership_from_security_context = false
disable_apparmor = false
disable_cgroup = false
disable_hugetlb_controller = true
disable_proc_mount = false
disable_tcp_service = true
enable_cdi = false
enable_selinux = false
enable_tls_streaming = false
enable_unprivileged_icmp = false
enable_unprivileged_ports = false
ignore_image_defined_volumes = false
max_concurrent_downloads = 3
max_container_log_line_size = 16384
netns_mounts_under_state_dir = false
restrict_oom_score_adj = false
sandbox_image = "easzlab.io.local:5000/easzlab/pause:3.9"
selinux_category_range = 1024
stats_collect_period = 10
stream_idle_timeout = "4h0m0s"
stream_server_address = "127.0.0.1"
stream_server_port = "0"
systemd_cgroup = false
tolerate_missing_hugetlb_controller = true
unset_seccomp_profile = ""
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
conf_template = "/etc/cni/net.d/10-default.conf"
max_conf_num = 1
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
ignore_rdt_not_enabled_errors = false
no_pivot = false
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = ""
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = "node"
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.auths]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."10.6.194.8".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."easzlab.io.local:5000".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."harbor.easzlab.io.local:8443".tls]
insecure_skip_verify = true
[plugins."io.containerd.grpc.v1.cri".registry.headers]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker.nju.edu.cn/", "https://kuamavit.mirror.aliyuncs.com"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."easzlab.io.local:5000"]
endpoint = ["http://easzlab.io.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
endpoint = ["https://gcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
endpoint = ["https://ghcr.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."harbor.easzlab.io.local:8443"]
endpoint = ["https://harbor.easzlab.io.local:8443"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
endpoint = ["https://gcr.nju.edu.cn/google-containers/"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."nvcr.io"]
endpoint = ["https://ngc.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
endpoint = ["https://quay.nju.edu.cn"]
[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
tls_cert_file = ""
tls_key_file = ""
[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"
[plugins."io.containerd.internal.v1.restart"]
interval = "10s"
[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.nri.v1.nri"]
disable = false
disable_connections = false
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins1"
plugin_registration_timeout = "5s"
plugin_request_timeout = "2s"
socket_path = "/var/run/nri/nri.sock"
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "runc"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]
[plugins."io.containerd.snapshotter.v1.aufs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.btrfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.devmapper"]
async_remove = false
base_image_size = ""
pool_name = ""
root_path = ""
[plugins."io.containerd.snapshotter.v1.native"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.overlayfs"]
root_path = ""
[plugins."io.containerd.snapshotter.v1.zfs"]
root_path = ""
[proxy_plugins]
[stream_processors]
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar"
[stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar+gzip"
[timeouts]
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"
[ttrpc]
address = ""
gid = 0
uid = 0
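The parts of this config that matter for the NVIDIA stack are default_runtime_name = "nvidia" and the nvidia* runtime entries pointing at /usr/local/nvidia/toolkit/. A quick, hedged way to confirm containerd actually resolved that default (requires crictl pointed at the containerd socket; field names can vary slightly between containerd versions):

$ ls -l /usr/local/nvidia/toolkit/nvidia-container-runtime
$ sudo crictl info | grep -i defaultRuntimeName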
I found that it is possible to run the MPS program directly on the host, but in the container it fails with the error above.
Could you provide more information on how you achieved this? Note that one of the key communication mechanisms between the MPS processes is the /dev/shm that we create for the containerized daemon. How are you injecting this into the container?
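For context: with MPS sharing enabled, the device plugin is expected to inject the daemon's shared memory and pipe directory into workload containers automatically. If it were wired up by hand, a sketch would look roughly like the fragment below; the host path /run/nvidia/mps is an assumption based on the plugin's defaults and may differ in your deployment.

# Hypothetical fragment of a workload pod spec; paths are assumptions.
spec:
  containers:
  - name: cuda-container
    volumeMounts:
    - name: mps-shm
      mountPath: /dev/shm            # shared memory segment used by MPS clients
    - name: mps-pipe
      mountPath: /run/nvidia/mps     # CUDA_MPS_PIPE_DIRECTORY points here
  volumes:
  - name: mps-shm
    hostPath:
      path: /run/nvidia/mps/shm      # assumed location of the daemon's shm
  - name: mps-pipe
    hostPath:
      path: /run/nvidia/mps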
First, thanks for the quick answer. Here are the steps I use to run MPS from a container:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

Regarding your tip that the MPS processes communicate over /dev/shm: see the check sketched right after this.
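A minimal way to inspect this from inside a workload container is to keep it alive and dump the MPS-related environment and /dev/shm. The override below is only a sketch; the image is reused from the spec above and the pod name mps-inspect is made up:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mps-inspect
spec:
  restartPolicy: Never
  containers:
  - name: inspect
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1
    # Override the entrypoint so the container stays up long enough to look around.
    command: ["sh", "-c", "env | grep -i MPS; ls -l /dev/shm; sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs mps-inspect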
@elezar Do you need any more information?
Sorry for the delay, @lengrongfu. Since you're using the GPU Operator to install the other components of the NVIDIA container stack, can you confirm that it isn't managing the device plugin? Which pods are running in the GPU Operator namespace?

Also, to rule out any issues in the …, it would be good to confirm that the workload container can properly access the MPS control daemon with the correct settings. Here, I would recommend updating the command to …

This should give us more to go on.
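To answer the first question, something like this would show what the operator is managing (the namespace name gpu-operator is an assumption; some installs use gpu-operator-resources):

$ kubectl get pods -n gpu-operator
$ kubectl get ds -n gpu-operator | grep -i device-plugin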
Could you run:
…
in a workload container? For example, the following one:
…
Sorry, it should be …
Just as a sanity check, could you confirm that running … works as expected?

Looking through the configs again: since the GPU Operator is being used to configure the toolkit and the driver, I would expect the nvidia runtime to be the default, as is shown in your config. Could you update the device plugin deployment with …?
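For reference, the containerd config above registers an nvidia runtime handler, so a matching Kubernetes RuntimeClass and a hedged way to point the device-plugin chart at it might look like this (the release name, namespace, and the assumption that the chart exposes a runtimeClassName value are all mine):

$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must match the containerd runtimes.nvidia entry above
EOF
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --reuse-values \
    --set runtimeClassName=nvidia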
Maybe it has something to do with …
I'm running into the same issue on a GTX 1070 with the same driver version as you. I wonder if a driver update would help.
I don't know, but I use …
Maybe the Pascal architecture has problems using MPS. Is there a correct client pod YAML for using MPS on Pascal-architecture GPUs?
There were significant improvements made to MPS with the release of Volta. It could be that our current implementation does not support pre-Volta devices. At present we have only qualified Volta devices.
We should check the device architecture at startup.
@elezar Could you please help here? I am not able to configure the MPS sharing option. Here is the output:
I0517 04:07:02.596318 1 main.go:107] Starting OS watcher.
Could anyone please give some advice here?
I met a similar problem. MPS works well on an A100-PCIE-40GB GPU, but it does not work on a TITAN X (Pascal). The vectoradd pod runs the CUDA vectorAdd sample image.
$ k get pod
NAME READY STATUS RESTARTS AGE
vectoradd 0/1 Error 2 (26s ago) 30s
$ k logs vectoradd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
How many cards does nvidia-smi show for your TITAN X GPU? If your pod is using one GPU card, the mps-control-daemon needs to be configured to use that one GPU via NVIDIA_VISIBLE_DEVICES.
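The card count being asked about can be listed directly on the node:

$ nvidia-smi -L   # prints one line per GPU the driver can see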
As @elezar mentioned above, the support we added for MPS implicitly only works for Volta+ GPUs. On pre-Volta GPUs there was no ability to limit the memory of each MPS client, and our code assumes this functionality is available. We should probably make this assumption explicit rather than implicit (or otherwise relax this constraint for pre-Volta GPUs with a warning printed in the log).
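Pre-Volta here means a compute capability below 7.0. On reasonably recent drivers this can be checked directly (the compute_cap query field is assumed to be available; older drivers may not have it):

$ nvidia-smi --query-gpu=name,compute_cap --format=csv
# A TITAN X (Pascal) reports 6.1, i.e. below the 7.0 (Volta) threshold,
# while an A100 reports 8.0.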
Are there any plans to support MPS for pre-Volta GPUs?
If I want to use time-slicing in Kubernetes, do I need to enable MPS on the node hosts?
No, you don't need to. Please read these docs: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#with-cuda-time-slicing
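For completeness, the linked section describes a sharing config of roughly this shape (the replica count is only an illustration):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4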
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.
1. Quick Debug Information
2. Issue or feature description
I use Helm to deploy k8s-device-plugin and configure MPS, but deploying a workload results in an error. The mps-control-daemon pod is running.
3. Information to attach (optional if deemed irrelevant)
I use gpu-operator to install the GPU driver (Helm chart version v23.9.1); the driver and toolkit installed successfully. I then used the following Helm command to install k8s-device-plugin. The nvidia-plugin-configs config content is:
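The actual config content is not preserved in this thread; for reference, an MPS sharing config of the documented shape looks roughly like this (the replica count is only an illustration):

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10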
The command to deploy the workload pod is:
The pod then has status Error, and the error log is:
Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)! [Vector addition of 50000 elements]
device-plugin pod log:
mps-control-daemon pod log:
GPU info: