Chaotic device names show up in the container's /dev/ path with GPU isolation #170
The names of the devices should not matter and applications should not depend on them; they are driver specific. Now, what you noticed is just the driver recreating the device nodes when it initializes; the cgroup restrictions still apply (see the answers below).
It shouldn't. CUDA, nvidia-smi, and NVML (the low-level library powering nvidia-smi) work perfectly fine in this case: they report the right number of GPUs by ignoring the ones they don't have permission to use.
Not really: as you noticed, you don't have permission to use the device, and it isn't being listed.
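A quick illustration of that behavior (a sketch; the device index and the presence of an extra, non-granted /dev/nvidia1 node are assumptions):

```
# Inside a container that was granted only /dev/nvidia0:
nvidia-smi                    # reports a single GPU, silently ignoring the others
echo "hello" > /dev/nvidia1   # if this node exists but isn't whitelisted for the
                              # container, the write fails with "Operation not permitted"
```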
It's because bind-mounting a device or creating it with mknod is not sufficient; see @3XX0's answer concerning cgroups. If you really don't want the second device to show up, the following seems to work (but it's probably not a good idea and clearly overkill):
I was able to do this once out of ten times:
```
root@28deb9951062:~# ls /dev/nvidia*
root@28deb9951062:~# nvidia-smi > /dev/null
root@28deb9951062:~# ls /dev/nvidia*
root@28deb9951062:~# echo "hello" > /dev/nvidia0
root@28deb9951062:~# echo "hello" > /dev/nvidia5
root@28deb9951062:~# echo "hello" > /dev/nvidia1
root@28deb9951062:~# nvidia-smi
```
When I try to reproduce it, I get the typical case of "bash: /dev/nvidia0: Operation not permitted" for every device except the one I mapped in, and nvidia-smi keeps its integrity. So what I have shown above is a rare instance. Is this a known issue?
@anwald No, this is not a known issue. Can you check the cgroups inside such a container?
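For example (a sketch; the container name `gpu-test` is a placeholder), the device whitelist can be read from the host or from inside the container:

```
# From the host, for a running container:
docker exec gpu-test cat /sys/fs/cgroup/devices/devices.list

# Or from a shell inside the container itself:
cat /sys/fs/cgroup/devices/devices.list
```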
I killed the container and can't reproduce what I did above. On later containers, I do check this and it looks fine. Weird. We run a multi-tenant cluster where we virtualize GPUs this way, and we did have someone claim this happened to them. We were also surprised to learn that MKNOD is even a default privilege. I think we have decided to go ahead and add --cap-drop MKNOD. We're looking for any good reasons not to do this (to minimize unintended consequences). Anything come to mind?
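One quick way to confirm the capability is actually dropped (a sketch; the image and node name are placeholders):

```
# With Docker's default capability set, mknod succeeds (although the devices
# cgroup may still block actually using the created node). With --cap-drop MKNOD
# the call is denied outright.
docker run --rm --cap-drop MKNOD nvidia/cuda bash -c 'mknod /tmp/fake-null c 1 3'
# expected: an "Operation not permitted" error from mknod
```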
With nvidia-docker 1.0.1, it should be fine since we now call […]. If the issue happens again, please […]
We have a live repro and have the container available for debugging now. This time it happened on a machine with only 2 x Tesla K40m. The container was started like so (irrelevant parts such as -v mappings cut out with ...):

```
docker run ... -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --net=host --memory=52G --cpu-shares=512 --ulimit memlock=-1 --ulimit stack=67108864 --rm full-scale-image-tf-0.12.1 /bin/bash -c ...
```

Normally, when the issue is not happening, the cgroups look fine inside the container. But right now, the output of `cat /sys/fs/cgroup/devices/devices.list` clearly looks abnormal, so it looks like the container somehow breaks out of its cgroup limitations. I guess at this point one wouldn't care much about the minor numbers given the above: `ls -l /dev/nvidia*` and nvidia-smi show both devices, the user is able to make use of both, and I can write() to both device nodes.

We use docker 1.10.3 and NVIDIA driver 367.55 for both the kernel-mode and user-mode libraries, installed once outside the container (we map in the user-mode libraries, as you can see at the top). Any idea how cgroups can break like that? Could this be a docker bug? I'm not sure what I can strace at this point since it looks like the damage is done, and this only happens a small fraction of the time, so I'm not sure we want to enable strace in production.
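For context on reading that file: each entry has the form `type major:minor access`. A container restricted to a single GPU would normally show only specific character-device entries, while a catch-all line means the restrictions are effectively gone (illustrative values, not the actual output from the affected machine):

```
c 195:255 rwm   # /dev/nvidiactl (195 is the NVIDIA character-device major)
c 195:0 rwm     # the one GPU passed with --device=/dev/nvidia0
a *:* rwm       # a line like this allows any device to be created and used
```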
Yes, most likely a Docker bug. 1.10.3 is pretty old, IIRC cgroups were still handled directly by Docker back then. |
@anwald Have you solved the problem? we have the same issue. |
@x1957 Yes, we were able to solve this without bumping the docker version. First, here is the problem: the NVIDIA driver is able to introspect the machine and lazily create device nodes for each detected device that is not already there. So when you map -v /dev/nvidiaX:/dev/nvidiaX into the docker container, you can still end up with /dev/nvidiaY, because the NVIDIA driver tries to be smart. Our solution was very simple: just add --cap-drop MKNOD to the docker run command. This disallows any process in the namespace (the process tree of the container) from creating any device node, and was highly effective for us. Note that we are not using nvidia-docker, just docker (we map in the user-mode driver and ensure its version matches the loaded kernel-mode driver).
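For reference, a minimal sketch of that approach (the image, GPU index, and driver volume path are placeholders based on earlier examples in this thread):

```
# Grant only GPU 1, map in the matching user-mode driver, and drop MKNOD so the
# driver cannot recreate the other /dev/nvidia* nodes inside the container.
docker run -ti --rm \
    --cap-drop MKNOD \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia1 \
    -v /var/drivers/nvidia/current:/usr/local/nvidia:ro \
    nvidia/cuda bash
```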
@anwald Thanks, I will try it. |
Drop the `MKNOD` capability when starting the container. `nvidia-smi` will make all GPU devices show up under /dev; drop the [mknod](https://linux.die.net/man/2/mknod) capability to avoid this. Reference: NVIDIA/nvidia-docker#170
* Support nvidia container runtime for gpu isolation

  Support nvidia-container-runtime for gpu isolation. Set NVIDIA_VISIBLE_DEVICES to void to avoid conflicts with runc, reference: https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices. Works both with and without --runtime=nvidia. Closes #1667.

* Set default runtime to runc

  Set the runtime to runc explicitly to overwrite the default runtime.

* Drop `MKNOD` capability

  Drop the `MKNOD` capability when starting the container. `nvidia-smi` will make all GPU devices show up under /dev; drop the [mknod](https://linux.die.net/man/2/mknod) capability to avoid this. Reference: NVIDIA/nvidia-docker#170
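A sketch of the runtime-based approach described in these commits (assumes nvidia-container-runtime is installed and registered with Docker; the GPU index is a placeholder):

```
# Expose only GPU 1 to the container via the runtime; no --device flags needed.
docker run -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 nvidia/cuda nvidia-smi

# NVIDIA_VISIBLE_DEVICES=void makes the runtime behave like plain runc,
# so no GPU devices are injected at all.
docker run -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=void nvidia/cuda bash
```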
Problem Description:
Recently, I have been using nvidia-docker to run containers that use GPU resources. My requirement is one GPU device per container (i.e., the container should contain only /dev/nvidia0 or /dev/nvidia1, not both).

1. I run `NV_GPU=1 nvidia-docker run -ti --name container-1 nvidia/cuda bash` to start container-1. When the container has just started, the device list under /dev is as expected. I then test /dev/nvidia1 with `echo "hello" > /dev/nvidia1`; it outputs `bash: echo: write error: Invalid argument`, which is expected.
2. Then I run `nvidia-smi` to list all GPU info. It reports one online GPU device with bus-id 0000:03:00.0.
3. Then, most importantly, when I execute `ls /dev/nvidia*` again, the /dev/ list has changed: /dev/nvidia0 has appeared. I test this by running `nvidia-smi` again; it still shows only one device, with bus-id 0000:03:00.0, and no device corresponding to /dev/nvidia0 (the full sequence is collected below).
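The sequence above, collected into one session for clarity (a sketch; the prompt is a placeholder and the outputs are paraphrased in comments from the observations above):

```
root@container-1:~# ls /dev/nvidia*
# /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia1   <- only the granted GPU, as expected
root@container-1:~# echo "hello" > /dev/nvidia1
# bash: echo: write error: Invalid argument     <- expected
root@container-1:~# nvidia-smi
# one GPU listed, bus-id 0000:03:00.0
root@container-1:~# ls /dev/nvidia*
# /dev/nvidia0 has now appeared alongside the nodes above
```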
Detail Research:
So, with the problem above, I did some research. I used

```
docker run -ti --name container-2 --volume-driver=nvidia-docker --volume=nvidia_driver_352.79:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia1:/dev/nvidia0 nvidia/cuda bash
```

in place of `nvidia-docker run` (this is what nvidia-docker really does), and I switched the device mapping from /dev/nvidia1 to /dev/nvidia0, so that /dev/nvidia1 on the physical host shows up inside the container as /dev/nvidia0.
So you can see, the physical host's device /dev/nvidia1 is at first attached into the container as /dev/nvidia0, and it works well. But when `nvidia-smi` (or any CUDA API, I have tested this) is executed, the device nodes are recreated with the same names as on the physical host. Testing again, I found that /dev/nvidia0 in the container can no longer be accessed, but the newly appeared /dev/nvidia1 can be accessed. In this case, `nvidia-smi` still shows the GPU number as 0 and only lists one device. This is so chaotic and breaks the isolation (at least the display isolation).
So I want to find out why this problem occurs. With some research, I have a guess at the reason behind it. But why can /dev/nvidia0 no longer be accessed after the device nodes are recreated? And why can /dev/nvidia1 be accessed?
So, can anyone help me fix this chaotic device name problem?