This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Chaotic device names shown in container's /dev/ path with GPU isolation #170

Closed
fredy12 opened this issue Aug 12, 2016 · 12 comments

@fredy12

fredy12 commented Aug 12, 2016

Problem Description:
Recently I have been using nvidia-docker to run containers that use GPU resources. My requirement is one GPU device per container (i.e. inside the container only /dev/nvidia0 or only /dev/nvidia1 should exist).

  1. I use NV_GPU=1 nvidia-docker run -ti --name container-1 nvidia/cuda bash to start container-1.
    Right after the container starts, the GPU devices are listed as follows (as expected):
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia1  /dev/nvidiactl

I also test /dev/nvidia1 with echo "hello" > /dev/nvidia1. The output is bash: echo: write error: Invalid argument, which is expected.
  2. Then I run nvidia-smi to list the GPU info. It reports one online GPU device with bus-id 0000:03:00.0.
  3. Then, and this is the important part, when I execute ls /dev/nvidia* again, the /dev/ listing has changed. The output is:

# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl

/dev/nvidia0 has appeared. I test it with the following commands:

# echo "hello" > /dev/nvidia0
bash: /dev/nvidia0: Operation not permitted
# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

nvidia-smi still shows only one device, with bus-id 0000:03:00.0; there is no device corresponding to /dev/nvidia0.

This will confuse applications running in this container.

Detailed Research:
Given the problem above, I did some research.

  1. I use
    docker run -ti --name container-2 --volume-driver=nvidia-docker --volume=nvidia_driver_352.79:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia1:/dev/nvidia0 nvidia/cuda bash
    in place of nvidia-docker run (this is what nvidia-docker actually does), and I change the device mapping from /dev/nvidia1 to /dev/nvidia0, so that the physical host's /dev/nvidia1 shows up inside the container as /dev/nvidia0.
  2. After it starts, I enter the container and run:
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidiactl

# echo "hello" > /dev/nvidia0
bash: echo: write error: Invalid argument

# nvidia-smi
Fri Aug 12 03:35:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ---(don't care)---  Off  | 0000:03:00.0     Off |                    0 |
| ---(don't care)---     |     ---(don't care)---   |      0%      Default |
+-------------------------------+----------------------+----------------------+

# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl

# echo "hello" > /dev/nvidia0
bash: /dev/nvidia0: Operation not permitted
# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

# nvidia-smi
Fri Aug 12 03:35:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ---(don't care)---  Off  | 0000:03:00.0     Off |                    0 |
| ---(don't care)---     |     ---(don't care)---   |      0%      Default |
+-------------------------------+----------------------+----------------------+
  3. These commands were executed serially, with no other operations in between.
    So you can see, the physical host's device /dev/nvidia1 is initially attached to the container as /dev/nvidia0, and it works well. But once nvidia-smi or any CUDA API call is executed (I have tested this), the device nodes are recreated with the same names as on the physical host.
    Testing again, I found that /dev/nvidia0 in the container can no longer be accessed, while the newly appeared /dev/nvidia1 can.
    In this case, nvidia-smi still shows the GPU index as 0 and still lists only one device.

This is quite chaotic and breaks the isolation (at least the display isolation).

I want to find out why this problem occurs.
Based on my research, here is my guess at the cause:

  1. The container shares the physical host's NVIDIA GPU driver, and the driver executes nvidia-modprobe when it is invoked. See the nvidia-modprobe description.
  2. Because the GPU driver is shared, when nvidia-modprobe is called it picks up the physical host's GPU device info (which covers /dev/nvidia0 and /dev/nvidia1, or more).
  3. nvidia-modprobe then recreates the device files under /dev/, which is why /dev/nvidia1 shows up in container-2 (a rough sketch of the equivalent manual step is below).
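
If I read the nvidia-modprobe options correctly (this is just my guess at the manual equivalent; -c creates the character device file for a given minor number and -u handles /dev/nvidia-uvm), the recreation step would be roughly:

# nvidia-modprobe -u -c=1          (would (re)create /dev/nvidia1 with major 195, minor 1)
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl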

But why can /dev/nvidia0 no longer be accessed after the reload, while /dev/nvidia1 can?

So, can anyone help me fix this chaotic device name problem?

@3XX0
Member

3XX0 commented Aug 12, 2016

The names of the devices should not matter and applications should not depend on them; those names are driver specific.

Now, what you noticed is just nvidia-smi creating devices that it detected. GPUs are still isolated correctly (using CGroups) as shown in your output. The Operation not permitted on the second device is the result of such isolation.
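
For reference, you can see the allowlist that enforces this from inside the container (this assumes the cgroup v1 devices hierarchy Docker used at the time; the exact entries depend on which GPU was passed in):

# grep '^c 195' /sys/fs/cgroup/devices/devices.list
c 195:255 rwm        (nvidiactl)
c 195:1 rwm          (the single GPU selected with NV_GPU=1)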

@flx42
Member

flx42 commented Aug 12, 2016

This will confuse applications running in this container.

It shouldn't: CUDA, nvidia-smi, and NVML (the low-level library powering nvidia-smi) work perfectly fine in this case. They report the right number of GPUs by ignoring the ones they don't have permission to use.

This is quite chaotic and breaks the isolation (at least the display isolation).

Not really: as you noticed, you don't have permission to use the device and it's not being listed.
You only know that there is another device. But, for instance, running lspci inside a container would also tell you that, even if you import no devices:

$ docker run -ti ubuntu:14.04
root@05770cae7bb1:/# apt-get update && apt-get install -y pciutils
[...]
root@05770cae7bb1:/# lspci | grep VGA    
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)

But why can /dev/nvidia0 no longer be accessed after the reload, while /dev/nvidia1 can?

It's because bind-mounting a device node or creating one with mknod is not sufficient. See @3XX0's answer above concerning cgroups.
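
For example, even if you create a node yourself for a GPU that was not passed to the container (the major/minor numbers here are only illustrative), the devices cgroup still denies access:

# mknod /dev/nvidia2 c 195 2
# echo "hello" > /dev/nvidia2
bash: /dev/nvidia2: Operation not permitted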

If you really don't want the second device to show up, the following seems to work (but that's probably not a good idea and clearly overkill):

$ NV_GPU=1 nvidia-docker run -ti --rm --cap-drop MKNOD nvidia/cuda
root@0e558568bac8:/# nvidia-smi 
[...]
root@0e558568bac8:/# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia-uvm-tools  /dev/nvidia1  /dev/nvidiactl

@fredy12
Author

fredy12 commented Aug 12, 2016

Got it. Thanks very much for reply @3XX0 @flx42 . Your answers are very useful for me.

@flx42 flx42 closed this as completed Aug 12, 2016
@anwald

anwald commented Mar 28, 2017

@3XX0 @flx42

I hit the behavior below roughly once out of ten attempts:

core@phlrr3103 ~ $ docker run -ti -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia5 private-cntk-scale-image_cuda.8.0-cudnn.5.1.10-nccl.1.3.0-opencv.3.1.0-openmpi.1.10.3 bash

root@28deb9951062:~# ls /dev/nvidia*
/dev/nvidia-uvm /dev/nvidia5 /dev/nvidiactl

root@28deb9951062:~# nvidia-smi > /dev/null

root@28deb9951062:~# ls /dev/nvidia*
/dev/nvidia-uvm /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidiactl

root@28deb9951062:~# echo "hello" > /dev/nvidia0
bash: echo: write error: Invalid argument

root@28deb9951062:~# echo "hello" > /dev/nvidia5
bash: echo: write error: Invalid argument

root@28deb9951062:~# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

root@28deb9951062:~# nvidia-smi
Tue Mar 28 03:03:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.55 Driver Version: 367.55 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 24GB Off | 0000:05:00.0 Off | 0 |
| N/A 25C P8 18W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M40 24GB Off | 0000:08:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M40 24GB Off | 0000:0D:00.0 Off | 0 |
| N/A 24C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M40 24GB Off | 0000:13:00.0 Off | 0 |
| N/A 23C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla M40 24GB Off | 0000:83:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla M40 24GB Off | 0000:89:00.0 Off | 0 |
| N/A 26C P8 16W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla M40 24GB Off | 0000:8E:00.0 Off | 0 |
| N/A 25C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla M40 24GB Off | 0000:91:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

When I tried to reproduce it, I got the typical "bash: /dev/nvidia0: Operation not permitted" for every device except the one I mapped in, and nvidia-smi kept its integrity.

So what I have shown above is a rare occurrence. Is this a known issue?

@flx42
Member

flx42 commented Mar 28, 2017

@anwald No, this is not a known issue. Can you check the cgroups inside such a container?

$ cat /sys/fs/cgroup/devices/devices.list

@anwald

anwald commented Mar 28, 2017

I killed the container and can't reproduce what I did above. On later containers I do check this and it looks fine. Weird. We run a multi-tenant cluster where we virtualize GPUs this way, and we did have someone claim this happened to them. We were also surprised to learn that MKNOD is even a default privilege. I think we have decided to go ahead and add --cap-drop MKNOD. We're looking for any good reasons not to do this (to minimize unintended consequences). Anything come to mind?

@flx42
Member

flx42 commented Mar 28, 2017

With nvidia-docker 1.0.1, it should be fine since we now call nvidia-modprobe each time, but we haven't tested this configuration extensively.
@3XX0 any additional comment?

If the issue happens again, please strace the process and also do ls -l /dev/nvidia* so we can check the major/minor numbers.
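
Something along these lines should do it (assuming strace is available in the container; the node creation may happen in a child process such as nvidia-modprobe, hence -f):

# strace -f -e trace=mknod,mknodat nvidia-smi > /dev/null
# ls -l /dev/nvidia*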

@anwald

anwald commented Mar 30, 2017

We have a live repro and have the container available for debugging now. This time it happened on a machine with only 2 x Tesla K40m.

The container was started like so (irrelevant parts, such as additional -v mappings, are cut out with ...):

docker run ... -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --net=host --memory=52G --cpu-shares=512 --ulimit memlock=-1 --ulimit stack=67108864 --rm full-scale-image-tf-0.12.1 /bin/bash -c ...

Normally, when the issue is not happening, the devices cgroup looks like this inside the container:
cat /sys/fs/cgroup/devices/devices.list
c *:* m
b *:* m
c 5:1 rwm
c 4:0 rwm
c 4:1 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 5:0 rwm
c 1:9 rwm
c 1:8 rwm
c 231:64 rwm
c 231:65 rwm
c 10:57 rwm
c 231:224 rwm
c 231:0 rwm
c 231:1 rwm
c 231:192 rwm
c 195:255 rwm
c 246:0 rwm
c 195:3 rwm (for example here we mapped in /dev/nvidia3)

But right now, it clearly looks abnormal:

cat /sys/fs/cgroup/devices/devices.list
a *:* rwm

So it looks like the container somehow breaks out of its cgroup limitations. I guess at this point one wouldn't care much about the minor numbers given the above:

ls -l /dev/nvidia*
crw-rw-rw-. 1 root root 246, 0 Mar 28 18:42 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 195, 0 Mar 28 18:42 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 1 Mar 28 18:42 /dev/nvidia1
crw-rw-rw-. 1 root root 195, 255 Mar 28 18:42 /dev/nvidiactl

NVidia-smi shows both devices, and the user is able to make use of both devices. I can write() to both device nodes.

We use Docker 1.10.3 and NVIDIA driver 367.55, with both the kernel-mode and user-mode components installed once outside the container (we map in the user-mode libraries, as you can see at the top).

Any idea how the cgroup can break like that? Could this be a Docker bug? I'm not sure what I could strace at this point since the damage looks already done, and this only happens a small fraction of the time, so I'm not sure we want to enable strace in production.

@3XX0
Member

3XX0 commented Mar 30, 2017

Yes, most likely a Docker bug. 1.10.3 is pretty old, IIRC cgroups were still handled directly by Docker back then.

@x1957

x1957 commented Sep 20, 2017

@anwald Have you solved the problem? We have the same issue.

@anwald

anwald commented Sep 20, 2017

@x1957 Yes, we were able to solve this without bumping the Docker version.

First, here is the problem: the NVIDIA driver is able to introspect the machine and lazily create the missing dev nodes for each detected device. So, when you map -v /dev/nvidiaX:/dev/nvidiaX into the Docker container, you can still end up with /dev/nvidiaY because the NVIDIA driver tries to be smart. Our solution was very simple: just add --cap-drop MKNOD to the docker run command, roughly as sketched below. This disallows any process in the container's namespace (its process tree) from creating any device node, and it was highly effective for us. Note that we are not using nvidia-docker, just docker (we map in the user-mode driver and ensure its version matches the loaded kernel-mode driver).
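
For the record, our run command now looks roughly like this (the image name is a placeholder; the driver volume path is from our setup above):

docker run --cap-drop MKNOD -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 our-image bash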

@x1957

x1957 commented Sep 21, 2017

@anwald Thanks, I will try it.

abuccts added a commit to microsoft/pai that referenced this issue Mar 19, 2019
Drop `MKNOD` capability when starting container.
`nvidia-smi` will make all gpu devices show up under /dev,
drop [mknod](https://linux.die.net/man/2/mknod) capability to avoid this.

Reference: NVIDIA/nvidia-docker#170
abuccts added a commit to microsoft/pai that referenced this issue Mar 20, 2019
* Support nvidia container runtime for gpu isolation

Support nvidia-container-runtime for gpu isolation.
Set NVIDIA_VISIBLE_DEVICES to void to avoid conflict with runc, reference:
https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices.

Works for w/ and w/o --runtime=nvidia.

Closes #1667.

* Set default runtime to runc

Set runtime to runc explicitly to overwrite default runtime.

* Drop `MKNOD` capability

Drop `MKNOD` capability when starting container.
`nvidia-smi` will make all gpu devices show up under /dev,
drop [mknod](https://linux.die.net/man/2/mknod) capability to avoid this.

Reference: NVIDIA/nvidia-docker#170