This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Chaotic device names shown in container's /dev/ path with GPU isolation #170

Closed
fredy12 opened this issue Aug 12, 2016 · 12 comments

@fredy12

fredy12 commented Aug 12, 2016

Problem Description:
Recently I have been using nvidia-docker to run containers that use GPU resources. My requirement is one GPU device per container (i.e. inside the container only /dev/nvidia0 or only /dev/nvidia1 should exist).

  1. I use NV_GPU=1 nvidia-docker run -ti --name container-1 nvidia/cuda bash to start container-1.
    Right after the container starts, the GPU devices are listed as follows (as expected):
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia1  /dev/nvidiactl

I also test /dev/nvidia1 with echo "hello" > /dev/nvidia1. The output is bash: echo: write error: Invalid argument, which is expected.
  2. Then I run nvidia-smi to list the GPU info. It reports one online GPU device with bus-id 0000:03:00.0.
  3. Then, and this is the important part, when I execute ls /dev/nvidia* again, the /dev/ listing has changed. The output is:

# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl

/dev/nvidia0 has appeared. I test it with the following commands:

# echo "hello" > /dev/nvidia0
bash: /dev/nvidia0: Operation not permitted
# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

nvidia-smi still shows only one device, with bus-id 0000:03:00.0; there is no device corresponding to /dev/nvidia0.

This will confuse applications running in this container.

Detailed Research:
Given the problem above, I did some research.

  1. I use
    docker run -ti --name container-2 --volume-driver=nvidia-docker --volume=nvidia_driver_352.79:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia1:/dev/nvidia0 nvidia/cuda bash
    in place of nvidia-docker run (this is what nvidia-docker actually does), and I change the device mapping from /dev/nvidia1 to /dev/nvidia0, so that the physical host's /dev/nvidia1 shows up inside the container as /dev/nvidia0.
  2. After it starts, I enter the container and run:
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidiactl

# echo "hello" > /dev/nvidia0
bash: echo: write error: Invalid argument

# nvidia-smi
Fri Aug 12 03:35:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ---(don't care)---  Off  | 0000:03:00.0     Off |                    0 |
| ---(don't care)---     |     ---(don't care)---   |      0%      Default |
+-------------------------------+----------------------+----------------------+

# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl

# echo "hello" > /dev/nvidia0
bash: /dev/nvidia0: Operation not permitted
# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

# nvidia-smi
Fri Aug 12 03:35:22 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  ---(don't care)---  Off  | 0000:03:00.0     Off |                    0 |
| ---(don't care)---     |     ---(don't care)---   |      0%      Default |
+-------------------------------+----------------------+----------------------+
  3. These commands were executed serially, with no other operations in between.
    So you can see, the physical host's device /dev/nvidia1 is initially attached to the container as /dev/nvidia0, and it works well. But once nvidia-smi or any CUDA API call is executed (I have tested this), the device nodes are recreated with the same names as on the physical host.
    Testing again, I found that /dev/nvidia0 in the container can no longer be accessed, while the newly appeared /dev/nvidia1 can.
    In this case, nvidia-smi still shows the GPU index as 0 and still lists only one device.

This is quite chaotic and breaks the isolation (at least the display isolation).

I want to find out why this problem occurs.
Based on my research, here is my guess at the cause:

  1. The container shares the physical host's NVIDIA GPU driver, and the driver executes nvidia-modprobe when it is invoked. See the nvidia-modprobe description.
  2. Because the GPU driver is shared, when nvidia-modprobe is called it picks up the physical host's GPU device info (which covers /dev/nvidia0 and /dev/nvidia1, or more).
  3. nvidia-modprobe then recreates the device files under /dev/, which is why /dev/nvidia1 shows up in container-2 (a rough sketch of the equivalent manual step is below).
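
If I read the nvidia-modprobe options correctly (this is just my guess at the manual equivalent; -c creates the character device file for a given minor number and -u handles /dev/nvidia-uvm), the recreation step would be roughly:

# nvidia-modprobe -u -c=1          (would (re)create /dev/nvidia1 with major 195, minor 1)
# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidia1  /dev/nvidiactl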

But why can /dev/nvidia0 no longer be accessed after the reload, while /dev/nvidia1 can?

So, can anyone help me fix this chaotic device name problem?

@3XX0
Member

3XX0 commented Aug 12, 2016

The names of the devices should not matter and applications should not depend on them; those names are driver specific.

Now, what you noticed is just nvidia-smi creating devices that it detected. GPUs are still isolated correctly (using CGroups) as shown in your output. The Operation not permitted on the second device is the result of such isolation.
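
For reference, you can see the allowlist that enforces this from inside the container (this assumes the cgroup v1 devices hierarchy Docker used at the time; the exact entries depend on which GPU was passed in):

# grep '^c 195' /sys/fs/cgroup/devices/devices.list
c 195:255 rwm        (nvidiactl)
c 195:1 rwm          (the single GPU selected with NV_GPU=1)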

@flx42
Member

flx42 commented Aug 12, 2016

This will confuse applications running in this container.

It shouldn't: CUDA, nvidia-smi, and NVML (the low-level library powering nvidia-smi) work perfectly fine in this case. They report the right number of GPUs by ignoring the ones they don't have permission to use.

This is quite chaotic and breaks the isolation (at least the display isolation).

Not really: as you noticed, you don't have permission to use the device and it's not being listed.
You only know that there is another device. But, for instance, running lspci inside a container would also tell you that, even if you import no devices:

$ docker run -ti ubuntu:14.04
root@05770cae7bb1:/# apt-get update && apt-get install -y pciutils
[...]
root@05770cae7bb1:/# lspci | grep VGA    
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)

But why can /dev/nvidia0 no longer be accessed after the reload, while /dev/nvidia1 can?

It's because bind-mounting a device node or creating one with mknod is not sufficient. See @3XX0's answer above concerning cgroups.
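
For example, even if you create a node yourself for a GPU that was not passed to the container (the major/minor numbers here are only illustrative), the devices cgroup still denies access:

# mknod /dev/nvidia2 c 195 2
# echo "hello" > /dev/nvidia2
bash: /dev/nvidia2: Operation not permitted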

If you really don't want the second device to show up, the following seems to work (but that's probably not a good idea and clearly overkill):

$ NV_GPU=1 nvidia-docker run -ti --rm --cap-drop MKNOD nvidia/cuda
root@0e558568bac8:/# nvidia-smi 
[...]
root@0e558568bac8:/# ls /dev/nvidia*
/dev/nvidia-uvm  /dev/nvidia-uvm-tools  /dev/nvidia1  /dev/nvidiactl

@fredy12
Author

fredy12 commented Aug 12, 2016

Got it. Thanks very much for reply @3XX0 @flx42 . Your answers are very useful for me.

@flx42 flx42 closed this as completed Aug 12, 2016
@anwald

anwald commented Mar 28, 2017

@3XX0 @flx42

I hit the behavior below roughly once out of ten attempts:

core@phlrr3103 ~ $ docker run -ti -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia5 private-cntk-scale-image_cuda.8.0-cudnn.5.1.10-nccl.1.3.0-opencv.3.1.0-openmpi.1.10.3 bash

root@28deb9951062:~# ls /dev/nvidia*
/dev/nvidia-uvm /dev/nvidia5 /dev/nvidiactl

root@28deb9951062:~# nvidia-smi > /dev/null

root@28deb9951062:~# ls /dev/nvidia*
/dev/nvidia-uvm /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidiactl

root@28deb9951062:~# echo "hello" > /dev/nvidia0
bash: echo: write error: Invalid argument

root@28deb9951062:~# echo "hello" > /dev/nvidia5
bash: echo: write error: Invalid argument

root@28deb9951062:~# echo "hello" > /dev/nvidia1
bash: echo: write error: Invalid argument

root@28deb9951062:~# nvidia-smi
Tue Mar 28 03:03:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.55 Driver Version: 367.55 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M40 24GB Off | 0000:05:00.0 Off | 0 |
| N/A 25C P8 18W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M40 24GB Off | 0000:08:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M40 24GB Off | 0000:0D:00.0 Off | 0 |
| N/A 24C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M40 24GB Off | 0000:13:00.0 Off | 0 |
| N/A 23C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla M40 24GB Off | 0000:83:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla M40 24GB Off | 0000:89:00.0 Off | 0 |
| N/A 26C P8 16W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla M40 24GB Off | 0000:8E:00.0 Off | 0 |
| N/A 25C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla M40 24GB Off | 0000:91:00.0 Off | 0 |
| N/A 26C P8 17W / 250W | 0MiB / 22939MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

When I tried to reproduce it, I got the typical "bash: /dev/nvidia0: Operation not permitted" for every device except the one I mapped in, and nvidia-smi kept its integrity.

So what I have shown above is a rare occurrence. Is this a known issue?

@flx42
Member

flx42 commented Mar 28, 2017

@anwald No, this is not a known issue. Can you check the cgroups inside such a container?

$ cat /sys/fs/cgroup/devices/devices.list

@anwald

anwald commented Mar 28, 2017

I killed the container and can't reproduce what I did above. On later containers I do check this and it looks fine. Weird. We run a multi-tenant cluster where we virtualize GPUs this way, and we did have someone claim this happened to them. We were also surprised to learn that MKNOD is even a default privilege. I think we have decided to go ahead and add --cap-drop MKNOD. We're looking for any good reasons not to do this (to minimize unintended consequences). Anything come to mind?

@flx42
Member

flx42 commented Mar 28, 2017

With nvidia-docker 1.0.1, it should be fine since we now call nvidia-modprobe each time, but we haven't tested this configuration extensively.
@3XX0 any additional comment?

If the issue happens again, please strace the process and also do ls -l /dev/nvidia* so we can check the major/minor numbers.
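
Something along these lines should do it (assuming strace is available in the container; the node creation may happen in a child process such as nvidia-modprobe, hence -f):

# strace -f -e trace=mknod,mknodat nvidia-smi > /dev/null
# ls -l /dev/nvidia*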

@anwald

anwald commented Mar 30, 2017

We have a live repro and have the container available for debugging now. This time it happened on a machine with only 2 x Tesla K40m.

The container was started like so (irrelevant parts, such as additional -v mappings, are cut out with ...):

docker run ... -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --net=host --memory=52G --cpu-shares=512 --ulimit memlock=-1 --ulimit stack=67108864 --rm full-scale-image-tf-0.12.1 /bin/bash -c ...

Normally, when the issue is not happening, the devices cgroup looks like this inside the container:
cat /sys/fs/cgroup/devices/devices.list
c *:* m
b *:* m
c 5:1 rwm
c 4:0 rwm
c 4:1 rwm
c 136:* rwm
c 5:2 rwm
c 10:200 rwm
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 5:0 rwm
c 1:9 rwm
c 1:8 rwm
c 231:64 rwm
c 231:65 rwm
c 10:57 rwm
c 231:224 rwm
c 231:0 rwm
c 231:1 rwm
c 231:192 rwm
c 195:255 rwm
c 246:0 rwm
c 195:3 rwm (for example here we mapped in /dev/nvidia3)

But right now, it clearly looks abnormal:

cat /sys/fs/cgroup/devices/devices.list
a *:* rwm

So it looks like the container somehow breaks out of its cgroup limitations. I guess at this point one wouldn't care much about the minor numbers given the above:

ls -l /dev/nvidia*
crw-rw-rw-. 1 root root 246, 0 Mar 28 18:42 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 195, 0 Mar 28 18:42 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 1 Mar 28 18:42 /dev/nvidia1
crw-rw-rw-. 1 root root 195, 255 Mar 28 18:42 /dev/nvidiactl

NVidia-smi shows both devices, and the user is able to make use of both devices. I can write() to both device nodes.

We use Docker 1.10.3 and NVIDIA driver 367.55, with both the kernel-mode and user-mode components installed once outside the container (we map in the user-mode libraries, as you can see at the top).

Any idea how the cgroup can break like that? Could this be a Docker bug? I'm not sure what I could strace at this point since the damage looks already done, and this only happens a small fraction of the time, so I'm not sure we want to enable strace in production.

@3XX0
Member

3XX0 commented Mar 30, 2017

Yes, most likely a Docker bug. 1.10.3 is pretty old, IIRC cgroups were still handled directly by Docker back then.

@x1957

x1957 commented Sep 20, 2017

@anwald Have you solved the problem? We have the same issue.

@anwald

anwald commented Sep 20, 2017

@x1957 Yes, we were able to solve this without bumping the Docker version.

First, here is the problem: the NVIDIA driver is able to introspect the machine and lazily create the missing dev nodes for each detected device. So, when you map -v /dev/nvidiaX:/dev/nvidiaX into the Docker container, you can still end up with /dev/nvidiaY because the NVIDIA driver tries to be smart. Our solution was very simple: just add --cap-drop MKNOD to the docker run command, roughly as sketched below. This disallows any process in the container's namespace (its process tree) from creating any device node, and it was highly effective for us. Note that we are not using nvidia-docker, just docker (we map in the user-mode driver and ensure its version matches the loaded kernel-mode driver).
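
For the record, our run command now looks roughly like this (the image name is a placeholder; the driver volume path is from our setup above):

docker run --cap-drop MKNOD -v /var/drivers/nvidia/current:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 our-image bash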

@x1957

x1957 commented Sep 21, 2017

@anwald Thanks, I will try it.

abuccts added a commit to microsoft/pai that referenced this issue Mar 19, 2019
Drop `MKNOD` capability when starting container.
`nvidia-smi` will make all gpu devices show up under /dev,
drop [mknod](https://linux.die.net/man/2/mknod) capability to avoid this.

Reference: NVIDIA/nvidia-docker#170
abuccts added a commit to microsoft/pai that referenced this issue Mar 20, 2019
* Support nvidia container runtime for gpu isolation

Support nvidia-container-runtime for gpu isolation.
Set NVIDIA_VISIBLE_DEVICES to void to avoid conflict with runc, reference:
https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices.

Works for w/ and w/o --runtime=nvidia.

Closes #1667.

* Set default runtime to runc

Set runtime to runc explicitly to overwrite default runtime.

* Drop `MKNOD` capability

Drop `MKNOD` capability when starting container.
`nvidia-smi` will make all gpu devices show up under /dev,
drop [mknod](https://linux.die.net/man/2/mknod) capability to avoid this.

Reference: NVIDIA/nvidia-docker#170