Unable to use nvidia-docker 2.0 with driver version 390.30 #849

goodluckbot · 2018-10-29T04:14:17Z

1. Issue or feature description

I am not able to use nvidia-docker 2.0 with driver version 390.30.
Running a container fails with this error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: 
starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, 
stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure 
--ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 
brand=tesla,driver>=384,driver<385 --pid=29004 
/data/docker_rt/overlay2/65412291f76b94894ec8cdd271e3a7534a60a3e38a28e79fbc25d4276d70a215/merged]
\\\\nnvidia-container-cli: initialization error: cuda error: initialization error\\\\n\\\"\"": unknown.

2. Steps to reproduce the issue

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Kernel version from uname -a

Linux hostname 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Any relevant kernel output lines from dmesg

[Mon Oct 29 12:07:38 2018] device vethcd22dec entered promiscuous mode
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered forwarding state
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered forwarding state
[Mon Oct 29 12:07:39 2018] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 236
[Mon Oct 29 12:07:39 2018] NVRM: GPU at PCI:0000:8a:00: GPU-aa1a2c8f-eacf-29be-47b0-fbf8c6491ed9
[Mon Oct 29 12:07:39 2018] NVRM: GPU Board Serial Number: 0321116070750
[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:39 2018] device vethcd22dec left promiscuous mode
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).
[Mon Oct 29 12:39:50 2018] device vetha1169e8 entered promiscuous mode
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered forwarding state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered forwarding state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state
[Mon Oct 29 12:39:50 2018] device vetha1169e8 left promiscuous mode
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state

Driver information from nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Mon Oct 29 12:28:27 2018
Driver Version                      : 390.30

Attached GPUs                       : 8
GPU 00000000:05:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
......

Docker version from docker version

Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false

NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Output of rpm -qa '*nvidia*':

libnvidia-container1-1.0.0-1.x86_64
nvidia-container-runtime-2.0.0-1.docker18.06.1.x86_64
nvidia-container-runtime-hook-1.4.0-2.x86_64
nvidia-docker2-2.0.3-1.docker18.06.1.ce.noarch
pcp-pmda-nvidia-gpu-3.10.6-2.el7.x86_64
libnvidia-container-tools-1.0.0-1.x86_64

NVIDIA container library version from nvidia-container-cli -V

version: 1.0.0
build date: 2018-09-20T20:25+0000
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

NVIDIA container library logs (see troubleshooting)
cat /var/log/nvidia-container-runtime-hook.log

-- WARNING, the following logs are for debugging purposes only --

I1029 04:39:17.489945 7562 nvc.c:281] initializing library context (version=1.0.0, build=881c88e2e5bb682c9bb14e68bd165cfb64563bb1)
I1029 04:39:17.490054 7562 nvc.c:255] using root /
I1029 04:39:17.490067 7562 nvc.c:256] using ldcache /etc/ld.so.cache
I1029 04:39:17.490077 7562 nvc.c:257] using unprivileged user 65534:65534
I1029 04:39:17.495224 7568 nvc.c:191] loading kernel module nvidia
I1029 04:39:17.495950 7568 nvc.c:203] loading kernel module nvidia_uvm
I1029 04:39:17.496145 7568 nvc.c:211] loading kernel module nvidia_modeset
I1029 04:39:17.496705 7569 driver.c:133] starting driver service
I1029 04:39:17.499820 7562 driver.c:233] driver service terminated with signal 15

Docker command, image and tag used

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

3XX0 · 2018-10-31T18:36:30Z

[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).

One of your GPUs hit a DBE; you should reboot to retire the faulting page.
Also check that your driver is installed properly and that libcuda.so.1 matches your driver version.
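
For anyone triaging a similar report, here is a minimal Python sketch for pulling the NVRM Xid events out of dmesg output. The regular expression is an assumption based only on the two Xid lines quoted in this issue, and `find_xid_events` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed line shape, taken from the dmesg excerpt in this issue:
# [Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")


def find_xid_events(dmesg_text):
    """Return (pci_address, xid_code, message) for every NVRM Xid line."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group(1), int(match.group(2)), match.group(3).strip())
            )
    return events


# Sample lines copied from the dmesg output posted above.
sample = """\
[Mon Oct 29 12:07:39 2018] NVRM: GPU Board Serial Number: 0321116070750
[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).
"""

for pci, code, message in find_xid_events(sample):
    print(pci, code, message)
```

On a live host the same function could be fed the text of `dmesg` instead of the sample string.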

RenaudWasTaken · 2018-11-23T20:44:42Z

As mentioned above, check that your driver is installed properly. You can try launching a CUDA application outside of containers, such as one of the CUDA samples; it will fail if the driver is not installed properly.
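
If the CUDA samples are not at hand, a rough stand-in is to load the driver library directly on the host and call `cuInit`, which is the first call any CUDA application makes. The sketch below is only an illustration: `check_cuda_init` is a hypothetical helper, and it assumes `libcuda.so.1` is on the default library search path. It treats a return code of 0 (CUDA_SUCCESS) as success and anything else as a failure.

```python
import ctypes


def check_cuda_init(library_name="libcuda.so.1"):
    """Load the CUDA driver library on the host and call cuInit(0).

    Returns a human-readable status string instead of raising.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError as exc:
        # Library missing or not loadable: the driver install is suspect.
        return "could not load {}: {}".format(library_name, exc)

    # CUresult cuInit(unsigned int Flags); 0 means CUDA_SUCCESS.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    status = libcuda.cuInit(0)
    if status != 0:
        return "cuInit failed with CUDA error code {}".format(status)

    # CUresult cuDriverGetVersion(int *driverVersion)
    version = ctypes.c_int(0)
    libcuda.cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    libcuda.cuDriverGetVersion.restype = ctypes.c_int
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return "cuInit succeeded, but cuDriverGetVersion failed"
    return "cuInit succeeded; CUDA driver API version {}".format(version.value)


print(check_cuda_init())
```

If this fails on the host, the problem is in the driver installation rather than in nvidia-docker.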

Closing as there wasn't any follow-up. Feel free to re-open if you have any more information.

RenaudWasTaken closed this as completed Nov 23, 2018
