This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Unable to use nvidia-docker 2.0 with driver version 390.30 #849

Closed
8 tasks done
goodluckbot opened this issue Oct 29, 2018 · 2 comments

Comments

@goodluckbot

goodluckbot commented Oct 29, 2018

1. Issue or feature description

I am not able to use nvidia-docker 2.0 with driver version 390.30.
I get the following error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: 
starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, 
stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure 
--ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 
brand=tesla,driver>=384,driver<385 --pid=29004 
/data/docker_rt/overlay2/65412291f76b94894ec8cdd271e3a7534a60a3e38a28e79fbc25d4276d70a215/merged]
\\\\nnvidia-container-cli: initialization error: cuda error: initialization error\\\\n\\\"\"": unknown.
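
For readers puzzling over the `--require` argument in the command above, here is a minimal Python sketch of how such an expression can be read. It assumes the convention described in the NVIDIA container runtime documentation: space-separated groups are ORed and comma-separated constraints inside a group are ANDed. The function names (`parse_version`, `check_constraint`, `satisfied`) and the host dictionary are hypothetical and are not part of nvidia-container-cli; the `cuda` value of 9.1 for a 390.30 driver is also an assumption. Note that the failure reported here is an initialization error, not a requirement error, so this only illustrates the expression, not the root cause.

```python
import operator
import re

# Operators are listed longest-first so ">=" is tried before ">" or "=".
_CONSTRAINT = re.compile(r"^([a-z_]+)(>=|<=|!=|=|<|>)(.+)$")
_OPS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    "=": operator.eq, "<": operator.lt, ">": operator.gt,
}


def parse_version(text):
    """Turn '390.30' into (390, 30) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_constraint(constraint, host):
    """Evaluate one constraint such as 'driver>=384' against host properties."""
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError("unrecognized constraint: " + constraint)
    name, op, wanted = match.groups()
    actual = host[name]
    if name in ("cuda", "driver"):  # version-like properties
        actual, wanted = parse_version(actual), parse_version(wanted)
    return _OPS[op](actual, wanted)


def satisfied(expression, host):
    """Spaces separate ORed groups; commas separate ANDed constraints."""
    return any(
        all(check_constraint(c, host) for c in group.split(","))
        for group in expression.split()
    )


EXPRESSION = "cuda>=10.0 brand=tesla,driver>=384,driver<385"
# Assumed host properties for this report (the cuda value is a guess).
HOST = {"cuda": "9.1", "driver": "390.30", "brand": "tesla"}
print(satisfied(EXPRESSION, HOST))
```

Under these assumed host values neither group holds: `cuda>=10.0` fails, and `driver<385` fails for 390.30.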

2. Steps to reproduce the issue

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

3. Information to attach (optional if deemed irrelevant)

  • Kernel version from uname -a
Linux hostname 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
[Mon Oct 29 12:07:38 2018] device vethcd22dec entered promiscuous mode
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered forwarding state
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered forwarding state
[Mon Oct 29 12:07:39 2018] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 236
[Mon Oct 29 12:07:39 2018] NVRM: GPU at PCI:0000:8a:00: GPU-aa1a2c8f-eacf-29be-47b0-fbf8c6491ed9
[Mon Oct 29 12:07:39 2018] NVRM: GPU Board Serial Number: 0321116070750
[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:39 2018] device vethcd22dec left promiscuous mode
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered disabled state
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).
[Mon Oct 29 12:39:50 2018] device vetha1169e8 entered promiscuous mode
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered forwarding state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered forwarding state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state
[Mon Oct 29 12:39:50 2018] device vetha1169e8 left promiscuous mode
[Mon Oct 29 12:39:50 2018] docker0: port 1(vetha1169e8) entered disabled state
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                           : Mon Oct 29 12:28:27 2018
Driver Version                      : 390.30

Attached GPUs                       : 8
GPU 00000000:05:00.0
    Product Name                    : Tesla K80
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
......
  • Docker version from docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:23:03 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:25:29 2018
  OS/Arch:          linux/amd64
  Experimental:     false
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    output of rpm -qa '*nvidia*'
libnvidia-container1-1.0.0-1.x86_64
nvidia-container-runtime-2.0.0-1.docker18.06.1.x86_64
nvidia-container-runtime-hook-1.4.0-2.x86_64
nvidia-docker2-2.0.3-1.docker18.06.1.ce.noarch
pcp-pmda-nvidia-gpu-3.10.6-2.el7.x86_64
libnvidia-container-tools-1.0.0-1.x86_64
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.0
build date: 2018-09-20T20:25+0000
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs (see troubleshooting)
    cat /var/log/nvidia-container-runtime-hook.log
-- WARNING, the following logs are for debugging purposes only --

I1029 04:39:17.489945 7562 nvc.c:281] initializing library context (version=1.0.0, build=881c88e2e5bb682c9bb14e68bd165cfb64563bb1)
I1029 04:39:17.490054 7562 nvc.c:255] using root /
I1029 04:39:17.490067 7562 nvc.c:256] using ldcache /etc/ld.so.cache
I1029 04:39:17.490077 7562 nvc.c:257] using unprivileged user 65534:65534
I1029 04:39:17.495224 7568 nvc.c:191] loading kernel module nvidia
I1029 04:39:17.495950 7568 nvc.c:203] loading kernel module nvidia_uvm
I1029 04:39:17.496145 7568 nvc.c:211] loading kernel module nvidia_modeset
I1029 04:39:17.496705 7569 driver.c:133] starting driver service
I1029 04:39:17.499820 7562 driver.c:233] driver service terminated with signal 15
  • Docker command, image and tag used
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
@3XX0
Member

3XX0 commented Oct 31, 2018

[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).

One of your GPUs hit a DBE; you should reboot to retire the faulting page.
Also check that your driver is installed properly and that libcuda.so.1 matches your driver version.
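
The Xid lines quoted above are easy to miss among the network messages in dmesg. A small Python sketch for pulling them out of saved `dmesg` output might look like the following; the function name `find_xid_events` and the regular expression are illustrative assumptions based only on the line format shown in this thread, not an NVIDIA tool.

```python
import re

# Matches lines shaped like the NVRM Xid messages quoted in this thread.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<message>.*)"
)

# Sample input copied from the dmesg output in the report.
SAMPLE = """\
[Mon Oct 29 12:07:39 2018] docker0: port 1(vethcd22dec) entered forwarding state
[Mon Oct 29 12:07:39 2018] NVRM: Xid (PCI:0000:8a:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 4, subpartition 1.
[Mon Oct 29 12:07:40 2018] NVRM: Xid (PCI:0000:8a:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000002ce8ac).
"""


def find_xid_events(dmesg_text):
    """Return (pci_address, xid_code, message) for every Xid line found."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")),
                 match.group("message").strip())
            )
    return events


for pci, code, message in find_xid_events(SAMPLE):
    print(pci, code, message)
```

On a live host the same function could be fed the text of `dmesg` captured to a file; here it reports the Xid 48 and Xid 63 events on PCI:0000:8a:00.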

@RenaudWasTaken
Contributor

As mentioned above, check that your driver is installed properly. You can try launching a CUDA application outside of containers, such as one of the CUDA samples; it will fail if the driver is not installed properly.
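
If the CUDA samples are not built on the host, a rough stand-in is to call the CUDA driver API directly. The Python sketch below loads `libcuda.so.1` with `ctypes` and calls `cuInit` and `cuDriverGetVersion`, which are real driver API entry points; the wrapper names `check_cuda_driver` and `decode_cuda_version` are hypothetical. It is a quick probe under those assumptions, not a replacement for running a real CUDA application.

```python
import ctypes


def decode_cuda_version(raw):
    """cuDriverGetVersion reports 1000 * major + 10 * minor (9010 is 9.1)."""
    return raw // 1000, (raw % 1000) // 10


def check_cuda_driver(lib_name="libcuda.so.1"):
    """Probe the host CUDA driver library outside of any container."""
    result = {"loaded": False, "init_code": None, "cuda_version": None}
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        # Library missing or not loadable: the driver install is suspect.
        return result
    result["loaded"] = True

    # CUresult cuInit(unsigned int Flags); 0 means CUDA_SUCCESS.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    result["init_code"] = libcuda.cuInit(0)

    # CUresult cuDriverGetVersion(int *driverVersion)
    libcuda.cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    libcuda.cuDriverGetVersion.restype = ctypes.c_int
    raw = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(raw)) == 0:
        result["cuda_version"] = decode_cuda_version(raw.value)
    return result
```

Calling `check_cuda_driver()` on the host and seeing `loaded` as `False`, or a nonzero `init_code`, would suggest the problem lies with the driver installation or the GPU itself rather than with nvidia-docker.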

Closing as there wasn't any follow-up. Feel free to re-open if you have any more information.
