This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Nvidia-container-cli: initialization error: driver error: timed out while executing "nvidia-docker run -it nvcr.io/nvidia/pytorch:20.03-py3" #1484

Closed
8 tasks done
anirudhb11 opened this issue Apr 8, 2021 · 1 comment

Comments


anirudhb11 commented Apr 8, 2021

Hi everyone, the issue I am facing is similar to the ones raised in #1133 and #628. I have persistence mode enabled, and I don't think this is a driver issue, because the container does launch successfully, but only very rarely.

1. Issue or feature description

On running the command:

nvidia-docker run -it nvcr.io/nvidia/pytorch:20.03-py3

I get the following output:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused "process_linux.go:407: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --video --require=cuda>=9.0 --pid=65049 /var/lib/docker/overlay2/0ef478925f89159ffbd6637a0b7afbdbdd69c78e31b6570a088529b5744632af/merged]\\nnvidia-container-cli: initialization error: driver error: timed out\\n\""": unknown.

2. Steps to reproduce the issue

  • docker pull nvcr.io/nvidia/pytorch:21.03-py3
  • nvidia-docker run -it nvcr.io/nvidia/pytorch:20.03-py3

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info

I0408 11:01:43.762921 7345 nvc.c:281] initializing library context (version=1.0.1, build=038fb92d00c94f97d61492d4ed1f82e981129b74)
I0408 11:01:43.763137 7345 nvc.c:255] using root /
I0408 11:01:43.763179 7345 nvc.c:256] using ldcache /etc/ld.so.cache
I0408 11:01:43.763218 7345 nvc.c:257] using unprivileged user 1011:1011
W0408 11:01:43.791108 7346 nvc.c:186] failed to set inheritable capabilities
W0408 11:01:43.791465 7346 nvc.c:187] skipping kernel modules load due to failure
I0408 11:01:43.793110 7347 driver.c:133] starting driver service
W0408 11:02:08.828662 7345 driver.c:220] terminating driver service (forced)
I0408 11:02:23.293544 7345 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out
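
To see how long the driver service ran before it was forcibly terminated, the timestamps in the log above can be subtracted. The following is a minimal sketch, not part of libnvidia-container: the helper names (`parse_events`, `seconds_between`) are hypothetical, the year 2021 is assumed from the date of this issue (the log prefix carries only month and day), and the regular expression assumes every log line follows the `I0408 11:01:43.793110 7347 driver.c:133] message` layout shown above.

```python
import re
from datetime import datetime

# Three driver.c lines copied verbatim from the log above.
LOG = """\
I0408 11:01:43.793110 7347 driver.c:133] starting driver service
W0408 11:02:08.828662 7345 driver.c:220] terminating driver service (forced)
I0408 11:02:23.293544 7345 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out
"""

# Severity letter, MMDD, HH:MM:SS.micros, pid, file:line], message.
LINE_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^\]]+)\] (.*)$"
)

ASSUMED_YEAR = "2021"  # assumption: the log prefix does not include a year


def parse_events(log):
    """Return (timestamp, message) pairs for lines matching the log layout."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. the final plain error line
        stamp = ASSUMED_YEAR + match.group(2) + " " + match.group(3)
        when = datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")
        events.append((when, match.group(6)))
    return events


def seconds_between(events, start_message, end_message):
    """Seconds elapsed between the first occurrences of two messages."""
    start = next(ts for ts, msg in events if msg == start_message)
    end = next(ts for ts, msg in events if msg == end_message)
    return (end - start).total_seconds()


events = parse_events(LOG)
waited = seconds_between(
    events, "starting driver service", "terminating driver service (forced)"
)
print(f"driver service ran for {waited:.2f} s before forced termination")
```

For the log above this reports about 25.04 seconds between starting the driver service and the forced termination, and the same helper shows a further 14.46 seconds or so before the service is reported as terminated with signal 15.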

  • Kernel version from uname -a

Linux 4.15.0-45-generic 48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Driver information from nvidia-smi

Thu Apr 8 16:38:13 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 30C P0 43W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

  • Docker version from docker version

Client:
Version: 18.09.2
API version: 1.39
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 04:13:47 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.2
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 6247962
Built: Sun Feb 10 03:42:13 2019
OS/Arch: linux/amd64
Experimental: false

  • NVIDIA packages version from dpkg -l '*nvidia*'

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================-===============-===============-================================================
un libgldispatch0-nvidia (no description available)
ii libnvidia-cfg1-410:am 410.104-0ubuntu amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any (no description available)
un libnvidia-common (no description available)
ii libnvidia-common-410 410.104-0ubuntu all Shared files used by the NVIDIA libraries
ii libnvidia-compute-410 410.104-0ubuntu amd64 NVIDIA libcompute package
ii libnvidia-container-t 1.0.1-1 amd64 NVIDIA container runtime library (command-line t
ii libnvidia-container1: 1.0.1-1 amd64 NVIDIA container runtime library
un libnvidia-decode (no description available)
ii libnvidia-decode-410: 410.104-0ubuntu amd64 NVIDIA Video Decoding runtime libraries
un libnvidia-diagnostic (no description available)
ii libnvidia-diagnostic- 410.104-0ubuntu amd64 NVIDIA driver diagnostics utilities
un libnvidia-encode (no description available)
ii libnvidia-encode-410: 410.104-0ubuntu amd64 NVENC Video Encoding runtime library
un libnvidia-fbc1 (no description available)
ii libnvidia-fbc1-410:am 410.104-0ubuntu amd64 NVIDIA OpenGL-based Framebuffer Capture runtime
un libnvidia-gl (no description available)
ii libnvidia-gl-410:amd6 410.104-0ubuntu amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and V
un libnvidia-ifr1 (no description available)
ii libnvidia-ifr1-410:am 410.104-0ubuntu amd64 NVIDIA OpenGL-based Inband Frame Readback runtim
un nvhealth-module-nvidi (no description available)
un nvidia-304 (no description available)
un nvidia-340 (no description available)
un nvidia-384 (no description available)
un nvidia-390 (no description available)
ii nvidia-compute-utils- 410.104-0ubuntu amd64 NVIDIA compute utilities
ii nvidia-container-runt 2.0.0+docker18. amd64 NVIDIA container runtime
ii nvidia-container-runt 1.4.0-1 amd64 NVIDIA container runtime hook
un nvidia-current-diagno (no description available)
ii nvidia-dkms-410 410.104-0ubuntu amd64 NVIDIA DKMS package
un nvidia-dkms-kernel (no description available)
un nvidia-docker (no description available)
ii nvidia-docker2 2.0.3+docker18. all nvidia-docker CLI wrapper
ii nvidia-driver-410 410.104-0ubuntu amd64 NVIDIA driver metapackage
un nvidia-driver-binary (no description available)
ii nvidia-headless-410 410.104-0ubuntu amd64 NVIDIA headless metapackage
ii nvidia-headless-no-dk 410.104-0ubuntu amd64 NVIDIA headless metapackage - no DKMS
un nvidia-kernel-common (no description available)
ii nvidia-kernel-common- 410.104-0ubuntu amd64 Shared files used with the kernel module
un nvidia-kernel-source (no description available)
ii nvidia-kernel-source- 410.104-0ubuntu amd64 NVIDIA kernel source package
ii nvidia-modprobe 410.104-0ubuntu amd64 Load the NVIDIA kernel driver and create device
un nvidia-opencl-icd (no description available)
ii nvidia-peer-memory 1.0-7 all nvidia peer memory kernel module.
ii nvidia-peer-memory-dk 1.0-7 all DKMS support for nvidia-peer-memory kernel modul
un nvidia-persistenced (no description available)
un nvidia-prime (no description available)
ii nvidia-settings 410.104-0ubuntu amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binar (no description available)
un nvidia-smi (no description available)
un nvidia-thea (no description available)
un nvidia-utils (no description available)
ii nvidia-utils-410 410.104-0ubuntu amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nv 410.104-0ubuntu amd64 NVIDIA binary Xorg driver

  • NVIDIA container library version from nvidia-container-cli -V

version: 1.0.1
build date: 2019-01-15T23:24+00:00
build revision: 038fb92d00c94f97d61492d4ed1f82e981129b74
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • Any relevant kernel output lines from dmesg

[1149264.422516] docker0: port 1(veth1497707) entered disabled state
[1149264.442891] device veth1497707 left promiscuous mode
[1149264.442923] docker0: port 1(veth1497707) entered disabled state

  • Docker command, image and tag used

Command: nvidia-docker run -it nvcr.io/nvidia/pytorch:20.03-py3
Image and tag: nvcr.io/nvidia/pytorch:20.03-py3
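
The information listed in section 3 can also be gathered in one pass. The sketch below is only an illustration, under the assumption that the commands named above are run on the affected host: `DIAGNOSTICS`, `run_diagnostic`, and `collect` are hypothetical names, and the 60-second timeout is an arbitrary choice. A command that is not installed is reported as skipped rather than raising an error. `nvidia-container-cli -k -d /dev/tty info` and `dmesg` are left out because the former writes its debug log to the terminal and the latter may need elevated privileges.

```python
import shlex
import shutil
import subprocess

# Commands taken from the list in section 3 of this report.
DIAGNOSTICS = [
    "uname -a",
    "nvidia-smi",
    "docker version",
    "dpkg -l '*nvidia*'",
    "nvidia-container-cli -V",
]


def run_diagnostic(command, timeout=60):
    """Run one command and return its combined output, or a note if it cannot run."""
    argv = shlex.split(command)  # keeps '*nvidia*' as a literal pattern for dpkg
    if shutil.which(argv[0]) is None:
        return f"(skipped: {argv[0]} not found on PATH)"
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"(timed out after {timeout}s)"
    return (done.stdout + done.stderr).strip()


def collect(commands=DIAGNOSTICS):
    """Map each command string to its captured output."""
    return {command: run_diagnostic(command) for command in commands}
```

Calling `collect()` and printing each key followed by its value produces a single text report that can be pasted into an issue. The per-command timeout matters for a report like this one, since a hanging driver could otherwise stall the whole collection.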

Member

elezar commented Apr 20, 2021

Hi @Anirudhb11021999, would it be possible to install a newer version of the NVIDIA container library and see if this problem persists?

elezar closed this as completed Oct 30, 2023