Driver timeout error during initialization while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi" #1133
Comments
You didn't attach the most important command. My guess is that your driver installation isn't set up correctly. You should try to install and run a normal CUDA sample (e.g. deviceQuery). Please re-install your driver: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
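For reference, a driver sanity check along the lines of deviceQuery can be written against the CUDA driver API. This is only an illustrative sketch (not part of the CUDA samples); it assumes cuda.h and libcuda.so are installed on the host and is built with something like "gcc check.c -o check -lcuda":

/* Sketch: check that the NVIDIA driver responds at all, outside of Docker.
 * cuInit() talks to the kernel driver, so a failure or hang here points at
 * the driver install rather than at nvidia-container-cli. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed: %d\n", (int)rc);
        return 1;
    }

    int count = 0;
    cuDeviceGetCount(&count);
    printf("CUDA devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[128] = { 0 };
        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        printf("  device %d: %s\n", i, name);
    }
    return 0;
}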
It's in the attached docker_issue_report_info.txt. I'll paste it again here:
Device Index: 0
Device Index: 1
On the host, I can successfully run a CUDA application named "acos". Again, I'm running the container in a virtualized QEMU OS of the FSF flow.
On my machine the command
Typically the timeout errors are seen because of a bad driver install.
Oh, I didn't realize those messages are important. I tried the command again, but this time it complains about a driver error:
-- WARNING, the following logs are for debugging purposes only --
I1125 09:57:37.507340 1396 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
Let me try to re-install the driver.
It looks "nvidia-container-cli -k -d /dev/tty info" failed just because I forgot to enable persistence mode. -- WARNING, the following logs are for debugging purposes only -- I1125 10:30:07.223710 1612 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836) Device Index: 0 Device Index: 1 By the way, after this, my container run still failed due to driver time out. root@fsf-linux-x64:/mnt/tmp/input# docker run --env CUDA_c702f783=0xa15c1ed9 --gpus 2 nvidia/cuda:9.0-base nvidia-smi since the simulator below is fmodel other than silicon, I expect slow gpu response. does this sounds make sense? |
This is the callstack I see:
#0 0x00007f8c575a2bc4 in __GI___poll (fds=fds@entry=0x7fff140b9098, nfds=nfds@entry=1, timeout=timeout@entry=25000)
I think the timeout happens on
Is this cuInit called through RPC? If yes, how can I increase the timeout limit in the poll? It looks to be the code in setup_rpc_client/CLSET_TIMEOUT, but the timeout value looks to be 10 in the code, so why is it 25000 in the callstack?
It looks like 25000 is passed in while calling clntunix_call(). Is clntunix_call() hardcoded when libnvidia-container.so.1 is linked against some dependency library?
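For what it's worth, a generic ONC/Sun RPC client over an AF_UNIX socket looks roughly like the sketch below. This is only an illustration against the legacy rpc/rpc.h interface, not libnvidia-container's actual setup_rpc_client code; the socket path and the program/version numbers are placeholders. The point is that a client-wide CLSET_TIMEOUT set via clnt_control() overrides the per-call timeout handed to clnt_call(), and a 25-second timeval would surface as timeout=25000 (milliseconds) in the poll() done by clntunix_call(), which would match the backtrace above:

/* Generic Sun RPC sketch -- NOT libnvidia-container's actual code.
 * Program number, version, and socket path are made up. */
#include <stdio.h>
#include <string.h>
#include <rpc/rpc.h>
#include <sys/time.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un addr;
    int sock = RPC_ANYSOCK;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/example-rpc.sock", sizeof(addr.sun_path) - 1);

    /* Placeholder program/version numbers. */
    CLIENT *clnt = clntunix_create(&addr, 0x20000001, 1, &sock, 0, 0);
    if (clnt == NULL) {
        clnt_pcreateerror("clntunix_create");
        return 1;
    }

    /* Client-wide total timeout: once this is set, the timeout argument of
     * clnt_call() below is ignored.  25 s ends up as a millisecond value
     * inside the transport's poll() loop. */
    struct timeval total = { .tv_sec = 25, .tv_usec = 0 };
    clnt_control(clnt, CLSET_TIMEOUT, (char *)&total);

    /* Per-call timeout: effectively unused after CLSET_TIMEOUT. */
    struct timeval percall = { .tv_sec = 10, .tv_usec = 0 };
    enum clnt_stat st = clnt_call(clnt, NULLPROC,
                                  (xdrproc_t)xdr_void, NULL,
                                  (xdrproc_t)xdr_void, NULL, percall);
    if (st != RPC_SUCCESS)
        clnt_perror(clnt, "NULLPROC call");

    clnt_destroy(clnt);
    return 0;
}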
I believe we resolved this through email.
Yes, it's resolved. Thanks for your help.
@silencekev @RenaudWasTaken Could you please share how you resolved this problem? I encountered the very same problem and don't know how to resolve it. My driver version is 470.57.02, and the GPU is an NVIDIA A100 80GB PCIe.
@thuzhf I recently had this same issue. |
Hi,
I'm trying to get nvidia-docker to work in a full-stack simulator flow. I hit a driver timeout issue while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi".
1. Issue or feature description
Here are the messages I see:
root@fsf-linux-x64:~/host-shared# docker run -it --rm --gpus 1 ubuntu /bin/bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\n\""": unknown.
ERRO[0098] error waiting for container: context canceled
I found a similar issue, #628, which was fixed by running nvidia-persistenced. I did the same thing and confirmed that persistence mode is enabled by running nvidia-smi on the host.
root@fsf-linux-x64:~/host-shared# nvidia-smi
Fri Nov 22 06:34:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.00       Driver Version: 445.00       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Graphics Device     On   | 00000000:00:01.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A / N/A  |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     On   | 00000000:00:02.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A / N/A  |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also checked the driver's dmesg output and don't see an obvious error.
2. Steps to reproduce the issue
This is running on a simulator rather than silicon. It's a long story to set up the simulator environment.
3. Information
docker_issue_report_info.txt
driver_info.log
dmesg.log
Since this is running on a simulator (fmodel), it's expected to be slow. Can persistence mode disable the timeout check? Is there any other way to disable the timeout check?