Driver timeout error during initialization while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi" #1133
Comments
You didn't attach the most important command. My guess is that your driver installation isn't set up correctly. You should try to install and run a normal CUDA sample (e.g. deviceQuery). Please re-install your driver: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
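For reference, a driver sanity check along the lines of deviceQuery can be written against the CUDA driver API. This is only an illustrative sketch (not part of the CUDA samples); it assumes cuda.h and libcuda.so are installed on the host and is built with something like "gcc check.c -o check -lcuda":

/* Sketch: check that the NVIDIA driver responds at all, outside of Docker.
 * cuInit() talks to the kernel driver, so a failure or hang here points at
 * the driver install rather than at nvidia-container-cli. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed: %d\n", (int)rc);
        return 1;
    }

    int count = 0;
    cuDeviceGetCount(&count);
    printf("CUDA devices: %d\n", count);

    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        char name[128] = { 0 };
        cuDeviceGet(&dev, i);
        cuDeviceGetName(name, sizeof(name), dev);
        printf("  device %d: %s\n", i, name);
    }
    return 0;
}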
It's in the attached docker_issue_report_info.txt. I'll paste it again here:
Device Index: 0
Device Index: 1
On the host, I can successfully run a CUDA application named "acos". Again, I'm running the container in a virtualized QEMU OS of the FSF flow.
On my machine the command
Typically the timeout errors are seen because of a bad driver install.
Oh, I didn't realize those messages are important. I tried the command again, but this time it complains about a driver error:
-- WARNING, the following logs are for debugging purposes only --
I1125 09:57:37.507340 1396 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
Let me try to re-install the driver.
It looks "nvidia-container-cli -k -d /dev/tty info" failed just because I forgot to enable persistence mode. -- WARNING, the following logs are for debugging purposes only -- I1125 10:30:07.223710 1612 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836) Device Index: 0 Device Index: 1 By the way, after this, my container run still failed due to driver time out. root@fsf-linux-x64:/mnt/tmp/input# docker run --env CUDA_c702f783=0xa15c1ed9 --gpus 2 nvidia/cuda:9.0-base nvidia-smi since the simulator below is fmodel other than silicon, I expect slow gpu response. does this sounds make sense? |
This is the callstack I see:
#0 0x00007f8c575a2bc4 in __GI___poll (fds=fds@entry=0x7fff140b9098, nfds=nfds@entry=1, timeout=timeout@entry=25000)
I think the timeout happens on
Is this cuInit called through RPC? If yes, how can I increase the timeout limit in the poll? It looks to be the code in setup_rpc_client/CLSET_TIMEOUT, but the timeout value looks to be 10 in the code, so why is it 25000 in the callstack?
It looks like 25000 is passed in while calling clntunix_call(). Is clntunix_call() hardcoded when libnvidia-container.so.1 is linked against some dependency library?
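For what it's worth, a generic ONC/Sun RPC client over an AF_UNIX socket looks roughly like the sketch below. This is only an illustration against the legacy rpc/rpc.h interface, not libnvidia-container's actual setup_rpc_client code; the socket path and the program/version numbers are placeholders. The point is that a client-wide CLSET_TIMEOUT set via clnt_control() overrides the per-call timeout handed to clnt_call(), and a 25-second timeval would surface as timeout=25000 (milliseconds) in the poll() done by clntunix_call(), which would match the backtrace above:

/* Generic Sun RPC sketch -- NOT libnvidia-container's actual code.
 * Program number, version, and socket path are made up. */
#include <stdio.h>
#include <string.h>
#include <rpc/rpc.h>
#include <sys/time.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un addr;
    int sock = RPC_ANYSOCK;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/example-rpc.sock", sizeof(addr.sun_path) - 1);

    /* Placeholder program/version numbers. */
    CLIENT *clnt = clntunix_create(&addr, 0x20000001, 1, &sock, 0, 0);
    if (clnt == NULL) {
        clnt_pcreateerror("clntunix_create");
        return 1;
    }

    /* Client-wide total timeout: once this is set, the timeout argument of
     * clnt_call() below is ignored.  25 s ends up as a millisecond value
     * inside the transport's poll() loop. */
    struct timeval total = { .tv_sec = 25, .tv_usec = 0 };
    clnt_control(clnt, CLSET_TIMEOUT, (char *)&total);

    /* Per-call timeout: effectively unused after CLSET_TIMEOUT. */
    struct timeval percall = { .tv_sec = 10, .tv_usec = 0 };
    enum clnt_stat st = clnt_call(clnt, NULLPROC,
                                  (xdrproc_t)xdr_void, NULL,
                                  (xdrproc_t)xdr_void, NULL, percall);
    if (st != RPC_SUCCESS)
        clnt_perror(clnt, "NULLPROC call");

    clnt_destroy(clnt);
    return 0;
}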
I believe we resolved this through email.
Yes, it's resolved. Thanks for your help.
@silencekev @RenaudWasTaken Could you please share how you resolved this problem? I encountered the very same problem and don't know how to resolve it. My driver version is 470.57.02, and the GPU is an NVIDIA A100 80GB PCIe.
@thuzhf I recently had this same issue. |
Hi,
I'm trying to get nvidia-docker to work in a full-stack simulator flow. I hit a driver timeout issue while running "docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi".
1. Issue or feature description
Here are the messages I see:
root@fsf-linux-x64:~/host-shared# docker run -it --rm --gpus 1 ubuntu /bin/bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\n\""": unknown.
ERRO[0098] error waiting for container: context canceled
I found a similar issue, #628, which was fixed by running nvidia-persistenced. I did the same thing and confirmed that persistence mode is enabled by running nvidia-smi on the host.
root@fsf-linux-x64:~/host-shared# nvidia-smi
Fri Nov 22 06:34:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.00       Driver Version: 445.00       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Graphics Device     On   | 00000000:00:01.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A / N/A  |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     On   | 00000000:00:02.0 Off |                  N/A |
| N/A  ERR!   N/A    N/A / N/A  |      0MiB /   860MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also checked the driver's dmesg output and don't see an obvious error.
2. Steps to reproduce the issue
This is running on a simulator rather than silicon. It's a long story to set up the simulator environment.
3. Information
docker_issue_report_info.txt
driver_info.log
dmesg.log
Since this is running on a simulator (fmodel), it's expected to be slow. Can persistence mode disable the timeout check? Is there any other way to disable the timeout check?