This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Cannot run docker with > 1 GPU #1145

Closed
6 tasks
hect1995 opened this issue Dec 5, 2019 · 1 comment


hect1995 commented Dec 5, 2019

1. Issue or feature description

I am following the instructions to install the NVIDIA Container Toolkit on my Ubuntu 18.04 system. The first command:
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
returns:

Thu Dec  5 14:21:05 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000M       On   | 00000000:01:00.0  On |                  N/A |
| N/A   46C    P0    25W / 100W |    601MiB /  8123MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

But as soon as I try to use more than one GPU, the container fails to start.

2. Steps to reproduce the issue

developer@Ubuntu106411:~$ docker run --gpus 1,2 nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: 1\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
    -- WARNING, the following logs are for debugging purposes only --

I1205 14:23:08.263725 5448 nvc.c:281] initializing library context (version=1.0.5, build=13b836390888f7b7c7dca115d16d7e28ab15a836)
I1205 14:23:08.263813 5448 nvc.c:255] using root /
I1205 14:23:08.263826 5448 nvc.c:256] using ldcache /etc/ld.so.cache
I1205 14:23:08.263838 5448 nvc.c:257] using unprivileged user 1000:1000
W1205 14:23:08.266165 5449 nvc.c:186] failed to set inheritable capabilities
W1205 14:23:08.266242 5449 nvc.c:187] skipping kernel modules load due to failure
I1205 14:23:08.266690 5450 driver.c:133] starting driver service
I1205 14:23:08.282948 5448 nvc_info.c:437] requesting driver information with ''
I1205 14:23:08.283513 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.440.33.01
I1205 14:23:08.283572 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.440.33.01
I1205 14:23:08.283620 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.440.33.01 over /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.440.33.01
I1205 14:23:08.283662 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.440.33.01
I1205 14:23:08.283716 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.440.33.01
I1205 14:23:08.283775 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.440.33.01
I1205 14:23:08.283834 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.440.33.01
I1205 14:23:08.283872 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.33.01
I1205 14:23:08.283930 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.440.33.01
I1205 14:23:08.283986 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.440.33.01
I1205 14:23:08.284025 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.440.33.01
I1205 14:23:08.284065 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.440.33.01
I1205 14:23:08.284104 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.440.33.01
I1205 14:23:08.284161 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.33.01
I1205 14:23:08.284202 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.440.33.01
I1205 14:23:08.284260 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.440.33.01
I1205 14:23:08.284300 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.440.33.01
I1205 14:23:08.284340 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.440.33.01
I1205 14:23:08.284399 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.440.33.01
I1205 14:23:08.284640 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01
I1205 14:23:08.284800 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.440.33.01
I1205 14:23:08.284848 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.440.33.01
I1205 14:23:08.284889 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.440.33.01
I1205 14:23:08.284931 5448 nvc_info.c:151] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.440.33.01
W1205 14:23:08.284974 5448 nvc_info.c:302] missing library libvdpau_nvidia.so
W1205 14:23:08.284983 5448 nvc_info.c:306] missing compat32 library libnvidia-ml.so
W1205 14:23:08.284991 5448 nvc_info.c:306] missing compat32 library libnvidia-cfg.so
W1205 14:23:08.284999 5448 nvc_info.c:306] missing compat32 library libcuda.so
W1205 14:23:08.285007 5448 nvc_info.c:306] missing compat32 library libnvidia-opencl.so
W1205 14:23:08.285014 5448 nvc_info.c:306] missing compat32 library libnvidia-ptxjitcompiler.so
W1205 14:23:08.285021 5448 nvc_info.c:306] missing compat32 library libnvidia-fatbinaryloader.so
W1205 14:23:08.285029 5448 nvc_info.c:306] missing compat32 library libnvidia-compiler.so
W1205 14:23:08.285037 5448 nvc_info.c:306] missing compat32 library libvdpau_nvidia.so
W1205 14:23:08.285044 5448 nvc_info.c:306] missing compat32 library libnvidia-encode.so
W1205 14:23:08.285052 5448 nvc_info.c:306] missing compat32 library libnvidia-opticalflow.so
W1205 14:23:08.285060 5448 nvc_info.c:306] missing compat32 library libnvcuvid.so
W1205 14:23:08.285067 5448 nvc_info.c:306] missing compat32 library libnvidia-eglcore.so
W1205 14:23:08.285077 5448 nvc_info.c:306] missing compat32 library libnvidia-glcore.so
W1205 14:23:08.285086 5448 nvc_info.c:306] missing compat32 library libnvidia-tls.so
W1205 14:23:08.285094 5448 nvc_info.c:306] missing compat32 library libnvidia-glsi.so
W1205 14:23:08.285102 5448 nvc_info.c:306] missing compat32 library libnvidia-fbc.so
W1205 14:23:08.285110 5448 nvc_info.c:306] missing compat32 library libnvidia-ifr.so
W1205 14:23:08.285118 5448 nvc_info.c:306] missing compat32 library libnvidia-rtcore.so
W1205 14:23:08.285125 5448 nvc_info.c:306] missing compat32 library libnvoptix.so
W1205 14:23:08.285133 5448 nvc_info.c:306] missing compat32 library libGLX_nvidia.so
W1205 14:23:08.285140 5448 nvc_info.c:306] missing compat32 library libEGL_nvidia.so
W1205 14:23:08.285149 5448 nvc_info.c:306] missing compat32 library libGLESv2_nvidia.so
W1205 14:23:08.285157 5448 nvc_info.c:306] missing compat32 library libGLESv1_CM_nvidia.so
W1205 14:23:08.285164 5448 nvc_info.c:306] missing compat32 library libnvidia-glvkspirv.so
I1205 14:23:08.285505 5448 nvc_info.c:232] selecting /usr/bin/nvidia-smi
I1205 14:23:08.285528 5448 nvc_info.c:232] selecting /usr/bin/nvidia-debugdump
I1205 14:23:08.285550 5448 nvc_info.c:232] selecting /usr/bin/nvidia-persistenced
I1205 14:23:08.285570 5448 nvc_info.c:232] selecting /usr/bin/nvidia-cuda-mps-control
I1205 14:23:08.285590 5448 nvc_info.c:232] selecting /usr/bin/nvidia-cuda-mps-server
I1205 14:23:08.285617 5448 nvc_info.c:369] listing device /dev/nvidiactl
I1205 14:23:08.285622 5448 nvc_info.c:369] listing device /dev/nvidia-uvm
I1205 14:23:08.285627 5448 nvc_info.c:369] listing device /dev/nvidia-uvm-tools
I1205 14:23:08.285632 5448 nvc_info.c:369] listing device /dev/nvidia-modeset
I1205 14:23:08.285663 5448 nvc_info.c:273] listing ipc /run/nvidia-persistenced/socket
W1205 14:23:08.285679 5448 nvc_info.c:277] missing ipc /tmp/nvidia-mps
I1205 14:23:08.285685 5448 nvc_info.c:493] requesting device information with ''
I1205 14:23:08.291588 5448 nvc_info.c:523] listing device /dev/nvidia0 (GPU-6513d2f3-8a16-0ebe-86f2-42cb6f8f0e4a at 00000000:01:00.0)
NVRM version: 440.33.01
CUDA version: 10.2

Device Index: 0
Device Minor: 0
Model: Quadro M5000M
Brand: Quadro
GPU UUID: GPU-6513d2f3-8a16-0ebe-86f2-42cb6f8f0e4a
Bus Location: 00000000:01:00.0
Architecture: 5.2
I1205 14:23:08.291622 5448 nvc.c:318] shutting down library context
I1205 14:23:08.292054 5450 driver.c:192] terminating driver service
I1205 14:23:08.384504 5448 driver.c:233] driver service terminated successfully

  • Driver information from nvidia-smi -a

Timestamp : Thu Dec 5 15:16:31 2019
Driver Version : 440.33.01
CUDA Version : 10.2

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : Quadro M5000M
Product Brand : Quadro
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-6513d2f3-8a16-0ebe-86f2-42cb6f8f0e4a
Minor Number : 0
VBIOS Version : 84.04.9B.00.0D
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x13F810DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x8109103C
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 1000 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 8123 MiB
Used : 601 MiB
Free : 7522 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 7 MiB
Free : 249 MiB
Compute Mode : Default
Utilization
Gpu : 3 %
Memory : 2 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 46 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 101 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 26.10 W
Power Limit : 100.00 W
Default Power Limit : 100.00 W
Enforced Power Limit : 100.00 W
Min Power Limit : 0.00 W
Max Power Limit : 100.00 W
Clocks
Graphics : 772 MHz
SM : 772 MHz
Memory : 2505 MHz
Video : 712 MHz
Applications Clocks
Graphics : 962 MHz
Memory : 2505 MHz
Default Applications Clocks
Graphics : 962 MHz
Memory : 2505 MHz
Max Clocks
Graphics : 1050 MHz
SM : 1050 MHz
Memory : 2505 MHz
Video : 966 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1192
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 96 MiB
Process ID : 1292
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 52 MiB
Process ID : 2153
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 270 MiB
Process ID : 2300
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 141 MiB
Process ID : 2663
Type : G
Name : /usr/share/skypeforlinux/skypeforlinux --type=gpu-process --disable-features=SpareRendererForSitePerProcess --gpu-preferences=KAAAAAAAAACAAAAAAQAAAAAAAAAAAGAAAAAAAAEAAAAIAAAAAAAAAAgAAAAAAAAA --service-request-channel-token=2826447589548459135
Used GPU Memory : 32 MiB

  • Docker version from docker version
    Client: Docker Engine - Community
    Version: 19.03.5
    API version: 1.40
    Go version: go1.12.12
    Git commit: 633a0ea838
    Built: Wed Nov 13 07:29:52 2019
    OS/Arch: linux/amd64
    Experimental: false

    Server: Docker Engine - Community
    Engine:
    Version: 19.03.5
    API version: 1.40 (minimum version 1.12)
    Go version: go1.12.12
    Git commit: 633a0ea838
    Built: Wed Nov 13 07:28:22 2019
    OS/Arch: linux/amd64

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    ||/ Name Version Architecture Description
    +++-===============================-====================-====================-====================================================================
    un libgldispatch0-nvidia (no description available)
    ii libnvidia-cfg1-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library
    un libnvidia-cfg1-any (no description available)
    un libnvidia-common (no description available)
    ii libnvidia-common-440 440.33.01-0ubuntu1 all Shared files used by the NVIDIA libraries
    ii libnvidia-compute-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA libcompute package
    ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
    ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
    un libnvidia-decode (no description available)
    ii libnvidia-decode-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries
    un libnvidia-encode (no description available)
    ii libnvidia-encode-440:amd64 440.33.01-0ubuntu1 amd64 NVENC Video Encoding runtime library
    un libnvidia-fbc1 (no description available)
    ii libnvidia-fbc1-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
    un libnvidia-gl (no description available)
    ii libnvidia-gl-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    un libnvidia-ifr1 (no description available)
    ii libnvidia-ifr1-440:amd64 440.33.01-0ubuntu1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
    un libnvidia-ml1 (no description available)
    un nvidia-304 (no description available)
    rc nvidia-340 340.107-0ubuntu0.18. amd64 NVIDIA binary driver - version 340.107
    un nvidia-384 (no description available)
    un nvidia-390 (no description available)
    un nvidia-common (no description available)
    ii nvidia-compute-utils-440 440.33.01-0ubuntu1 amd64 NVIDIA compute utilities
    un nvidia-container-runtime (no description available)
    un nvidia-container-runtime-hook (no description available)
    ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook
    ii nvidia-dkms-440 440.33.01-0ubuntu1 amd64 NVIDIA DKMS package
    un nvidia-dkms-kernel (no description available)
    ii nvidia-driver-440 440.33.01-0ubuntu1 amd64 NVIDIA driver metapackage
    un nvidia-driver-binary (no description available)
    un nvidia-kernel-common (no description available)
    ii nvidia-kernel-common-440 440.33.01-0ubuntu1 amd64 Shared files used with the kernel module
    un nvidia-kernel-source (no description available)
    ii nvidia-kernel-source-440 440.33.01-0ubuntu1 amd64 NVIDIA kernel source package
    un nvidia-legacy-340xx-vdpau-drive (no description available)
    un nvidia-libopencl1-340 (no description available)
    un nvidia-libopencl1-dev (no description available)
    ii nvidia-modprobe 440.33.01-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
    un nvidia-opencl-icd (no description available)
    rc nvidia-opencl-icd-340 340.107-0ubuntu0.18. amd64 NVIDIA OpenCL ICD
    un nvidia-persistenced (no description available)
    ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
    ii nvidia-settings 440.33.01-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
    un nvidia-settings-binary (no description available)
    un nvidia-smi (no description available)
    un nvidia-utils (no description available)
    ii nvidia-utils-440 440.33.01-0ubuntu1 amd64 NVIDIA driver support binaries
    un nvidia-vdpau-driver (no description available)
    ii xserver-xorg-video-nvidia-440 440.33.01-0ubuntu1 amd64 NVIDIA binary Xorg driver

  • NVIDIA container library version from nvidia-container-cli -V
    version: 1.0.5
    build date: 2019-09-06T16:59+00:00
    build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
    build compiler: x86_64-linux-gnu-gcc-7 7.4.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Contributor

RenaudWasTaken commented Dec 7, 2019

I might have this wrong, but you seem to have only one physical GPU attached to your machine :)
That would explain why you are having trouble running commands with more than one GPU.

Also note that device indices start at 0, so you should probably write one of the following (the first requests one GPU by count, the second selects the GPU at index 0):

developer@Ubuntu106411:~$ docker run --gpus 1 nvidia/cuda:9.0-base nvidia-smi
developer@Ubuntu106411:~$ docker run --gpus '"device=0"' nvidia/cuda:9.0-base nvidia-smi
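
As a rough illustration of the point about indices, the check below rejects a device list that refers to an index the host does not have. This is only a sketch: `valid_gpu_indices` is a hypothetical helper written for this thread, not part of Docker or the NVIDIA Container Toolkit, and it assumes the caller already knows how many GPUs the host reports.

```shell
#!/bin/sh
# valid_gpu_indices COUNT LIST
# Hypothetical helper (not shipped with Docker or nvidia-container-toolkit).
# Succeeds only if every comma-separated entry in LIST is a non-negative
# integer in the range 0..COUNT-1, i.e. a zero-based index that exists.
valid_gpu_indices() {
    count=$1
    list=$2
    # An empty list selects nothing, so treat it as invalid.
    [ -n "$list" ] || return 1
    old_ifs=$IFS
    IFS=,
    set -f    # disable globbing while the list is split on commas
    for idx in $list; do
        case $idx in
            # Reject empty entries and anything that is not all digits.
            ''|*[!0-9]*) IFS=$old_ifs; set +f; return 1 ;;
        esac
        # Indices are zero-based, so COUNT itself is already out of range.
        if [ "$idx" -ge "$count" ]; then
            IFS=$old_ifs; set +f
            return 1
        fi
    done
    IFS=$old_ifs
    set +f
    return 0
}
```

With one GPU attached, `valid_gpu_indices 1 0` succeeds while `valid_gpu_indices 1 1,2` fails, which mirrors the "unknown device id: 1" error in the report. On a real host the count could plausibly be taken from `nvidia-smi -L | wc -l` (one line per GPU on a machine like this one) before passing the list to `docker run --gpus '"device=..."'`.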
