
nvidia-docker in centos #667

Closed
olivier-dj opened this issue Mar 14, 2018 · 5 comments

Comments

@olivier-dj

Hello, I'm using a Fedora Atomic host with a system container running docker-ce 17.03.2. I also installed CUDA and nvidia-docker2 in the container, and deployed the system container via atomic install --system --system-package no --storage ostree --name docker docker.io/olivenwk/centos-docker:17.03.2-cuda (https://github.com/olivier-dj/atomic-system-containers/tree/update-docker-ce/docker-centos). The exports for CUDA are supposed to be functional inside the system container. docker info says:
Server Version: 17.03.2-ce
Runtimes: nvidia runc
Default Runtime: nvidia
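
For context, that runtime registration normally comes from /etc/docker/daemon.json. A sketch of what mine should roughly contain (the runtimes entry is what the nvidia-docker2 package ships; the default-runtime key matches the "Default Runtime: nvidia" line above):

    cat /etc/docker/daemon.json
    # {
    #     "default-runtime": "nvidia",
    #     "runtimes": {
    #         "nvidia": {
    #             "path": "/usr/bin/nvidia-container-runtime",
    #             "runtimeArgs": []
    #         }
    #     }
    # }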
Running `docker run -it nginx bash` "works" (it doesn't crash), but if I do
`docker run nvidia/cuda nvidia-smi`
I get:

/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:337: running prestart hook 1 caused "error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --device=all --compute --utility --require=cuda>=9.0 --pid=2377 /var/lib/docker/overlay2/cad8fca7bc7e5e606943ac111fea0ce4393e60ec7c01d8301cc1df737fd5a98c/merged]\nnvidia-container-cli: initialization error: driver error: failed to process request\n""

I'm quite sure /usr/bin/nvidia-container-cli is accessible inside the system container; otherwise I would get "nvidia-container-cli\": executable file not found" instead. The error message is not really clear :/
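
One thing that may help narrow it down: the prestart hook is essentially wrapping nvidia-container-cli, so the same call can be made by hand. A sketch of the direct invocation (flags per the CLI's own help):

    # run the CLI directly, bypassing docker: -k loads the kernel modules
    # like the hook does, -d sends the debug log to the terminal
    nvidia-container-cli -k -d /dev/tty info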

@flx42
Member

flx42 commented Mar 14, 2018

Can you provide the information in the issue template? https://github.com/NVIDIA/nvidia-docker/blob/master/.github/ISSUE_TEMPLATE.md#3-information-to-attach-optional-if-deemed-irrelevant

At least the log and the output of nvidia-smi.

@olivier-dj
Author

==============NVSMI LOG==============

Timestamp : Wed Mar 14 17:15:52 2018
Driver Version : 384.111

Attached GPUs : 1
GPU 00000000:00:05.0
    Product Name : GeForce GTX 1080 Ti
    Product Brand : GeForce
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 1920
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-50c71b0c-e681-fd3e-2c1e-aceec3447d68
    Minor Number : 0
    VBIOS Version : 86.02.39.00.22
    MultiGPU Board : No
    Board ID : 0x5
    GPU Part Number : N/A
    Inforom Version
        Image Version : G001.0000.01.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    PCI
        Bus : 0x00
        Device : 0x05
        Domain : 0x0000
        Device Id : 0x1B0610DE
        Bus Id : 00000000:00:05.0
        Sub System Id : 0x85E51043
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 4x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 0 %
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Active
        HW Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
    FB Memory Usage
        Total : 11172 MiB
        Used : 0 MiB
        Free : 11172 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : 3 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending : N/A
    Temperature
        GPU Current Temp : 47 C
        GPU Shutdown Temp : 96 C
        GPU Slowdown Temp : 93 C
        GPU Max Operating Temp : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 54.73 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 125.00 W
        Max Power Limit : 300.00 W
    Clocks
        Graphics : 139 MHz
        SM : 139 MHz
        Memory : 5508 MHz
        Video : 734 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 1911 MHz
        SM : 1911 MHz
        Memory : 5505 MHz
        Video : 1708 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes : None

======= uname -a
Linux test.novalocal 4.15.7-200.fc26.x86_64 #1 SMP Wed Feb 28 18:01:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
(but dockerd is running inside a CentOS system container)

======= docker version
Client:
Version: 1.13.1
API version: 1.26
Package version:
Go version: go1.8.4
Git commit: 584d391/1.13.1
Built: Thu Nov 23 21:40:58 2017
OS/Arch: linux/amd64

Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Package version:
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:21:36 2017
OS/Arch: linux/amd64
Experimental: false
======= nvidia-container-cli -V
version: 1.0.0
build date: 2018-03-06T02:05+0000
build revision: be797da00b156493e80f1ae6f38d69f23c932554
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

My system container has the debug line in /etc/nvidia-container-runtime/config.toml uncommented, but no logs seem to be present in either the container rootfs or on the host.
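
To double-check where that debug output is supposed to land, something along these lines (a sketch; the log path below is a guess, the value of the debug key in config.toml is what actually matters):

    # show the active (uncommented) debug settings:
    grep -v '^#' /etc/nvidia-container-runtime/config.toml | grep debug
    # then look for the file(s) those settings point to, in both the
    # container rootfs and on the host:
    ls -l /var/log/nvidia-container-*.log 2>/dev/null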

@olivier-dj
Author

olivier-dj commented Mar 14, 2018

For nvidia-container-cli -V I ran it in a temporary test container. The system container runs CentOS; the NVIDIA driver is on the host, but it should be accessible in the system container.
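
A quick way to verify that visibility from inside the system container (a sketch):

    # are the driver's userspace libraries and the CLI actually visible here?
    ldconfig -p | grep -i nvidia
    ls -l /usr/bin/nvidia-container-cli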

@flx42
Member

flx42 commented May 9, 2018

@olivier-dj did you solve the problem? Sorry for not answering earlier.

@olivier-dj
Author

Yes, right, excuse me. To find a workaround I did some work on the Dockerfile of the system container. First I mounted / to /host in the container. That may seem strange, in the sense that when we containerize an application it is often precisely to avoid side effects on the system; but according to a member of the Atomic team it's not that ugly, because system containers are meant to provide sometimes-critical services such as kernel modules, and in some cases we want access to the host system as if the module we provide were installed on the host itself.
Then I created a link for each library (32- and 64-bit) in the Docker lib directories, pointing at the host's NVIDIA libs via /host, for example ln -s /host/usr/lib/libcuda.so.${NVIDIA_VERSION} /usr/lib/ (a sketch of that loop follows the RUN block below). I also had to link the binaries:

RUN ln -s /host/usr/bin/nvidia-smi /usr/bin/ && \
        ln -s /host/usr/bin/nvidia-debugdump /usr/bin/ && \
        ln -s /host/usr/bin/nvidia-persistenced /usr/bin/ && \
        ln -s /host/usr/bin/nvidia-cuda-mps-control /usr/bin/ && \
        ln -s /host/usr/bin/nvidia-cuda-mps-server /usr/bin/
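
The library side looked roughly like this (a sketch from memory; the glob patterns and the NVIDIA_VERSION value, e.g. 384.111, are assumptions):

    # link the host's 32- and 64-bit NVIDIA libraries into the container
    for libdir in /usr/lib /usr/lib64; do
        for lib in /host${libdir}/lib*nvidia*.so.${NVIDIA_VERSION} \
                   /host${libdir}/libcuda.so.${NVIDIA_VERSION}; do
            [ -e "$lib" ] && ln -s "$lib" "${libdir}/"
        done
    done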

It was two months ago, so I don't remember exactly what the bottleneck was, but if I remember well, if one of the links was missing it wouldn't work (quite unexpected in the case of the 32-bit libraries). After that I probably had a problem with SELinux, and after solving that it worked.
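
For anyone hitting the SELinux part, checking for AVC denials is the usual first step (a sketch; assumes auditd is running):

    # look for recent SELinux denials:
    ausearch -m avc -ts recent
    # or temporarily rule SELinux in/out (testing only):
    getenforce
    setenforce 0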
