This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

[SOLVED] How to run nvidia-docker with TensorFlow GPU docker #45

Closed
bcordo opened this issue Feb 4, 2016 · 29 comments

Comments


bcordo commented Feb 4, 2016

Thanks for releasing the nvidia-docker repo, this is a really great idea and very useful!

What I've Done

I have setup an equivalent of a Nvidia DIGITS machine (running Ubuntu 14.04 server), and am attempting to run everything in docker containers.

  1. I have docker installed, and have run nvidia-docker run nvidia/cuda nvidia-smi as described here, and I see my four Titan X graphics cards.
  2. I have also run the nvidia-docker-plugin described here as sudo -u nvidia-docker nvidia-docker-plugin -s /var/lib/nvidia-docker and I get the output:
nvidia-docker-plugin | 2016/02/04 12:54:02 Loading NVIDIA management library
nvidia-docker-plugin | 2016/02/04 12:54:04 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/02/04 12:54:04 Discovering GPU devices
nvidia-docker-plugin | 2016/02/04 12:54:05 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2016/02/04 12:54:05 Serving plugin API at /var/lib/nvidia-docker
nvidia-docker-plugin | 2016/02/04 12:54:05 Serving remote API at localhost:3476

which signifies to me that it's working.

  3. I ran the tests here and they all passed.

My Problem

My problem occurs when I try to run the TensorFlow GPU docker image using nvidia-docker.

I first run sudo -u nvidia-docker nvidia-docker-plugin -s /var/lib/nvidia-docker in a tmux session.

Then I run nvidia-docker run -it -p 8888:8888 b.gcr.io/tensorflow/tensorflow-devel-gpu, which downloads everything and starts the docker container. Next I run ipython and try to import tensorflow, but I get the following errors:

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:92] LD_LIBRARY_PATH: /usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:121] hostname: 16b84b6e71f9
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:146] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:257] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.79  Wed Jan 13 16:17:53 PST 2016
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel reported version is: 352.79
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1054] LD_LIBRARY_PATH: /usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1055] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so; dlerror: libcuda.so: cannot open shared object file: No such file or directory

I think I just have a gap in understanding about how I should run the TensorFlow container, or maybe I have to build the container using nvidia-docker.

Any ideas about how to do this, or general advice about what I'm doing wrong, would be amazing.

Thanks so much.

Brad

Member

3XX0 commented Feb 4, 2016

This looks somewhat related to #44.
Unfortunately, as @flx42 mentioned, the TensorFlow image on the container registry is outdated.
Your best bet is to rebuild the TensorFlow image manually (i.e. don't use the one on b.gcr.io).

How you build it doesn't really matter; either nvidia-docker or docker will do.

Member

flx42 commented Feb 4, 2016

Yes, I highly recommend building the images manually, especially since the TensorFlow code moves fast and the Docker images are now a bit old.

Author

bcordo commented Feb 4, 2016

@3XX0 and @flx42 thanks so much for the really quick reply.

I went ahead with your advice and tried to build the docker images in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker. I ran the command:
docker build -t $USER/tensorflow-suffix -f Dockerfile.gpu .

This fails with the error:

Step 7 : RUN pip --no-cache-dir install     https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
 ---> Running in 91e99a6e00b5
Collecting tensorflow==0.6.0 from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:315: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pip/basecommand.py", line 209, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/dist-packages/pip/commands/install.py", line 299, in run
    requirement_set.prepare_files(finder)
  File "/usr/local/lib/python2.7/dist-packages/pip/req/req_set.py", line 359, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/local/lib/python2.7/dist-packages/pip/req/req_set.py", line 576, in _prepare_file
    session=self.session, hashes=hashes)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 809, in unpack_url
    hashes=hashes
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 648, in unpack_http_url
    hashes)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 841, in _download_http_url
    stream=True,
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 377, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
The command '/bin/sh -c pip --no-cache-dir install     https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl' returned a non-zero code: 2

I got the same error running docker build -t $USER/tensorflow-suffix -f Dockerfile.devel-gpu . and docker build -t $USER/tensorflow-suffix -f Dockerfile .

This may be a general docker error, but I've been searching and thinking about solutions and have found none.

Thanks for your time. Hopefully, this will be helpful for others as well.

Member

3XX0 commented Feb 4, 2016

Weird, it looks like the SSL CAs are outdated or something.
Can you try adding update-ca-certificates at the beginning of the RUN command in the Dockerfile:

RUN update-ca-certificates && pip --no-cache-dir install ...
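A slightly more thorough variant (an assumption on my part: a Debian/Ubuntu base image, so apt-get is available) reinstalls the ca-certificates package first, to rule out a stale or missing CA bundle:

```dockerfile
# Assumes a Debian/Ubuntu base image: refresh the CA bundle package
# itself before updating certificates and running the original pip install.
RUN apt-get update && apt-get install -y ca-certificates && \
    update-ca-certificates && \
    pip --no-cache-dir install ...
```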

Author

bcordo commented Feb 4, 2016

Good idea.

Unfortunately, I still get the error:

Step 6 : ENV TENSORFLOW_VERSION 0.6.0
 ---> Using cache
 ---> 11eba5b56bca
Step 7 : RUN update-ca-certificates && pip --no-cache-dir install     https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
 ---> Running in 2344eaf2e522
Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d....done.
Collecting tensorflow==0.6.0 from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f02730f6b50>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
  Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f02730f6090>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
  Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f02730f6a10>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
  Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f02730f68d0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
  Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f02730f6a90>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:315: SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
  SNIMissingWarning
/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pip/basecommand.py", line 209, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/dist-packages/pip/commands/install.py", line 299, in run
    requirement_set.prepare_files(finder)
  File "/usr/local/lib/python2.7/dist-packages/pip/req/req_set.py", line 359, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/local/lib/python2.7/dist-packages/pip/req/req_set.py", line 576, in _prepare_file
    session=self.session, hashes=hashes)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 809, in unpack_url
    hashes=hashes
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 648, in unpack_http_url
    hashes)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 841, in _download_http_url
    stream=True,
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/download.py", line 377, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/requests/adapters.py", line 447, in send
    raise SSLError(e, request=request)
SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
The command '/bin/sh -c update-ca-certificates && pip --no-cache-dir install     https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl' returned a non-zero code: 2

It looks like the certs didn't get updated: "Updating certificates in /etc/ssl/certs... 0 added, 0 removed; done."

Thanks for the suggestions.

Author

bcordo commented Feb 4, 2016

I wonder if it could be something to do with HTTPS on the host server. I booted up a droplet, tried again, and got the same SSL error.

Member

3XX0 commented Feb 5, 2016

Not sure, it looks like a DNS problem now: Name or service not known.
Are you behind a proxy?

Author

bcordo commented Feb 5, 2016

Interesting.

I ran env | grep -i proxy and cat /etc/environment with no output, so I don't think the server is behind a proxy. It's running on university wifi, so I can try running it over ethernet (I'll need to go across the street) and see if that helps.
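For anyone hitting the same thing, these checks can be scripted as a quick diagnostic; this is just a sketch (the hostname is the one from the failing download, and getent assumes a glibc system):

```shell
# Look for proxy settings in the current environment; grep exits
# non-zero when nothing matches, so the echo marks the "no proxy" case.
env | grep -i proxy || echo "no proxy variables set"

# Then test DNS resolution for the host the build was failing on.
getent hosts storage.googleapis.com || echo "DNS lookup failed"
```

If the lookup fails on the host too, the problem is outside Docker; if it only fails inside a build container, the container's DNS configuration is the suspect.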

Author

bcordo commented Feb 5, 2016

Also, if I run update-ca-certificates && pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl on the host machine, it downloads just fine.

Member

3XX0 commented Feb 5, 2016

I just tried, same issue here.
Looks like an issue with the Tensorflow image.

Contributor

ruffsl commented Feb 5, 2016

I just inserted a RUN curl -O https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl to download it and pointed pip at the local file. I'm not sure what's up with the SSL, but they hint at it here in a readme.
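As a sketch, that workaround amounts to two RUN lines (TENSORFLOW_VERSION is assumed to be set by an earlier ENV line, as in the stock Dockerfile.gpu):

```dockerfile
# Download the wheel with curl (whose TLS stack succeeds here where
# pip's does not), then point pip at the local file.
RUN curl -O https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
RUN pip --no-cache-dir install tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
```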

Author

bcordo commented Feb 5, 2016

@ruffsl really great idea! That worked: I built the image using docker build -t $USER/tensorflow-gpu2 -f Dockerfile.gpu ., then ran a container from it with nvidia-docker run -it -p 8888:8888 brad/tensorflow-gpu2. For some reason it still doesn't find libcuda.so, but loading libcublas.so, libcudnn.so, libcufft.so, and libcurand.so works just fine.

brad@truegpu:~/docker_test/docker$ nvidia-docker run -it -p 8888:8888 brad/tensorflow-gpu2
root@d5b7b996d9c6:~# ipython
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:93] Couldn't open CUDA library libcuda.so. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:121] hostname: d5b7b996d9c6
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:146] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:257] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.79  Wed Jan 13 16:17:53 PST 2016
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel reported version is: 352.79
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1060] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1061] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so; dlerror: libcuda.so: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally

The interesting thing is that if I run

root@87d1c13a3db8:~# ls /usr/local/nvidia/lib64/ | grep libcuda
libcuda.so.1
libcuda.so.352.79

These two libcuda libraries exist, but neither is named "libcuda.so" exactly; maybe this is just a naming issue?

Hmmm...
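The naming hypothesis is easy to demonstrate in a scratch directory (paths here are stand-ins for the real driver volume, not the actual fix): dlopen("libcuda.so") matches only that exact filename, so the versioned files alone are not enough until an unversioned symlink exists.

```shell
# Stand-ins for the two versioned driver files seen above.
demo=$(mktemp -d)
touch "$demo/libcuda.so.1" "$demo/libcuda.so.352.79"
ls "$demo" | grep -c 'libcuda\.so$' || true   # prints 0: no bare "libcuda.so"
# Adding an unversioned symlink provides the exact name dlopen asks for.
ln -s "$demo/libcuda.so.1" "$demo/libcuda.so"
ls "$demo" | grep -c 'libcuda\.so$'           # prints 1
```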

Author

bcordo commented Feb 5, 2016

Just found your comment here: tensorflow/tensorflow#808 (comment)

Which solved it! Thanks so much for the help @3XX0, @flx42, and @ruffsl. Really appreciate it.

Brad

Author

bcordo commented Feb 5, 2016

For posterity here is the Dockerfile that eventually worked for me (based on the above feedback):

FROM nvidia/cuda:7.0-cudnn2-runtime

MAINTAINER Craig Citro <craigcitro@google.com>

# Pick up some TF dependencies
RUN apt-get update && apt-get install -y \
        curl \
        libfreetype6-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        python-numpy \
        python-pip \
        python-scipy \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

RUN pip --no-cache-dir install \
        ipykernel \
        jupyter \
        matplotlib \
        && \
    python -m ipykernel.kernelspec

# Install TensorFlow GPU version.
ENV TENSORFLOW_VERSION 0.6.0
RUN curl -O https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
RUN pip --no-cache-dir install \
    tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl

# Set up our notebook config.
COPY jupyter_notebook_config.py /root/.jupyter/

# Jupyter has issues with being run directly:
#   https://github.com/ipython/ipython/issues/7062
# We just add a little wrapper script.
COPY run_jupyter.sh /

# Create the correct path for libcuda so TensorFlow can open it
RUN ln -s /usr/local/nvidia/lib64/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so

# TensorBoard
EXPOSE 6006
# IPython
EXPOSE 8888

WORKDIR "/root"

CMD ["/bin/bash"]

In particular the lines

# Create the correct path for libcuda so TensorFlow can open it
RUN ln -s /usr/local/nvidia/lib64/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so

and

RUN curl -O https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
RUN pip --no-cache-dir install \
    tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl

were modified.

@bcordo bcordo changed the title How to run nvidia-docker with TensorFlow GPU docker [SOLVED] How to run nvidia-docker with TensorFlow GPU docker Feb 5, 2016
@3XX0 3XX0 closed this as completed Feb 5, 2016
Contributor

ruffsl commented Feb 5, 2016

No, thank you @bcordo, those are some small but working fixes. I just tested the Dockerfile below (to avoid another compile) with this example and it's working fine.

FROM b.gcr.io/tensorflow/tensorflow-devel-gpu
RUN ln -s /usr/local/nvidia/lib64/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="7.0"

Member

3XX0 commented Feb 5, 2016

Hopefully, TensorFlow will fix these issues.
In the meantime, that's a convenient workaround.

Member

flx42 commented Feb 17, 2016

TensorFlow 0.7 was released yesterday, and the new images have the proper volume labels. But the libcuda.so problem is apparently still there.
Unfortunately, they also missed our image refresh this morning after the security fix for glibc.

Their GitHub is referencing old images:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker

Those are the correct images:
https://www.tensorflow.org/versions/r0.7/get_started/os_setup.html#docker-installation

@cancan101

Agreed, the issue still exists. I had to run ln -s /usr/local/nvidia/lib64/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so to get this to work.


jendap commented Feb 18, 2016

@ruffsl How does the LABEL work? Why do you have the following above?
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="7.0"

Isn't it already coming from the parent image "FROM nvidia/cuda"?

Contributor

ruffsl commented Feb 18, 2016

@jendap, the first TensorFlow images on b.gcr.io that used the nvidia/cuda images were built before NVIDIA added the labels used by the nvidia-docker plugin. TensorFlow has since rebuilt these images with an updated parent image, inheriting the needed labels, so what I wrote above should no longer be necessary. Hopefully the CUDA linking issue in the TensorFlow GPU images will also be resolved, so we can go back to using stock builds from Google.


jendap commented Feb 18, 2016

Cool, thanks. CUDA linking? Do you mean the "ln -s ..."?

BTW: Are the labels supposed to tell me I have the wrong CUDA version before it even starts the container?

Contributor

ruffsl commented Feb 18, 2016

Linking, or ln -s ..., yes.

I think the label is only used to make sure your host's driver is compatible with the version of CUDA in the container: /tools/src/nvidia-docker/utils.go#L56
But one of the devs could correct me on that.


jendap commented Feb 18, 2016

It would be great if it complained about the host driver being too old! I'll have to try it.

Member

3XX0 commented Feb 18, 2016

Yes, it helps us prevent running CUDA containers that are not supported by the driver.
It's particularly useful when deploying something remotely.


philipz commented Jul 21, 2016

Copy libcudnn.so.XXXX to /var/lib/nvidia-docker/volumes/nvidia_driver/3xx.xx, then sudo ln -s libcudnn.so.5.0.5 libcudnn.so. After that, cuDNN will work in the TensorFlow container.

Member

flx42 commented Jul 22, 2016

@philipz it's not a good idea to clobber the volume directory; not all containers are based on cuDNN v5. TensorFlow, for instance, uses cuDNN v4.

By the way, the TensorFlow images now work just fine (partly because the gpu image now depends on devel instead of runtime):

$ nvidia-docker run --rm tensorflow/tensorflow:nightly-devel-gpu python -c 'import tensorflow as tf ; print tf.__version__'
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
0.9.0
$ nvidia-docker run --rm tensorflow/tensorflow:nightly-gpu python -c 'import tensorflow as tf ; print tf.__version__'
[same]

So, no symlinks needed, I believe.

@tobegit3hub

Hi @3XX0 and @flx42. Is it possible to run this with docker instead of nvidia-docker? We are running GPU containers with Kubernetes and have a similar problem to the one discussed above. It works for me with nvidia-docker but not with docker.

Member

flx42 commented Aug 15, 2016

@tobegit3hub Sure, there is a way: our nvidia-docker-plugin daemon (which works as a Docker volume plugin) has a REST API:

$ curl -s http://localhost:3476/docker/cli
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia3 --device=/dev/nvidia2 --device=/dev/nvidia1 --device=/dev/nvidia0 --volume-driver=nvidia-docker --volume=nvidia_driver_361.48:/usr/local/nvidia:ro
$ docker run -ti --rm `curl -s http://localhost:3476/docker/cli` nvidia/cuda nvidia-smi

See our wiki

@tobegit3hub

Thanks @flx42. I found another way to do it, by mounting the devices and CUDA libraries. nvidia-docker is still the easiest way, though; thanks for all your contributions.
