[SOLVED] How to run nvidia-docker with TensorFlow GPU docker #45
This looks somewhat related to #44. How you build it doesn't really matter.
Yes, I highly recommend building the images manually, especially since the TensorFlow code moves fast and the Docker images are now a bit old.
@3XX0 and @flx42 thanks so much for the really quick reply. I went ahead with your advice and tried to build the Docker images in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker. I ran the command, to which I get the error:
I got the same error running it. This may be a general Docker error, but I've been searching and thinking about solutions and have found none. Thanks for your time. Hopefully this will be helpful for others as well.
Weird, it looks like the SSL CAs are outdated or something.
Good idea. Unfortunately, I still get the error:
It looks like the certs didn't get updated. Thanks for the suggestions.
Not sure, it looks like a DNS problem now:
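If DNS is the suspect, a quick way to confirm is to test name resolution from inside the container with `getent`, which goes through the same resolver path (`/etc/nsswitch.conf`, `/etc/resolv.conf`) as the failing download. This is a diagnostic sketch, not a command from the thread; `localhost` is just a sanity-check hostname:

```shell
# Check that the resolver works at all; swap in the host the build fetches from
if getent hosts localhost >/dev/null; then
    echo "resolver OK"
else
    echo "resolver broken"
fi
```

If `localhost` resolves but the package mirror does not, the container's `/etc/resolv.conf` is the usual culprit.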
Interesting. I ran:
Also, if I run:
I just tried, same issue here.
I just inserted a
@ruffsl really great idea! That worked; I built the image using:
The interesting thing is that if I run:
There exist these two libcuda packages, but they are not called "libcuda.so"; maybe this is just a naming issue? Hmmm...
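For anyone hitting the same naming question, one way to see which libcuda files are actually visible inside the container is to ask the dynamic linker cache and search the usual locations directly. A diagnostic sketch (the paths are the typical Ubuntu ones and may differ on your system):

```shell
# List any libcuda entries the dynamic linker cache knows about
ldconfig -p 2>/dev/null | grep libcuda || echo "no libcuda registered with ldconfig"

# Search the usual driver/toolkit locations directly
find /usr/lib /usr/local/nvidia -name 'libcuda*' 2>/dev/null || true
```

If `libcuda.so.1` shows up but plain `libcuda.so` does not, the build is missing the dev-style symlink discussed below.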
Just found your comment here: tensorflow/tensorflow#808 (comment), which solved it! Thanks so much for the help @3XX0, @flx42, and @ruffsl. Really appreciate it. Brad
For posterity, here is the Dockerfile that eventually worked for me (based on the above feedback):
In particular, the lines
and
were modified.
No, thank you @bcordo, those are some small but working fixes. I just tested the Dockerfile below (to avoid another compile) with this example and it's working fine.

```dockerfile
FROM b.gcr.io/tensorflow/tensorflow-devel-gpu
RUN ln -s /usr/local/nvidia/lib64/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
LABEL com.nvidia.volumes.needed="nvidia_driver"
LABEL com.nvidia.cuda.version="7.0"
```
Hopefully, TensorFlow will fix these issues.
Since TensorFlow 0.7 was released yesterday, the new images have the proper volume tags. However, their GitHub is referencing old images. These are the correct images:
Agreed that the issue still exists. I had
@ruffsl How does the LABEL work? Why do you have it above? Is it not coming from the parent image `FROM nvidia/cuda` already?
@jendap, the first TensorFlow images up on b.gcr.io using the nvidia/cuda images were built before NVIDIA made the change to add the labels used by the nvidia-docker plugin. TensorFlow has since rebuilt these images, updating the parent image and inheriting the needed labels, so what I wrote above should no longer be necessary. Hopefully the CUDA linking issue in the TensorFlow GPU images will also be resolved so we can use stock builds from Google.
Cool, thanks. CUDA linking? Do you mean the `ln -s ...`? BTW: are the labels supposed to tell me I have the wrong CUDA version before it even starts the container?
The linking, yes. I think the label is only used to make sure your host's driver is compatible with the version of CUDA in the container: /tools/src/nvidia-docker/utils.go#L56
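That check amounts to comparing the image's `com.nvidia.cuda.version` label against the highest CUDA version the host driver supports. A rough shell equivalent (a hypothetical sketch for illustration; the real logic lives in utils.go and is not this code):

```shell
# Succeed if the container's CUDA version ($1) is <= the maximum CUDA
# version the host driver supports ($2). sort -V compares version-aware.
cuda_supported() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

cuda_supported 7.0 7.5 && echo "CUDA 7.0 container OK on a 7.5-capable driver"
cuda_supported 7.5 7.0 || echo "CUDA 7.5 container rejected on a 7.0-capable driver"
```

So an image labeled `7.0` runs on any driver supporting 7.0 or newer, and is refused otherwise, before the container starts.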
That would be great if it complained about the host driver being too old! I'll have to try it.
Yes, it helps us prevent running CUDA containers that are not supported by the driver.
Copy the
@philipz it's not a good idea to clobber the volume directory; not all containers are based on cuDNN v5. TensorFlow, for instance, is using cuDNN v4. By the way, the TensorFlow images now work just fine (partly because the ...).
So, no symlinks needed, I believe.
@tobegit3hub Sure, there is a way; our
See our wiki.
Thanks @flx42. I found another way to do that, by mounting the devices and CUDA libraries myself.
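For reference, that manual approach boils down to passing the NVIDIA device nodes and library paths to plain `docker run` yourself. A rough sketch (my own illustration, not the exact command from the thread; device and library paths are the typical ones on Ubuntu 14.04 and may differ on your system):

```shell
# Build docker run flags for manual GPU access (hypothetical sketch)
build_gpu_args() {
    args="--device=/dev/nvidiactl --device=/dev/nvidia-uvm"
    # One --device flag per GPU node present on the host
    for dev in /dev/nvidia[0-9]*; do
        [ -e "$dev" ] && args="$args --device=$dev"
    done
    # Bind-mount the host CUDA toolkit and driver library read-only
    args="$args -v /usr/local/cuda:/usr/local/cuda:ro"
    args="$args -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1:ro"
    printf '%s' "$args"
}

echo "docker run -it $(build_gpu_args) b.gcr.io/tensorflow/tensorflow-devel-gpu"
```

This is essentially what nvidia-docker automates, including discovering the device nodes and driver volume, so the wrapper remains the less error-prone option.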
Thanks for releasing the nvidia-docker repo, this is a really great idea and very useful!
What I've Done
I have set up the equivalent of an NVIDIA DIGITS machine (running Ubuntu 14.04 server), and am attempting to run everything in Docker containers.
I can run `nvidia-docker run nvidia/cuda nvidia-smi` as described here, and I see my 4 TitanX graphics cards. I can also run `sudo -u nvidia-docker nvidia-docker-plugin -s /var/lib/nvidia-docker` and I get the output, which signifies to me that it's working.
My Problem
I first run `sudo -u nvidia-docker nvidia-docker-plugin -s /var/lib/nvidia-docker` in a tmux session. Then I run `nvidia-docker run -it -p 8888:8888 b.gcr.io/tensorflow/tensorflow-devel-gpu`, which downloads everything and runs the Docker container. Next I run ipython and try to import tensorflow, but I get the following errors.

**I think I just have a lack of understanding about how I should run the TensorFlow container, or maybe I have to build the container using nvidia-docker. Any ideas about how to do this, or general advice about what I'm doing wrong, would be amazing.**
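Import errors like this usually mean the loader cannot find the CUDA libraries inside the container. A small sketch for checking whether a library is present on an `LD_LIBRARY_PATH`-style search path (the function and example path are illustrative, not from the thread):

```shell
# Print the first directory on a colon-separated path list ($2) that
# contains the file named in $1; fail if none does.
find_on_path() {
    oldifs=$IFS; IFS=':'
    for dir in $2; do
        if [ -n "$dir" ] && [ -e "$dir/$1" ]; then
            IFS=$oldifs
            echo "$dir"
            return 0
        fi
    done
    IFS=$oldifs
    return 1
}

# Inside the container you would pass "$LD_LIBRARY_PATH" instead
find_on_path libcuda.so.1 "/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu" \
    || echo "libcuda.so.1 not on the search path"
```

If the library is missing from every directory on the path, the container was started without the driver volume, which is exactly what running the image through nvidia-docker (with the plugin active) is meant to fix.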
Thanks so much.
Brad