This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

libcuda.so can not be found in /usr/local/cuda/lib64 when building mxnet in nvidia/cuda docker #37

Closed
ps-account opened this issue Jan 21, 2016 · 10 comments

@ps-account

I am trying to write a Dockerfile that compiles mxnet with nvidia-docker, based on the nvidia/cuda image. mxnet uses the variable

USE_CUDA_PATH

in its make script to set the location of the CUDA libraries; it seems to ignore LD_LIBRARY_PATH.
Usually you would set this to /usr/local/cuda/lib64/; the libcudnn.so library can indeed be found there, for example.
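For context, a sketch of how such a build is typically invoked (assuming the usual mxnet make flags; the path is the common default):

```shell
# USE_CUDA_PATH points make at the CUDA install directory;
# LD_LIBRARY_PATH is not consulted by the makefile.
make -j4 USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda
```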

In the nvidia/cuda Docker image, however, there is no libcuda.so in /usr/local/cuda/lib64; instead it seems to be located in /usr/local/nvidia/lib64/.

Funnily enough, when I ln -s this libcuda.so.1 into /usr/local/cuda/lib64, the build does succeed from within nvidia-docker run nvidia/cuda, but the same command fails with a "-lcuda not found" error during "nvidia-docker build ..."

Is there a way to get libcuda.so in the /usr/local/cuda/lib64 directory during the nvidia-docker build?

@ps-account
Author

I made a workaround by using the stub libcuda.so during the build.

At runtime I copy the libraries from /usr/local/nvidia/lib64/ before calling mxnet from R.

Is this the correct way to do it? Are there alternative approaches?
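A minimal sketch of that workaround as Dockerfile instructions (assuming the nvidia/cuda 7.5 layout; the config.mk edit and the R command are placeholders):

```dockerfile
# Build time: link against the driver stub shipped with the CUDA toolkit.
# The stub only satisfies -lcuda at link time; it must not be used at run time.
RUN echo "ADD_LDFLAGS = -L/usr/local/cuda/lib64/stubs" >> /mxnet/config.mk && \
    make -C /mxnet

# Run time: nvidia-docker mounts the real driver under /usr/local/nvidia/lib64,
# so refresh the linker cache (or rely on LD_LIBRARY_PATH) before calling R.
CMD ldconfig && Rscript /mxnet/run_model.R
```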

@3XX0
Member

3XX0 commented Jan 21, 2016

Not sure why, but it looks like mxnet uses both the CUDA runtime API (libcudart.so) and the CUDA driver API (libcuda.so). libcudart.so is linked automatically by nvcc, so you're fine on the runtime side.
The CUDA driver, though, is only present in the container at runtime (in /usr/local/nvidia/lib64), so, as you figured out, you need to compile the code against the libcuda.so stub (/usr/local/cuda/lib64/stubs) when you build the container.

At runtime, you have two solutions:

  1. If nothing has overridden LD_LIBRARY_PATH, there is nothing to do, because the nvidia/cuda image sets it properly.
  2. If something tampered with LD_LIBRARY_PATH, the easiest fix is to execute ldconfig before your command:
CMD ldconfig && <MXNET_COMMAND>

@3XX0
Member

3XX0 commented Jan 21, 2016

So after further review, we are missing the CUDA driver stubs in our CUDA images.
Not sure why; that's something we need to fix.

@ps-account
Author

Thanks for the quick reply!

In the 7.5 image I could only find the CUDA driver stub at:

/usr/local/cuda-7.5/targets/x86_64-linux/lib/stubs/libcuda.so

I suppose there should be symbolic links under /usr/local/cuda etc.

I couldn't find any documentation on how to compile and then run code within the image; maybe it would be an idea to put that somewhere in the README.md file?

I will try out the CMD ldconfig && approach.

@3XX0
Member

3XX0 commented Jan 21, 2016

My bad, my image was corrupted; we do include it.

Compiling/running code is done through your Dockerfile (see the documentation).
In your case, I'm guessing it would look like this:

FROM nvidia/cuda:cudnn

RUN git clone <MXNET_REPO>

RUN sed <MXNET_CONFIG>
# Something along these lines
# ADD_LDFLAGS = -L /usr/local/cuda/lib64/stubs
# USE_CUDA = 1
# USE_CUDNN = 1

RUN make

CMD <MXNET_COMMAND>

@ps-account
Author

Thanks for the helpful pointers!

The nvidia-docker wrapper works pretty great!

@ljstrnadiii

@3XX0, I am having a related problem. I use

FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04

but there is no libcuda.so file to be found anywhere. I searched with:

sudo find /usr/ -name 'libcuda.so.1'

but no luck. Any idea what I am doing wrong? tensorflow 1.0.0 used to import but just said it couldn't find the library; now with 1.2.0, it will not even import.

@flx42
Member

flx42 commented Jun 20, 2017

@ljstrnadiii is it during a docker build or a docker run?
During a docker build, you can't use GPUs (nvidia-docker does nothing), but you can compile code against libcuda.so by using the stubs from the CUDA toolkit in /usr/local/cuda/lib64/stubs/
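As a sketch of that build-time pattern, a stage that links a hypothetical driver-API program against the stub (the source file and output names are placeholders):

```dockerfile
FROM nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04

# Hypothetical C source that calls the driver API (cuInit, cuDeviceGet, ...)
COPY query_device.c /src/

# The stub satisfies -lcuda at link time during `docker build`;
# the real libcuda.so.1 is only mounted in by nvidia-docker at run time.
RUN gcc /src/query_device.c -o /usr/local/bin/query_device \
        -I/usr/local/cuda/include \
        -L/usr/local/cuda/lib64/stubs -lcuda
```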

@ljstrnadiii

ljstrnadiii commented Jun 20, 2017

@flx42,
During a docker run. For now, I am working inside the docker image until I debug everything. When I removed a WORKDIR from the dockerfile and rebuilt, the file was suddenly found here:

/usr/local/nvidia/lib64/libcuda.so.1

After exiting the GCP server and ssh'ing back in, I ran the same container again, and suddenly nvidia-smi does not even work and libcuda.so.1 is nowhere to be found.

I am pretty confused. I wish there was tighter integration between nvidia and tensorflow.

I really just want to be able to build an image to run tf apps.

EDIT: I guess I should start by calling nvidia-docker...

@flx42
Member

flx42 commented Jun 20, 2017

Yes, you need to use nvidia-docker run.
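For example (the image tag and command are illustrative; this requires a host with the NVIDIA driver installed):

```shell
# Plain docker run: the driver is not mounted, so libcuda.so.1 and
# nvidia-smi are missing inside the container and this fails.
docker run --rm nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 nvidia-smi

# nvidia-docker run mounts the host driver files (libcuda.so.1, nvidia-smi)
# under /usr/local/nvidia, so the same command works.
nvidia-docker run --rm nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04 nvidia-smi
```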
