This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

cmake fails unable to find cuda library while building an image #1033

Closed
dittothat opened this issue Jul 31, 2019 · 7 comments

Comments


dittothat commented Jul 31, 2019

1. Issue or feature description

I have created a Dockerfile to containerize some medical image processing code. With nvidia-docker2 I was able to use the file to build an image without issue. When I attempt to build that image on a different machine with the latest Docker (19.03) and the latest nvidia-docker, it fails on step 8/8 when cmake cannot find CUDA_CUDA_LIBRARY. When I run the step 7/8 image in a bash shell, I can copy and paste the cmake command (line 86 of the Dockerfile) that failed during the build, and it configures and then compiles fine in the image. My conception of Docker containers and images is being strained by this issue: I don't understand why a RUN command could fail while the same command run in the image works.

Debugging a bit in a container started with "docker run --gpus all -it fetalrecon /bin/bash", cmake finds CUDA_CUDA_LIBRARY at /usr/lib/x86_64-linux-gnu/libcuda.so. But when I hard-code that location in the Dockerfile's cmake command (i.e. I use the commented line 87 in the Dockerfile linked above), cmake gives the error "No rule to make target '/usr/lib/x86_64-linux-gnu/libcuda.so', needed by '../bin/SVRreconstructionGPU'.", which makes me believe that the library doesn't actually exist in the "build image".
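For reference, the hard-coded variant would presumably look something like this (a hypothetical reconstruction of the Dockerfile's line 87; CUDA_CUDA_LIBRARY is the cache variable from CMake's FindCUDA module, set explicitly instead of being auto-detected):

# Hypothetical reconstruction of the commented line 87: pin the driver library
# path instead of letting FindCUDA locate it.
cmake -DCUDA_SDK_ROOT_DIR:PATH=/usr/local/cuda-9.1/samples \
      -DCUDA_CUDA_LIBRARY=/usr/lib/x86_64-linux-gnu/libcuda.so ..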

2. Steps to reproduce the issue

git clone git@github.com:dittothat/dockerfetalrecon.git
cd dockerfetalrecon
docker build -t fetalrecon .

This fails when cmake cannot find the CUDA libraries needed to compile. Next, comment out line 86 and uncomment line 87 in the Dockerfile and rebuild:

docker build -t fetalrecon .

This fails when the library really cannot be found. Now start a container from the image created by step 7/8 of the build:

docker run --gpus all -it fetalrecon /bin/bash

Then, in the container:

cd /usr/src/fetalReconstruction/source/build
cmake -DCUDA_SDK_ROOT_DIR:PATH=/usr/local/cuda-9.1/samples ..
make

Everything compiles just fine (though sometimes I must run make a second time to get past a linking error with niftiio toward the end; I'm still trying to figure out what is going on there).
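A minimal way to observe the discrepancy directly (a sketch; docker build uses the daemon's default runtime, so forcing --runtime=runc approximates the build-time environment):

# With the default runc runtime (what docker build uses), the driver library is absent:
docker run --rm --runtime=runc fetalrecon ls -l /usr/lib/x86_64-linux-gnu/libcuda.so
# With the NVIDIA runtime, the host driver's libcuda.so is injected at container start:
docker run --rm --gpus all fetalrecon ls -l /usr/lib/x86_64-linux-gnu/libcuda.so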

3. Information to attach (optional if deemed irrelevant)


kiendang commented Aug 3, 2019

To use the NVIDIA runtime with docker build, you need to make it the default runtime. Just put "default-runtime": "nvidia" in /etc/docker/daemon.json.

guptaNswati (Contributor) commented Aug 6, 2019

Yes, the library won't be present at build time unless you mount it inside the container. You can either do a docker run --gpus, do the rest of the build inside the container, and then do a docker commit, or use the -v option to mount it manually. Hope this helps. Closing now.
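A rough sketch of the commit-based workaround (container ID and image tag are placeholders):

# Build interactively with GPUs available, then snapshot the result as a new image.
docker run --gpus all -it fetalrecon /bin/bash
#   ...inside the container:
#   cd /usr/src/fetalReconstruction/source/build
#   cmake -DCUDA_SDK_ROOT_DIR:PATH=/usr/local/cuda-9.1/samples .. && make
docker ps -lq                                    # ID of the most recently created container
docker commit <container-id> fetalrecon:built    # placeholder tag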

dittothat (Author) commented Aug 9, 2019

To use the NVIDIA runtime with docker build, you need to make it the default runtime. Just put "default-runtime": "nvidia" in /etc/docker/daemon.json.

Ok, this makes sense. I tried adding the line you suggest to daemon.json, but Docker will not start with the modified config file. With the latest nvidia-docker working alongside Docker 19.03.1, the nvidia runtime doesn't appear to be registered (i.e. dockerd --default-runtime=nvidia returns "specified default runtime 'nvidia' does not exist"). I am cautious about relying on the documentation in the wiki given that it now spans three nvidia-docker versions. Is it necessary, and are there updated instructions, for registering the nvidia runtime with the latest nvidia-docker? I suppose editing daemon.json as described may no longer be the accepted method for configuring the default runtime during docker build.

Yes, the library won't be present at build time unless you mount it inside the container.

Can you give any more details about where to find the appropriate library to mount and compile against? Since the beauty of nvidia-docker is that it is host-driver agnostic to some extent, it seems to me that the CUDA libraries I mount should correspond to the CUDA version in the specific nvidia-docker image I have selected. Perhaps I am wrong.

docker run --gpus and do the rest of the build inside the container and then do a docker commit

This seems to deviate wildly from Docker best practices. I know it will work, but I would love to get docker build and a Dockerfile working properly for my use case. That the CUDA libraries are not mounted during build seems like a problem to me.


kiendang commented Aug 9, 2019

Ok, this makes sense. I tried adding the line you suggest to daemon.json, but Docker will not start with the modified config file. With the latest nvidia-docker working alongside Docker 19.03.1, the nvidia runtime doesn't appear to be registered (i.e. dockerd --default-runtime=nvidia returns "specified default runtime 'nvidia' does not exist"). I am cautious about relying on the documentation in the wiki given that it now spans three nvidia-docker versions. Is it necessary, and are there updated instructions, for registering the nvidia runtime with the latest nvidia-docker? I suppose editing daemon.json as described may no longer be the accepted method for configuring the default runtime during docker build.

You have to install the nvidia-container-runtime package on your host, then put this in /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
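One gotcha, confirmed further down this thread: the daemon has to be restarted before the change takes effect. A quick way to apply and verify (assuming a systemd host):

sudo systemctl restart docker
docker info | grep -i runtime    # should list nvidia and report it as the default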

dittothat (Author) commented

That's excellent, it worked nicely. Thank you very much for your help.


reconlabs-sergio commented Sep 1, 2023

For the life of me, I cannot get this approach to work. Is there something that overrides the default runtime? Is there a way to debug which runtime is getting used?
I'm starting FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel and I need CUDA available to compile a few libraries in the build stage. If I run the container, it's there, but CUDA is not available during the build stage.
Any clues?
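One way to check which runtime the daemon is actually defaulting to (a sketch; docker info exposes this field via its Go-template output):

docker info --format '{{.DefaultRuntime}}'    # prints e.g. "runc" or "nvidia"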


reconlabs-sergio commented Sep 1, 2023

I didn't know I should restart the daemon for the changes to take effect.

After modifying daemon.json, sudo systemctl restart docker did the trick.
