This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

building images with nvidia-docker #595

Closed
cobie8a opened this issue Jan 5, 2018 · 15 comments

@cobie8a

cobie8a commented Jan 5, 2018

1. Building images with nvidia-docker

2. In the past, I was able to just call nvidia-docker build. With version 2 requiring the '--runtime' flag, docker build does not recognize it (although it works just fine as 'docker run --runtime'). I have not seen anything in the documentation about building with nvidia-docker version 2. Please advise.

@cobie8a
Author

cobie8a commented Jan 5, 2018

I've followed the instructions [https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup] to register the runtime but still cannot set the default runtime to 'nvidia'. If I stop docker.service and run 'sudo dockerd --default-runtime=nvidia &', the default runtime is set to 'nvidia', but when I try to restart the service, it fails.

Please help!

@flx42
Member

flx42 commented Jan 5, 2018

Do you need GPU support during docker build? If not, you can just use docker build.

With version 1.0, nvidia-docker build was not doing anything special.

@cobie8a
Author

cobie8a commented Jan 6, 2018

Sounds good, I'll give it a try. Primarily, I thought nvidia-docker provided GPU passthrough support for building Caffe with GPU, which I do see in the build logs.

@flx42
Member

flx42 commented Jan 6, 2018

You don't need to have a GPU machine to build a GPU project. The compiler (nvcc) doesn't need to run GPU code; it only needs to know which GPU families you will target.
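For illustration (a sketch, not taken from this thread; the source file name is hypothetical), the target GPU families are passed to nvcc as compile flags, so no GPU needs to be present at build time:

nvcc -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_70,code=sm_70 \
     -c kernels.cu -o kernels.o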

@xkszltl

xkszltl commented Mar 19, 2018

Actually I think it's really important to have runtime support for docker build.

The reason is testing:
if we want to run unit tests after compiling GPU-related tools, we'll have to get GPU access somehow.

@RuRo

RuRo commented Feb 27, 2019

This is really quite important. Many tools require the presence of hardware to be configured correctly.
Please either fix this or provide a workaround for building with tools that refuse to compile without libcuda, etc.

@RenaudWasTaken
Contributor

Set the default runtime to NVIDIA

@RuRo

RuRo commented Feb 27, 2019

Set the default runtime to NVIDIA

I don't have access to /etc/docker/daemon.json on the system. I am assuming there is no 'per-user' default for this, since it's a daemon setting. Am I missing something?

@icolwell-as

icolwell-as commented Aug 8, 2019

I ran into this same issue trying to compile something that uses tensorflow in a xenial-based image. tensorflow was complaining:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

I was able to get my docker builds to work by setting the default runtime as @RenaudWasTaken suggested. I didn't really know how to do this until I googled around and figured it out. Perhaps this may help others:

  1. Edit/create /etc/docker/daemon.json with the content below:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
  2. Install the nvidia-container-runtime package. I had followed the instructions here, but it seems nvidia-container-runtime isn't installed by default.
sudo apt-get install nvidia-container-runtime
  3. sudo systemctl restart docker.service
  4. Try your docker build again.

Related Links:
https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup
https://docs.nvidia.com/dgx/nvidia-container-runtime-upgrade/index.html#using-nv-container-runtime
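As a quick sanity check (a minimal sketch; exact output varies by Docker version), you can confirm the default runtime before retrying the build:

docker info | grep -i runtime
# expected to include something like:
# Runtimes: nvidia runc
# Default Runtime: nvidia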

@RenaudWasTaken
Contributor

Another solution, if your docker build is just doing compilation, is to use the stubs in /usr/local/cuda/lib64/stubs/.

@dancingpipi

(quoting @icolwell-as's default-runtime instructions above)

mark!

@RenaudWasTaken
Contributor

RenaudWasTaken commented Oct 21, 2019

@z13974509906 the recommended path is to build CUDA code during docker build time and run CUDA code during docker run time :)

You wouldn't need libcuda.so in that case and can use the stubs at build time.

@kevindoran

To build using the stubs, you need to make the stubs path known to the linker. One option is to add the path to the LIBRARY_PATH environment variable. (LD_LIBRARY_PATH is for runtime linking, whereas LIBRARY_PATH is used for compile-time linking.) Example:

ENV LIBRARY_PATH $LIBRARY_PATH:/usr/local/cuda/lib64/stubs
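Putting it together, a minimal Dockerfile sketch (assuming a CUDA devel base image and a hypothetical make-based build; the real libcuda.so.1 is injected by the NVIDIA runtime at docker run time, not baked into the image):

FROM nvidia/cuda:10.0-devel-ubuntu16.04
# Expose the driver stub to the compile-time linker only.
ENV LIBRARY_PATH $LIBRARY_PATH:/usr/local/cuda/lib64/stubs
COPY . /src
WORKDIR /src
# Hypothetical build step that links against -lcuda through the stub.
RUN make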

@wuyuanyi135

(quoting @icolwell-as's default-runtime instructions above)

This solved my problem where torch.cuda.is_available() returns False.

@xkszltl

xkszltl commented Nov 16, 2020

@icolwell-as

  3. sudo systemctl restart docker.service
  4. Try your docker build again.

You don't need to restart the daemon; sudo killall -s HUP dockerd is usually enough.
Despite how it sounds, it won't kill anything:
it sends SIGHUP to dockerd, and the signal handler reloads the config JSON.
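For example (an equivalent sketch, if killall isn't available on the host):

sudo kill -HUP "$(pidof dockerd)"   # dockerd re-reads /etc/docker/daemon.json on SIGHUP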
