Issue on volume creation #133

Closed
guilhermehartmann opened this issue Jul 8, 2016 · 15 comments

@guilhermehartmann

While creating the volume I get this error; it seems to be caused by a hard link crossing two different partitions. Is this a known issue?

sudo nvidia-docker volume setup
nvidia-docker-plugin | 2016/07/08 02:12:39 Received remove request for volume 'nvidia_driver_367.27'

nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker-plugin | 2016/07/08 02:12:52 Received create request for volume 'nvidia_driver_367.27'
nvidia-docker-plugin | 2016/07/08 02:12:52 Error: link /usr/bin/nvidia-cuda-mps-control /var/lib/nvidia-docker/volumes/nvidia_driver/367.27/bin/nvidia-cuda-mps-control: invalid cross-device link

@flx42
Member

flx42 commented Jul 8, 2016

First of all, you shouldn't use volume setup; we removed this command from our latest version. You should use nvidia-docker-plugin (started automatically if you install nvidia-docker using the deb or the rpm).

And yes, it's a known limitation:
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker-plugin#known-limitations
You can use the -d option of nvidia-docker-plugin to change the path for the volume.
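
For example, if you were starting the plugin by hand (the target path here is only an illustration; pick a directory on the same partition as the driver files, e.g. under /usr):

sudo nvidia-docker-plugin -d /usr/local/nvidia-docker

If the plugin was installed from the deb or the rpm, it runs as a systemd service, so the flag has to go into the service configuration instead.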

@guilhermehartmann
Author

Thanks, I missed the limitations bit. I opted to use /usr/local/nvidia-docker as the default volume location.

@dpatschke

I am experiencing this problem as well because I have '/var' on a separate partition from '/usr', where the NVIDIA drivers are located. I would like to switch the default volume location to a folder under '/usr', as the workaround suggests. However, I cannot, for the life of me, figure out how to accomplish this using nvidia-docker-plugin -d.

I am running:
sudo nvidia-docker-plugin -d /usr/local/nvidia-driver

and the change appears to be taking effect:

nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 20:00:49 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 20:00:49 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 20:00:50 Provisioning volumes at /usr/local/nvidia-driver
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 20:00:50 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use

but then I run this:

sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

and I still get this error

docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.

@flx42, would you be able to point me in the right direction? Or, @guilhermehartmann, how were you able to use /usr/local/nvidia-driver as your default volume?

Apologies as I am new to Docker and it seems I have jumped into the deep end of the pool :-).

Thanks!

@flx42
Member

flx42 commented Jul 21, 2016

@dpatschke Look at your log after running nvidia-docker-plugin -d [...]; it failed:

nvidia-docker-plugin | 2016/07/20 20:00:50 Error: listen tcp 127.0.0.1:3476: bind: address already in use

This is because the nvidia-docker service is still running, so you're still using the other instance of the plugin, the one started without -d. You should try to modify your service configuration file directly. Which OS are you on?

@dpatschke

Thank you for your response, @flx42.

I am running Ubuntu 16.04. I would love to be able to modify some configuration file and restart docker or the nvidia-docker-plugin or whatever, but I have been scouring the web and message boards for hours and can't seem to find what I am looking for.

Would you be able to point me to the correct config file to modify? Also, I have no idea how nvidia-docker-plugin is running in the first place. Is the plugin launched when the docker service is started? How do I stop the current plugin and 'restart' one with the 'd' option?

Thank you very much for your help!!

David

@flx42
Member

flx42 commented Jul 21, 2016

Something like this:

# systemctl edit nvidia-docker

[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR -d /usr/local/nvidia-driver
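
If the override doesn't seem to take effect, reload systemd and restart the service (a sketch; the unit name assumes the packaged nvidia-docker service):

sudo systemctl daemon-reload
sudo systemctl restart nvidia-docker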

@dpatschke

dpatschke commented Jul 21, 2016

Thank you @flx42 ... unfortunately, I could not get the problem resolved.

I executed the 'edit' command as you suggested and created the file with what you had listed. The 'nano' editor wanted to save it as 'override.conf' with a bunch of additional characters at the end.

I ended up saving the file as /etc/systemd/system/nvidia-docker.service.d/override.conf.

I then restarted the systemd service:
sudo systemctl restart nvidia-docker

I am still getting the old folder when I issue the command:
sudo nvidia-docker-plugin

Here is the output:

nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA unified memory
nvidia-docker-plugin | 2016/07/20 23:02:53 Loading NVIDIA management library
nvidia-docker-plugin | 2016/07/20 23:02:53 Discovering GPU devices
nvidia-docker-plugin | 2016/07/20 23:02:53 Provisioning volumes at /var/lib/nvidia-docker/volumes
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving plugin API at /run/docker/plugins
nvidia-docker-plugin | 2016/07/20 23:02:53 Serving remote API at localhost:3476
nvidia-docker-plugin | 2016/07/20 23:02:53 Error: listen tcp 127.0.0.1:3476: bind: address already in use

When I issue the command:
sudo systemctl edit nvidia-docker
I am seeing the new file I created.

Now, when I issue the following command, though:
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

I get the following error:
docker: Error response from daemon: create nvidia_driver_367.35: create nvidia_driver_367.35: Error looking up volume plugin nvidia-docker: plugin not found.

@flx42
Member

flx42 commented Jul 21, 2016

Don't try to start nvidia-docker-plugin manually; it's handled by systemd.
Try restarting the docker service too.
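
Something like this (assuming the systemd units installed by the packages):

sudo systemctl restart nvidia-docker
sudo systemctl restart docker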

@dpatschke

dpatschke commented Jul 21, 2016

Yeah ... did a sudo service docker restart and I'm still getting the same result - 'plugin not found'. Restarted the entire system ... same result.

When I do a sudo nvidia-docker volume ls it is completely empty. I seem to remember reading somewhere that there should be something present.

I am also still getting the 'address already in use' error.

I don't know where things went wrong but any other suggestions/recommendations would be greatly appreciated.

David

@flx42
Member

flx42 commented Jul 21, 2016

@dpatschke: give me the output of

journalctl -n -u nvidia-docker

@dpatschke

Looks like I didn't have the nvidia-docker service started last time. Started it up again, but it was still erroring out.

Here is the output from your recommended command:

 Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Starting NVIDIA Docker plugin...
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA unified memory
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Loading NVIDIA management library
Jul 21 00:06:04 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:04 Discovering GPU devices
Jul 21 00:06:04 Precision-Tower-7910 systemd[1]: Started NVIDIA Docker plugin.
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Provisioning volumes at /usr/local/nvidia-driver
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving plugin API at /var/lib/nvidia-docker
Jul 21 00:06:05 Precision-Tower-7910 nvidia-docker-plugin[4360]: /usr/bin/nvidia-docker-plugin | 2016/07/21 00:06:05 Serving remote API at localhost:3476

This looks good, doesn't it? Still getting this error, though, when actually trying to launch nvidia-docker:
docker: Error response from daemon: create nvidia_driver_367.35: VolumeDriver.Create: internal error, check logs for details.

@flx42
Member

flx42 commented Jul 21, 2016

@dpatschke yes it looks good.
At this point, I would advise you to simply purge nvidia-docker the hard way:

apt-get purge nvidia-docker
rm -rf /var/lib/nvidia-docker

Then restart docker, reinstall nvidia-docker from the deb, edit the systemd configuration file again, reboot.

If you still have the problem after that, please file a new bug with the new output of journalctl -n -u nvidia-docker.
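
A rough sketch of the whole sequence (the .deb filename is hypothetical; use whichever package you downloaded):

sudo apt-get purge nvidia-docker
sudo rm -rf /var/lib/nvidia-docker
sudo systemctl restart docker
sudo dpkg -i nvidia-docker_<version>_amd64.deb   # hypothetical filename
sudo systemctl edit nvidia-docker                # re-add the ExecStart override with -d
sudo reboot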

@dpatschke

OK, will do it again ... thank you so much for your help and guidance!

@Mr-Grieves

Not sure if anyone will find this useful, but there was one last step I had to do to get this working:

Ensure that the directory specified by the -d option in the systemd config file exists and is owned by nvidia-docker:

mkdir /usr/local/nvidia-driver
chown -hR nvidia-docker /usr/local/nvidia-driver
chgrp nvidia-docker /usr/local/nvidia-driver
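
To double-check, the ownership can be verified before restarting the service (a quick sanity check, not an official step):

ls -ld /usr/local/nvidia-driver
# owner and group should now both show nvidia-docker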

@qiaohaijun

This solution helped me a lot.

I am using CentOS 7.2 with 4 x K40c GPUs.
