This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

docker daemon fails to start with Nvidia runtime container #761

Closed
1 task
ghost opened this issue Jun 9, 2018 · 15 comments

@ghost

ghost commented Jun 9, 2018

1. Issue or feature description

My setup was working fine, and suddenly docker stopped working.

2. Steps to reproduce the issue

I just ran "sudo apt-get update".
My files now look like this:
dtlu@dtlu16:~$ sudo cat /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
dtlu@dtlu16:~$ sudo cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
{
    “dns”: [“172.20.130.181”]
}

3. Information to attach (optional if deemed irrelevant)

  • Kernel version from uname -a
    dtlu@dtlu16:~$ uname -a
    Linux dtlu16 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
    dtlu@dtlu16:~$

dtlu@dtlu16:~$ systemctl daemon-reload
==== AUTHENTICATING FOR org.freedesktop.systemd1.reload-daemon ===
Authentication is required to reload the systemd state.
Authenticating as: dtlu,,, (dtlu)
Password:
==== AUTHENTICATION COMPLETE ===
dtlu@dtlu16:~$ sudo service docker status
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─override.conf
Active: inactive (dead) (Result: exit-code) since Sat 2018-06-09 01:16:10 PDT; 21min ago
Docs: https://docs.docker.com
Main PID: 2299 (code=exited, status=1/FAILURE)

Jun 09 01:16:10 dtlu16 systemd[1]: Failed to start Docker Application Container Engine.
Jun 09 01:16:10 dtlu16 systemd[1]: docker.service: Unit entered failed state.
Jun 09 01:16:10 dtlu16 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 09 01:16:10 dtlu16 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 09 01:16:10 dtlu16 systemd[1]: Stopped Docker Application Container Engine.
Jun 09 01:16:10 dtlu16 systemd[1]: docker.service: Start request repeated too quickly.
Jun 09 01:16:10 dtlu16 systemd[1]: Failed to start Docker Application Container Engine.
dtlu@dtlu16:~$

sudo docker version
Client:
Version: 18.03.1-ce
API version: 1.37
Go version: go1.9.5
Git commit: 9ee9f40
Built: Thu Apr 26 07:17:20 2018
OS/Arch: linux/amd64
Experimental: false
Orchestrator: swarm
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
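
The daemon.json pasted above contains two separate top-level objects, and the "dns" entry uses typographic quotes. If the file on disk really looks like that (the quotes could also be an artifact of pasting into the issue), it is not valid JSON, and a configuration file that fails to parse is one plausible reason for dockerd to exit at startup. Below is a minimal sketch for checking the file's text. The name check_daemon_json and the sample string are hypothetical and not part of Docker or nvidia-docker.

```python
import json

# Typographic quotes that editors sometimes substitute for plain ASCII quotes.
CURLY_QUOTES = "\u201c\u201d\u2018\u2019"


def check_daemon_json(text):
    """Return a list of human-readable problems found in daemon.json text."""
    problems = []
    if any(ch in text for ch in CURLY_QUOTES):
        problems.append("typographic quotes found; JSON needs plain ASCII quotes")

    stripped = text.lstrip()
    try:
        # raw_decode parses the first JSON value and reports where it ended.
        _, end = json.JSONDecoder().raw_decode(stripped)
    except json.JSONDecodeError as err:
        problems.append(f"parse error: {err.msg} (line {err.lineno})")
    else:
        if stripped[end:].strip():
            problems.append("extra content after the first top-level JSON value")
    return problems


# Sample modelled on the file shown in the post above (assumed, not read from disk).
posted = """{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
{
    \u201cdns\u201d: [\u201c172.20.130.181\u201d]
}
"""

for problem in check_daemon_json(posted):
    print(problem)
```

If both settings are wanted, they would normally live as two keys ("runtimes" and "dns") inside a single top-level object, for which the function above returns an empty list.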

@johnny06

johnny06 commented Jun 9, 2018 via email

@flx42
Member

flx42 commented Jun 9, 2018

#749 (comment)
Did you manually edit the override.conf file?

@flx42
Member

flx42 commented Jun 9, 2018

Make sure to run systemctl daemon-reload and systemctl reload docker.

@artificialbrains

I had a similar issue today after having to reboot my computer.

uname -a
Linux pcvp19 4.4.0-124-generic #148-Ubuntu SMP Wed May 2 13:00:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

systemctl status nvidia-docker
Warning: nvidia-docker.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● nvidia-docker.service
Loaded: masked (/dev/null; bad)
Active: inactive (dead)
pcvp@pcvp19:/var$ sudo gvim /lib/systemd/system/docker.service
pcvp@pcvp19:/var$ systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─override.conf
Active: inactive (dead) (Result: exit-code) since Tue 2018-06-12 10:04:57 PDT; 12s ago
Docs: https://docs.docker.com
Process: 2910 ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime (code=exited, status=1/FAILURE)
Main PID: 2910 (code=exited, status=1/FAILURE)

Jun 12 10:04:57 pcvp19 systemd[1]: Failed to start Docker Application Container Engine.
Jun 12 10:04:57 pcvp19 systemd[1]: docker.service: Unit entered failed state.
Jun 12 10:04:57 pcvp19 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 12 10:04:57 pcvp19 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 12 10:04:57 pcvp19 systemd[1]: Stopped Docker Application Container Engine.
Jun 12 10:04:57 pcvp19 systemd[1]: docker.service: Start request repeated too quickly.
Jun 12 10:04:57 pcvp19 systemd[1]: Failed to start Docker Application Container Engine.

Running systemctl daemon-reload and systemctl reload docker produced the following message:
docker.service is not active, cannot reload.

@flx42
Member

flx42 commented Jun 12, 2018

Did you edit the override.conf file?

@artificialbrains

No.

I was not even aware of override.conf until looking at the error message. I assume the file got created as part of the nvidia-docker install?

Here are the contents of override.conf:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime

@flx42
Member

flx42 commented Jun 12, 2018

@artificialbrains this is weird. Did you set up the machine yourself?
What is the output of dpkg -S /etc/systemd/system/docker.service.d/override.conf?

@artificialbrains

I was given the machine a while ago, and have been maintaining it since.

Here is what I got from running the dpkg command:

sudo dpkg -S /etc/systemd/system/docker.service.d/override.conf
dpkg-query: no path found matching pattern /etc/systemd/system/docker.service.d/override.conf

However, the dpkg output may have been affected by the fact that I had just tried to remove nvidia-docker2 using sudo apt-get update and sudo apt-get remove nvidia-docker2. I want to uninstall and reinstall nvidia-docker2 to see if that fixes the problem.

@artificialbrains

The following actions fixed the issue for me.

I uninstalled nvidia-docker using:
sudo apt-get update
sudo apt-get remove nvidia-docker2

Rebooted the computer.
Docker started to work again. Previously, docker did not start correctly with just the reboot alone.

Reinstalled nvidia-docker2:
sudo apt-get update
sudo apt-get install nvidia-docker2
Nvidia-docker is now operable.

@artificialbrains

Sadly, I still have the same issue with override.conf. I pushed my luck and rebooted my machine, only to find that docker fails to start because of override.conf.

If I remove override.conf, the docker and nvidia-docker services start correctly; however, nvidia-docker fails to open previously created docker containers.

As a temporary solution, I removed nvidia-docker2, rebooted my machine, and reinstalled nvidia-docker2. This will work till I have to reboot my machine again.

@flx42
Member

flx42 commented Jun 12, 2018

The override.conf file is not installed by nvidia-docker. It's likely that someone manually set up the machine this way. You can run systemctl edit docker and then remove the --add-runtime=... part.
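
A likely reason this helps: in the configuration posted at the top of the thread, the nvidia runtime is registered twice, once through the --add-runtime flag in override.conf and once under "runtimes" in /etc/docker/daemon.json, and dockerd is documented to refuse to start when the same directive is given both as a flag and in the configuration file. The sketch below detects that overlap and shows what the ExecStart line looks like once the flag is dropped. The helper names and the sample strings are hypothetical; it assumes a daemon.json that parses as a single JSON object.

```python
import json
import re

# Matches " --add-runtime=NAME=PATH" and " --add-runtime NAME=PATH".
ADD_RUNTIME = re.compile(r"\s+--add-runtime[=\s]+([^=\s]+)=\S+")


def runtimes_from_flags(exec_start):
    """Names of runtimes registered via --add-runtime on a dockerd command line."""
    return set(ADD_RUNTIME.findall(exec_start))


def runtimes_from_daemon_json(text):
    """Names of runtimes registered under "runtimes" in daemon.json text."""
    return set(json.loads(text).get("runtimes", {}))


def strip_add_runtime(exec_start):
    """Return the ExecStart line with every --add-runtime flag removed."""
    return ADD_RUNTIME.sub("", exec_start)


# Sample values copied from the override.conf and daemon.json quoted in this thread.
exec_start = (
    "ExecStart=/usr/bin/dockerd --host=fd:// "
    "--add-runtime=nvidia=/usr/bin/nvidia-container-runtime"
)
daemon_json = (
    '{"runtimes": {"nvidia": {"path": "/usr/bin/nvidia-container-runtime", '
    '"runtimeArgs": []}}}'
)

# Runtimes defined in both places are the ones that would conflict.
print(runtimes_from_flags(exec_start) & runtimes_from_daemon_json(daemon_json))
# -> {'nvidia'}
print(strip_add_runtime(exec_start))
# -> ExecStart=/usr/bin/dockerd --host=fd://
```

After editing the drop-in, the unit files still need to be reloaded with systemctl daemon-reload before docker is started again.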

@artificialbrains

Thanks!

That fixed the problem. I am now able to reboot the machine and nvidia-docker/docker both function properly.

@d2sys

d2sys commented Jun 19, 2018

I had the same problem and the fix proposed by @flx42 worked for me!
Thanks!

@flx42
Member

flx42 commented Jun 21, 2018

Closing this issue for now, but I'm surprised so many people are facing this issue. I'm wondering if another package conflicts with our settings.

@mmdzzh

mmdzzh commented Oct 6, 2023

Hello, I am running into this problem.

Closing this issue for now, but I'm surprised so many people are facing this issue. I'm wondering if another package conflicts with our settings.

Hello, I am hitting this problem now, but my override.conf does not contain the --add-runtime part. The error occurs when I run "sudo service restart docker".

My docker version is 24.0.6 and I installed nvidia-docker2.

I need your help, thank you.
