Docker containers fail to start after node restart #1062
@thaJeztah I saw you were active on the previous issue; maybe you can guide me.
Could it be a timing issue? Is the extra disk mounted before docker and containerd start?
I recall this PR in containerd that updated the systemd unit for a similar thing: containerd/containerd#3741. Not sure if that was backported to containerd 1.2.x, but you can add that to a systemd drop-in/override file.
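A minimal sketch of such a drop-in, assuming the data disk is mounted at /mnt (the file name and path are illustrative; `RequiresMountsFor=` both pulls in and orders the unit after the mount for the given path):

```
# /etc/systemd/system/containerd.service.d/wait-for-data-disk.conf
# Hypothetical override: don't start containerd until the data disk is mounted.
[Unit]
RequiresMountsFor=/mnt
```

Run `systemctl daemon-reload` afterwards so systemd picks up the override.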
We thought that it might be timing, so we added an ordering to the docker systemd unit so that it runs after our mount unit ({}.mount). Maybe I should add it to the containerd unit instead?
Update: I've added the After= to containerd's unit, but unfortunately this still happens.
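One way to double-check that the ordering actually took effect at boot is to ask systemd for the unit's startup chain:

```
systemd-analyze critical-chain containerd.service
```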
Then I run `systemctl start docker` and the daemon comes back alive, but only some of the containers are able to start; `docker start` for an affected container fails, forcing me to recreate them.
@thaJeztah any idea on anything else we can do to debug or fix this?
On mobile currently, but the error looks to me like you don't have /run in a tmpfs. We expect /run to be a tmpfs so these files can be cleaned up on reboot.
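A quick way to check that on the affected machines (`findmnt` ships with util-linux):

```
$ findmnt -n -o FSTYPE /run
tmpfs
```

Anything other than `tmpfs` here would point at that problem.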
I investigated the logs further, and this seems unrelated to the mount point. I think some other startup processes are heavy on memory, but containerd is the one getting killed. Maybe we can delay the start of the containerd service, or should we just adjust the OOM score instead to preserve containerd during startup? WDYT?
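If you want to try the score change before touching any units, a one-off test is to write the value directly for the running process (this does not persist across a service restart; -500 is just an example value):

```
echo -500 | sudo tee /proc/$(pidof containerd)/oom_score_adj
```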
Hmm.. interesting; I recall that containerd defaulted to having a negative OOM score, but checking here:

```
$ cat /proc/$(pidof containerd)/oom_score_adj
0

$ cat /proc/$(pidof dockerd)/oom_score_adj
-500
```

The containerd shims also have a score set:

```
$ for p in $(pidof containerd-shim); do cat /proc/$p/oom_score_adj; done
-999
-999
```
Looks like the default -500 score is applied to the containerd instance that dockerd starts and manages itself. To verify:

Stop the docker and containerd systemd services:

```
systemctl stop docker
systemctl stop containerd
```

Manually start the daemon:

```
dockerd --debug &
```

```
$ ps auxf | grep docker
root      172861  0.0  0.0    6432   736 pts/0   S+   09:55   0:00      \_ grep --color=auto docker
root      172701  0.1  1.0 1230796 82952 pts/1   Sl+  09:51   0:00  \_ dockerd --debug
root      172711  0.2  0.5 1039336 43500 ?       Ssl  09:51   0:00      \_ containerd --config /var/run/docker/containerd/containerd.toml --log-level debug
```

Check the config file that was written:

```
$ cat /var/run/docker/containerd/containerd.toml
root = "/var/lib/docker/containerd/daemon"
state = "/var/run/docker/containerd/daemon"
plugin_dir = ""
disabled_plugins = ["cri"]
oom_score = -500

[grpc]
  address = "/var/run/docker/containerd/containerd.sock"
  tcp_address = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  address = "/var/run/docker/containerd/containerd-debug.sock"
  uid = 0
  gid = 0
  level = "debug"

[metrics]
  address = ""
  grpc_histogram = false

[cgroup]
  path = ""

[plugins]
  [plugins.linux]
    shim = "containerd-shim"
    runtime = "runc"
    runtime_root = "/var/lib/docker/runc"
    no_shim = false
    shim_debug = true
```

And verified that the oom-score is applied:

```
$ cat /proc/$(pidof containerd)/oom_score_adj
-500
```
Opened containerd/containerd#4409, but that likely won't make its way into the 1.2.x or 1.3.x releases of containerd, so perhaps we should set this value in the containerd systemd unit.
@thaJeztah what would be the preferred hotfix at the moment?
You can add a systemd drop-in/override file to add it to containerd's service configuration, similar to docker/containerd-packaging#186.
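A sketch of what that drop-in could look like (the file name is illustrative, and the -999 here simply mirrors the shim score from earlier in the thread; pick whatever score fits your host):

```
# /etc/systemd/system/containerd.service.d/oom-score.conf
[Service]
OOMScoreAdjust=-999
```

Then reload and restart:

```
sudo systemctl daemon-reload
sudo systemctl restart containerd
```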
Expected behavior
Actual behavior
Steps to reproduce the behavior
This is something that happens for us almost every day.
The machines are shut down by an automated script; when they are started again in the morning, the docker service fails to start on some of them.
We start it again using systemctl, and the docker service then comes up, but many containers are unable to start.
One major difference from my other environments: this is running on an Azure VM that has a data disk. Due to performance issues, Azure recommended hosting docker on the data disk rather than the OS disk, so in daemon.json we set "data-root" to a path on the mounted data disk.
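For reference, the relevant daemon.json stanza looks like this (the path is illustrative; it is whatever directory you created on the data disk):

```
{
  "data-root": "/mnt/docker-data"
}
```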
fstab config:

```
UUID={REDACTED} /mnt ext4 defaults,nofail 1 2
```

journalctl shows logs similar to those in #597.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
Azure VM, Ubuntu 18.04