
live-restore does not work on latest version 18.09.1 #556

Closed
2 of 3 tasks
BSWANG opened this issue Jan 14, 2019 · 20 comments · Fixed by docker/docker-ce-packaging#297

Comments

@BSWANG

BSWANG commented Jan 14, 2019

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

Docker does not stop containers when the docker daemon is restarted while "live-restore" is enabled.

Actual behavior

Docker stops the containers when the daemon is restarted.

[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker run -itd --name test-live-restore busybox
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
57c14dd66db0: Pull complete
Digest: sha256:7964ad52e396a6e045c39b5a44438424ac52e12e4d5a25d94895f2058cb863a0
Status: Downloaded newer image for busybox:latest
5c5530032d8327c96eb5db163830ee08b7853055d3dfc5c5fb7277da3dc2df91
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
5c5530032d83        busybox             "sh"                20 seconds ago      Up 19 seconds                           test-live-restore
[root@iZbp10z0xdiqguldb5kg9vZ ~]# systemctl restart docker
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                        PORTS               NAMES
5c5530032d83        busybox             "sh"                48 seconds ago      Exited (255) 15 seconds ago                       test-live-restore
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker info | grep live-restore
WARNING: bridge-nf-call-ip6tables is disabled
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker info | grep -i  live
WARNING: bridge-nf-call-ip6tables is disabled
Live Restore Enabled: true

Steps to reproduce the behavior

  1. Enable live-restore in the docker daemon configuration (see the snippet below)
  2. Create and start a test container
  3. Restart the docker daemon with systemctl restart docker
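
For reference, a minimal /etc/docker/daemon.json that enables live-restore looks like this (a sketch; it assumes no other daemon options are configured):

    {"live-restore": true}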

Output of docker version:

Client:
 Version:           18.09.1
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        4c52b90
 Built:             Wed Jan  9 19:35:01 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       4c52b90
  Built:            Wed Jan  9 19:06:30 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 11
Server Version: 18.09.1
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 96ec2177ae841256168fcf76954f7177af9446eb
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.639GiB
Name: iZbp10z0xdiqguldb5kg9vZ
ID: 7VBE:VOCD:XSDU:FDNH:P6IB:EKL7:3J5U:CCDW:UULW:R2EF:S7ER:EENA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://pqbap4ya.mirror.aliyuncs.com/
Live Restore Enabled: true
Product License: Community Engine

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.)
AlibabaCloud

Actually, the container's process is not stopped, but dockerd/containerd cannot take it over.


Then I created a container after the restart. That container's containerd-shim arguments are not the same as the first container's.

root     22306     1  0 09:12 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/5c5530032d8327c96eb5db163830ee08b7853055d3dfc5c5fb7277da3dc2df91 -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup
root     22782 10917  0 09:21 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/83ee16c7ed3494d7c2850391983bcc14b053e3ea0545589700ec3c226fda282d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup

The workdirs of the two containerd-shim processes differ:
/var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/...
/var/lib/containerd/io.containerd.runtime.v1.linux/...

@thaJeztah
Member

ping @crosbymichael @seemethere - I think this might be related to socket activation (as I recall, live-restore was the original reason for removing it from the RPM packages).

/cc @andrewhsu

@thaJeztah
Member

thaJeztah commented Jan 14, 2019

Actually ... wondering if it's a race condition between the containerd service starting and the dockerd daemon starting containerd as a child process; I had another issue mentioning that 🤔

@BSWANG
Author

BSWANG commented Jan 14, 2019

@thaJeztah Thanks. I just tested manually starting the containerd service before the dockerd service, and the issue does not happen.
It does appear to be a race condition.

@crosbymichael

I think this happens now because the docker service depends on the containerd service or has an After= on the containerd service.

@seemethere

I think that adding an After=containerd.service would solve this use case, since the docker service would then only start after containerd has successfully started.
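
Roughly, that would be a unit-file change along these lines (just a sketch of the idea, not the exact packaging change):

    [Unit]
    After=containerd.service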

@corbin-coleman

This PR should fix the issue: docker/docker-ce-packaging#290

It's going into the master branch first, so it'll appear in the next nightly release after the PR is merged in.

@buck2202

This doesn't seem to reliably fix the issue for me. I added containerd.service to the After= line in docker.service, but I'm still intermittently seeing the behavior reported above. Docker reports that the container has exited with code 255, but (some) processes formerly associated with the container are still running. When this happens to me, the entrypoint (a bash wrapper to an executable that sets paths and such) actually does die in the service restart, but the executable that it calls is still running.

I'm using the Google Cloud Ubuntu Xenial image, with Docker version 18.09.1, build 4c52b90.

@thaJeztah
Member

thaJeztah commented Jan 25, 2019

It's not yet in the 18.09.1 packages (I opened a backport in docker/docker-ce-packaging#294).

However, you can use an override file to add the change without modifying the original docker.service (which you should never do):

The easiest way to create an override file is with systemctl edit:

  1. Run systemctl edit docker.service. This creates an empty override file, and opens it in your editor;

    systemctl edit docker.service
  2. Edit the file to specify the new After (the After option can be specified multiple times, so there's no need to include what's already in the main docker.service file). The file should look something like this:

    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
  3. Reload the systemd configuration with systemctl daemon-reload

  4. Verify that the changes look OK using systemctl cat docker.service. This shows the contents of both the main docker.service unit and any override files that exist. In the output below, you can see that both files are loaded and that override.conf appends containerd.service to After=:

    # /lib/systemd/system/docker.service
    [Unit]
    ...
    BindsTo=containerd.service
    After=network-online.target firewalld.service
    
    ...
    
    
    # /etc/systemd/system/docker.service.d/override.conf
    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
  5. To confirm that the effective After looks like you want, use systemctl show; you can filter to just the property you're interested in (After in this case):

    systemctl show --property=After docker.service
    
    After=network-online.target firewalld.service system.slice docker.socket systemd-journald.socket basic.target sysinit.target containerd.service
  6. Restart the docker service to make the new change active: systemctl restart docker.service
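
If you'd rather not go through an interactive editor, a rough non-interactive equivalent is (a sketch, assuming the standard override location /etc/systemd/system/docker.service.d/override.conf):

    # Create the drop-in directory and the override file, then reload systemd and restart docker
    mkdir -p /etc/systemd/system/docker.service.d
    printf '[Unit]\nAfter=containerd.service\n' > /etc/systemd/system/docker.service.d/override.conf
    systemctl daemon-reload
    systemctl restart docker.service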

@buck2202

Ok, I did modify the docker.service file directly to match the referenced PR above, and held updates to the docker-ce package in dpkg to prevent any conflicts while I waited for the backport. To be clear, it does seem to be generally better, but once (out of tens of service restarts across multiple hosts) I saw it lose track of all containers again when the host system was under high load.

It seemed possible to me that some less-likely race condition might still exist, but I will change my approach and keep an eye on it.

@buck2202

Just confirming that I do still see this issue happen sporadically after restoring the original docker.service and creating the override as you described.

@thaJeztah
Member

Ok, so there's one more option to prevent any possible race condition.

The daemon has a --containerd option that allows setting the path of the containerd socket. If that option is set (either as a flag or through the daemon.json configuration file), the daemon disables its monitoring of containerd and, in cases where containerd is not present, "fails" instead of trying to start a containerd instance as a child process.

So, there are two approaches to configure this. Generally, the daemon.json approach is recommended, as it will also be used when manually starting the dockerd process, whereas the systemd approach only takes effect if the daemon is started through systemd.

You cannot use both methods - you need to pick one. Setting the option both as a daemon flag (--containerd) and in the daemon.json configuration file will result in a "conflicting options" error, and the daemon will refuse to start.

Option 1 - using the daemon.json configuration file:

  1. Create a /etc/docker/daemon.json file if it does not yet exist

  2. Edit the configuration file, and add the "containerd" setting

    The file must be valid JSON; if you have no other configurations set in your daemon.json, the file will look like this:

    {"containerd": "/run/containerd/containerd.sock"}
  3. Restart the docker service

    systemctl restart docker.service
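
As a quick sanity check (assuming the default socket path used above), you can confirm that the system containerd socket actually exists before relying on it:

    # Succeeds only if a socket exists at the given path
    test -S /run/containerd/containerd.sock && echo "containerd socket present"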

Option 2 - using the systemd override file:

  1. Run systemctl edit docker.service. This opens the override file in your editor;

  2. Edit the file to specify the new ExecStart under [Service]. To override the existing ExecStart, the original one first has to be "reset" (otherwise another ExecStart is added). If you made the changes from my previous comment, the file should look something like this:

    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
    [Service]
    # Disable managing containerd, and require the containerd socket
    # to be present at the given location
    
    # First reset the existing ExecStart
    ExecStart=
    
    # Then set the new ExecStart
    ExecStart=/usr/bin/dockerd -H fd://  --containerd=/run/containerd/containerd.sock
    
  3. Reload the systemd configuration with systemctl daemon-reload

  4. Confirm that the configuration looks as expected

    systemctl cat docker.service | grep ExecStart=
    
    ExecStart=/usr/bin/dockerd -H fd://
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd://  --containerd=/run/containerd/containerd.sock
  5. Restart the docker service to make the new change active: systemctl restart docker.service
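
To verify end to end that containers now survive a daemon restart, something along these lines should work (a sketch; the container name live-restore-test is just an example):

    docker run -d --name live-restore-test busybox sleep 3600
    systemctl restart docker.service
    # The container should still be listed as "Up"
    docker ps --filter name=live-restore-test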

@buck2202

buck2202 commented Feb 1, 2019

I pushed these changes to 20+ VMs through the systemd override file, and so far (~48h), I've had no cases of containers getting lost by docker during service restarts. The After= addition decreased the frequency of the issue, but was not sufficient on its own in my case.

@thaJeztah
Member

Thanks @buck2202

@seemethere ^^ looks like we may want to consider using the --containerd option instead.

@buck2202

buck2202 commented Feb 9, 2019

Just checking back in: there were zero issues in ~1.5 weeks of continuous use across ~30 VMs with both After= and ExecStart= configured in the systemd override.

Daemon restarts would have been occurring 1-2 times/hour on each VM, and I had previously set up scripts to log all instances where docker reported a container had exited but the container's executable was still active.
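
Conceptually, the check those scripts run is something like this (a rough sketch; the names my-container and my-app are placeholders, not my actual setup):

    # Log when docker reports the container as not running while its workload process is still alive
    if [ "$(docker inspect -f '{{.State.Running}}' my-container 2>/dev/null)" = "false" ] && pgrep -f my-app > /dev/null; then
        echo "$(date): docker lost track of my-container but my-app is still running" >> /var/log/live-restore-watch.log
    fi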

@thaJeztah
Member

Thanks for testing!

I opened docker/docker-ce-packaging#297 and docker/docker-ce-packaging#298 for consideration

@remoteweb

remoteweb commented Jun 10, 2019

Sorry if this has been covered by @buck2202 already; however, I am testing Docker 18.09.2 with an interactive bash container, and when I run systemctl restart docker, bash exits with an error message*, even though docker ps reports that the container never exited.

I have applied the patch per @thaJeztah's instructions**

We do have a rare and random issue in our production environment where the docker unix socket hangs and our orchestrator receives UnixSocket timeouts, so a docker daemon restart is required. However, we need all of our containers to keep running during this procedure.

Thanks

*ERRO[0023] error waiting for container: unexpected EOF
**#556 (comment)

@thaJeztah
Member

Actually, I'm not sure if interactive containers are supported during live-restore; I seem to recall that wasn't supported, but I don't see a mention in the docs (https://docs.docker.com/config/containers/live-restore/). Did that work with older versions?

@crosbymichael might know off the top of his head; that looks to be a separate issue though

@remoteweb

@thaJeztah, I realised the container (upon docker restart) only exits when running an interactive bash/sh command (docker exec -it).

So all main processes are still running after the restart, and there is no issue related to this ticket (my bad for this).

However, our main problem is still there: even a docker daemon restart doesn't help with the hanging API client, which is caused by a hanging container that, even after a daemon restart, fails to respond to anything (stop, kill, etc.); a very buggy situation. I'll try to find a more suitable ticket or create a new one.

@arpit-sardhana

arpit-sardhana commented Mar 24, 2020

The issue still exists; I am testing with Docker 18.09.1 on CentOS 7.7. Even after setting --containerd in docker.service as well as putting After=containerd.service in the systemd unit file, live restore is not working as expected. Is there any expected caveat? Also, in which point release is this issue fixed?

@ovidiucp

This issue is still present in the latest CE version:

Docker version 19.03.13, build 4484c46d9d

@seemethere should I open a new bug report for this, or can you reopen this one?
