
live-restore does not work on latest version 18.09.1 #556

Closed
2 of 3 tasks
BSWANG opened this issue Jan 14, 2019 · 20 comments · Fixed by docker/docker-ce-packaging#297

Comments

@BSWANG

BSWANG commented Jan 14, 2019

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

Docker does not stop containers when the docker daemon is restarted while "live-restore" is enabled.

Actual behavior

Docker stops the containers when the daemon is restarted.

[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker run -itd --name test-live-restore busybox
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
57c14dd66db0: Pull complete
Digest: sha256:7964ad52e396a6e045c39b5a44438424ac52e12e4d5a25d94895f2058cb863a0
Status: Downloaded newer image for busybox:latest
5c5530032d8327c96eb5db163830ee08b7853055d3dfc5c5fb7277da3dc2df91
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
5c5530032d83        busybox             "sh"                20 seconds ago      Up 19 seconds                           test-live-restore
[root@iZbp10z0xdiqguldb5kg9vZ ~]# systemctl restart docker
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                        PORTS               NAMES
5c5530032d83        busybox             "sh"                48 seconds ago      Exited (255) 15 seconds ago                       test-live-restore
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker info | grep live-restore
WARNING: bridge-nf-call-ip6tables is disabled
[root@iZbp10z0xdiqguldb5kg9vZ ~]# docker info | grep -i  live
WARNING: bridge-nf-call-ip6tables is disabled
Live Restore Enabled: true

Steps to reproduce the behavior

  1. Enable live-restore in the docker daemon configuration (see the snippet below)
  2. Create and start a test container
  3. Restart the docker daemon with systemctl restart docker
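
For reference, a minimal /etc/docker/daemon.json that enables live-restore looks like this (a sketch; it assumes no other daemon options are configured):

    {"live-restore": true}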

Output of docker version:

Client:
 Version:           18.09.1
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        4c52b90
 Built:             Wed Jan  9 19:35:01 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       4c52b90
  Built:            Wed Jan  9 19:06:30 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 11
Server Version: 18.09.1
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 96ec2177ae841256168fcf76954f7177af9446eb
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.2.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.639GiB
Name: iZbp10z0xdiqguldb5kg9vZ
ID: 7VBE:VOCD:XSDU:FDNH:P6IB:EKL7:3J5U:CCDW:UULW:R2EF:S7ER:EENA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://pqbap4ya.mirror.aliyuncs.com/
Live Restore Enabled: true
Product License: Community Engine

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.)
AlibabaCloud

Actually, the container's process is not stopped, but dockerd/containerd cannot take it over.


Then I created a container after the restart. That container's containerd-shim arguments are not the same as the first container's.

root     22306     1  0 09:12 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/5c5530032d8327c96eb5db163830ee08b7853055d3dfc5c5fb7277da3dc2df91 -address /var/run/docker/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup
root     22782 10917  0 09:21 ?        00:00:00 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/83ee16c7ed3494d7c2850391983bcc14b053e3ea0545589700ec3c226fda282d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc -systemd-cgroup

The workdirs of the two containerd-shim processes differ:
/var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/...
/var/lib/containerd/io.containerd.runtime.v1.linux/...

@thaJeztah
Member

ping @crosbymichael @seemethere - I think this might be related to socket activation (as I recall, live-restore was the original reason for removing it from the RPM packages).

/cc @andrewhsu

@thaJeztah
Member

thaJeztah commented Jan 14, 2019

Actually ... wondering if it's a race condition between the containerd service starting and the dockerd daemon starting containerd as a child process; I had another issue mentioning that 🤔

@BSWANG
Author

BSWANG commented Jan 14, 2019

@thaJeztah Thanks. I just tested manually starting the containerd service before the dockerd service, and the issue does not happen.
It does appear to be a race condition.

@crosbymichael

I think this happens now because the docker service depends on the containerd service or has an After= on the containerd service.

@seemethere

I think that adding an After=containerd.service would solve this use case, since the docker service would then only start after containerd has successfully started.
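
Roughly, that would be a unit-file change along these lines (just a sketch of the idea, not the exact packaging change):

    [Unit]
    After=containerd.service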

@corbin-coleman

This PR should fix the issue: docker/docker-ce-packaging#290

It's going into the master branch first, so it'll appear in the next nightly release after the PR is merged in.

@buck2202

This doesn't seem to reliably fix the issue for me. I added containerd.service to the After= line in docker.service, but I'm still intermittently seeing the behavior reported above. Docker reports that the container has exited with code 255, but (some) processes formerly associated with the container are still running. When this happens to me, the entrypoint (a bash wrapper to an executable that sets paths and such) actually does die in the service restart, but the executable that it calls is still running.

I'm using the Google Cloud Ubuntu Xenial image, with Docker version 18.09.1, build 4c52b90.

@thaJeztah
Member

thaJeztah commented Jan 25, 2019

It's not yet in the 18.09.1 packages (I opened a backport in docker/docker-ce-packaging#294).

However, you can use an override file to add the change without modifying the original docker.service (which you should never do):

The easiest way to create an override file is with systemctl edit:

  1. Run systemctl edit docker.service. This creates an empty override file, and opens it in your editor;

    systemctl edit docker.service
  2. Edit the file to specify the new After (the After option can be specified multiple times, so there's no need to include what's already in the main docker.service file). The file should look something like this:

    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
  3. Reload the systemd configuration with systemctl daemon-reload

  4. Verify that the changes look OK using systemctl cat docker.service. This shows the contents of both the main docker.service unit and any override files that exist. In the output below, you can see that both files are loaded and that override.conf appends containerd.service to After=:

    # /lib/systemd/system/docker.service
    [Unit]
    ...
    BindsTo=containerd.service
    After=network-online.target firewalld.service
    
    ...
    
    
    # /etc/systemd/system/docker.service.d/override.conf
    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
  5. To confirm that the effective After looks like you want, use systemctl show; you can filter to just the property you're interested in (After in this case):

    systemctl show --property=After docker.service
    
    After=network-online.target firewalld.service system.slice docker.socket systemd-journald.socket basic.target sysinit.target containerd.service
  6. Restart the docker service to make the new change active: systemctl restart docker.service
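
If you'd rather not go through an interactive editor, a rough non-interactive equivalent is (a sketch, assuming the standard override location /etc/systemd/system/docker.service.d/override.conf):

    # Create the drop-in directory and the override file, then reload systemd and restart docker
    mkdir -p /etc/systemd/system/docker.service.d
    printf '[Unit]\nAfter=containerd.service\n' > /etc/systemd/system/docker.service.d/override.conf
    systemctl daemon-reload
    systemctl restart docker.service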

@buck2202

Ok, I did modify the docker.service file directly to match the referenced PR above, and held updates to the docker-ce package in dpkg to prevent any conflicts while I waited for the backport. To be clear, it does seem to be generally better, but once (out of tens of service restarts across multiple hosts) I saw it lose track of all containers again when the host system was under high load.

It seemed possible to me that some less-likely race condition might still exist, but I will change my approach and keep an eye on it.

@buck2202

Just confirming that I do still see this issue happen sporadically after restoring the original docker.service and creating the override as you described.

@thaJeztah
Member

Ok, so there's one more option to prevent any possible race condition.

The daemon has a --containerd option that allows setting the path of the containerd socket. If that option is set (either as a flag or through the daemon.json configuration file), the daemon disables its monitoring of containerd and, in cases where containerd is not present, "fails" instead of trying to start a containerd instance as a child process.

So, there are two approaches to configure this. Generally, the daemon.json approach is recommended, as it will also be used when manually starting the dockerd process, whereas the systemd approach only takes effect if the daemon is started through systemd.

You cannot use both methods - you need to pick one. Setting the option both as a daemon flag (--containerd) and in the daemon.json configuration file will result in a "conflicting options" error, and the daemon will refuse to start.

Option 1 - using the daemon.json configuration file:

  1. Create a /etc/docker/daemon.json file if it does not yet exist

  2. Edit the configuration file, and add the "containerd" setting

    The file must be valid JSON; if you have no other configurations set in your daemon.json, the file will look like this:

    {"containerd": "/run/containerd/containerd.sock"}
  3. Restart the docker service

    systemctl restart docker.service
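
As a quick sanity check (assuming the default socket path used above), you can confirm that the system containerd socket actually exists before relying on it:

    # Succeeds only if a socket exists at the given path
    test -S /run/containerd/containerd.sock && echo "containerd socket present"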

Option 2 - using the systemd override file:

  1. Run systemctl edit docker.service. This opens the override file in your editor;

  2. Edit the file to specify the new ExecStart under [Service]. To override the existing ExecStart, the original one first has to be "reset" (otherwise another ExecStart is added). If you made the changes from my previous comment, the file should look something like this:

    [Unit]
    # Make sure containerd is running before starting dockerd
    After=containerd.service
    
    [Service]
    # Disable managing containerd, and require the containerd socket
    # to be present at the given location
    
    # First reset the existing ExecStart
    ExecStart=
    
    # Then set the new ExecStart
    ExecStart=/usr/bin/dockerd -H fd://  --containerd=/run/containerd/containerd.sock
    
  3. Reload the systemd configuration with systemctl daemon-reload

  4. Confirm that the configuration looks as expected

    systemctl cat docker.service | grep ExecStart=
    
    ExecStart=/usr/bin/dockerd -H fd://
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd://  --containerd=/run/containerd/containerd.sock
  5. Restart the docker service to make the new change active: systemctl restart docker.service
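
To verify end to end that containers now survive a daemon restart, something along these lines should work (a sketch; the container name live-restore-test is just an example):

    docker run -d --name live-restore-test busybox sleep 3600
    systemctl restart docker.service
    # The container should still be listed as "Up"
    docker ps --filter name=live-restore-test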

@buck2202

buck2202 commented Feb 1, 2019

I pushed these changes to 20+ VMs through the systemd override file, and so far (~48h), I've had no cases of containers getting lost by docker during service restarts. The After= addition decreased the frequency of the issue, but was not sufficient on its own in my case.

@thaJeztah
Member

Thanks @buck2202

@seemethere ^^ looks like we may want to consider using the --containerd option instead.

@buck2202

buck2202 commented Feb 9, 2019

Just checking back in: there were zero issues in ~1.5 weeks of continuous use across ~30 VMs with both After= and ExecStart= configured in the systemd override.

Daemon restarts would have been occurring 1-2 times/hour on each VM, and I had previously set up scripts to log all instances where docker reported a container had exited but the container's executable was still active.
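
Conceptually, the check those scripts run is something like this (a rough sketch; the names my-container and my-app are placeholders, not my actual setup):

    # Log when docker reports the container as not running while its workload process is still alive
    if [ "$(docker inspect -f '{{.State.Running}}' my-container 2>/dev/null)" = "false" ] && pgrep -f my-app > /dev/null; then
        echo "$(date): docker lost track of my-container but my-app is still running" >> /var/log/live-restore-watch.log
    fi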

@thaJeztah
Member

Thanks for testing!

I opened docker/docker-ce-packaging#297 and docker/docker-ce-packaging#298 for consideration

@remoteweb

remoteweb commented Jun 10, 2019

Sorry if this has been covered by @buck2202 already; however, I am testing Docker 18.09.2 with an interactive bash container, and when I run systemctl restart docker, bash exits with an error message*, even though docker ps reports that the container never exited.

I have applied the patch per @thaJeztah's instructions**

We do have a rare and random issue in our production environment where the docker unix socket hangs and our orchestrator receives UnixSocket timeouts, so a docker daemon restart is required. However, we need all of our containers to keep running during this procedure.

Thanks

*ERRO[0023] error waiting for container: unexpected EOF
**#556 (comment)

@thaJeztah
Member

Actually, I'm not sure if interactive containers are supported during live-restore; I seem to recall that wasn't supported, but I don't see a mention in the docs (https://docs.docker.com/config/containers/live-restore/). Did that work with older versions?

@crosbymichael might know off the top of his head; that looks to be a separate issue though

@remoteweb

@thaJeztah, I realised the container (upon docker restart) only exits when running an interactive bash/sh command (docker exec -it).

So all main processes are still running after the restart, and there is no issue related to this ticket (my bad for this).

However, our main problem is still there: even a docker daemon restart doesn't help with the hanging API client, which is caused by a hanging container that, even after a daemon restart, fails to respond to anything (stop, kill, etc.); a very buggy situation. I'll try to find a more suitable ticket or create a new one.

@arpit-sardhana

arpit-sardhana commented Mar 24, 2020

The issue still exists; I am testing with Docker 18.09.1 on CentOS 7.7. Even after setting --containerd in docker.service as well as putting After=containerd.service in the systemd unit file, live restore is not working as expected. Is there any expected caveat? Also, in which point release is this issue fixed?

@ovidiucp

This issue is still present in the latest CE version:

Docker version 19.03.13, build 4484c46d9d

@seemethere should I open a new bug report for this, or can you reopen this one?
