
docker-compose integration tests not cleaning up on failures #5948

Closed
anchal-agrawal opened this issue Aug 8, 2017 · 11 comments
Labels: area/docker-compose, component/test (Tests not covered by a more specific component label), priority/p2, team/container

@anchal-agrawal
Contributor

Seen in build https://ci.vcna.io/vmware/vic/12992:

(screenshot attached: 2017-08-07 23:14:21)

The Compose Up with link test failed before reaching the docker-compose down step, leaving orphan containers connected to the vic_default network. This caused subsequent docker-compose down commands to fail with:

...
Removing network vic_default
vic_default has active endpoints

Proposed solutions:

  • docker-compose integration test suites use a Test Teardown Run Keyword If Test Failed step to clean up orphan containers if a test fails before the cleanup step (docker-compose down).
  • See if there's a docker-compose command option to make orphan containers attached to networks non-fatal for docker-compose down.
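A minimal sketch of the cleanup the first bullet describes, assuming the docker CLI is on PATH and the compose project prefix is vic (the function name and structure are hypothetical; the real suite would wire these commands through Robot Framework's Test Teardown / Run Keyword If Test Failed):

```python
# Hypothetical helper that builds the teardown commands for a failed
# compose test: force-remove leftover containers, then remove the network.

def compose_cleanup_commands(project="vic"):
    """Return the docker CLI commands to clean up after a failed test."""
    return [
        # Force-remove every container carrying the compose project
        # prefix (covers orphans left when a sub-test fails before
        # docker-compose down runs). -r keeps xargs quiet on no input.
        "docker ps -aq --filter name={0}_ | xargs -r docker rm -f".format(project),
        # With no containers attached, the network removal can succeed.
        "docker network rm {0}_default".format(project),
    ]

for cmd in compose_cleanup_commands():
    print(cmd)
```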
@anchal-agrawal anchal-agrawal added area/docker-compose component/test Tests not covered by a more specific component label priority/p2 labels Aug 8, 2017
@chengwang86
Contributor

chengwang86 commented Aug 29, 2017

Also seen in ci build 13443.

The last command executed before the failed sub-test was:

docker-compose -H dhcp-192-168-31-157.ci.drone.local:2376 --file basic-compose.yml stop 2>&1
${rc} = 0
${output} = Stopping vic_redis_1 ... 
Stopping vic_redis_1 ... done

The other containerVM, vic_web_1, was not stopped and it became an orphan container.

I was wondering why we didn't run docker-compose down after docker-compose stop. stop only stops the containers, while down also removes them. This might be a workaround.

However, even if we don't run docker-compose down and orphan containers exist before the failed sub-test, this failure should not occur.

@chengwang86
Contributor

I ran the following manually to create a case where we had an orphan container before docker-compose down:

docker-compose up -d
docker stop vic_web_1 (so vic_web_1 was stopped but vic_redis_1 was up)
docker-compose down

and it still worked.

If an orphan container was originally created by docker-compose, it should not cause docker-compose down to fail.

However, if it was created outside of docker-compose and connects to the network created by compose, docker-compose down will (and should) fail.

In this failure, the network vic_default is created by compose and all the orphan containers (if any) should have also been created by compose. So docker-compose down should not have failed here.

@mhagen-vmware
Contributor

Seeing this on quite a few builds now. @chengwang86 @anchal-agrawal, can you dig into this more? We need to cut this instability out of our builds ASAP, please.

@chengwang86
Contributor

Also seen in ci builds 13468 and 13471.

@chengwang86
Contributor

chengwang86 commented Aug 30, 2017

After docker-compose down, I find the following when cleaning up orphan containers:

docker -H 192.168.213.205:2376 --tls ps -a 2>&1
${output} = CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
docker -H 192.168.213.205:2376 --tls network inspect vic_default 2>&1
[
    {
        "Name": "vic_default",
        "Id": "f68a20e382fd0c0d9d7574d2d670de206f23e7b9d73907745d8d59ef14707b3b",
        "Created": "2017-08-30T16:06:21.055005304Z",
        "Scope": "",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Containers": {
            "725c4c4896113d3a6b5832245c4c15171f288286d0c7d27177b4882318c0b60b": {
                "Name": "vic_redis_1",
                "EndpointID": "725c4c4896113d3a6b5832245c4c15171f288286d0c7d27177b4882318c0b60b",
                "MacAddress": "",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

docker-compose down will call docker stop followed by docker rm to delete the containers first (e.g., Calling POST /v1.22/containers/container_id/stop?t=10 and then Calling DELETE /v1.22/containers/container_id?force=False&link=False&v=False). Then it will remove the network (vic_default in this case).

However, docker network inspect vic_default shows that vic_redis_1 is still attached as an endpoint, although docker ps -a shows no containers. Therefore, docker-compose down failed because the network still had an active endpoint.


Side note: the output of docker ps -a after docker-compose down is not always empty. Sometimes it showed vic_redis_1 or vic_web_1, which also indicates that the container was not removed successfully by docker-compose down.

@chengwang86 chengwang86 added this to the Sprint 16 Container Ops milestone Aug 30, 2017
@chengwang86 chengwang86 self-assigned this Aug 30, 2017
@chengwang86
Contributor

Here is how docker-compose down works (from the compose source):

def down(self, remove_image_type, include_volumes, remove_orphans=False):
    self.stop(one_off=OneOffFilter.include)
    self.find_orphan_containers(remove_orphans)
    self.remove_stopped(v=include_volumes, one_off=OneOffFilter.include)

    self.networks.remove()

    if include_volumes:
        self.volumes.remove()

    self.remove_images(remove_image_type)

@chengwang86
Contributor

Discussed with @hmahmood about this issue. It turns out that in the sub-test of compose kill,

Compose kill
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml up -d
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml kill redis
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml down

kill redis kills the container vic_redis_1 (Exited (0)), but sometimes vic_redis_1 still shows up in the output of docker network inspect vic_default. Therefore, docker-compose down would fail.

docker-compose sends Calling POST /v1.22/containers/4e50125ea3fa960499f411ced1d72c3221ccd555ec126b0c38baa909bf06971e/kill to the vic engine to kill vic_redis_1.
However, our containerKill() does not explicitly remove the container from the network endpoint; the vic engine unbinds the container from the network only when it receives a vSphere event indicating that the container is powered off.
Therefore, if the ESXi host is overloaded, the vSphere event might take some time to reach the portlayer, and docker network inspect would show the killed container as an active endpoint during that window.

@chengwang86
Contributor

chengwang86 commented Aug 31, 2017

From the logs of build 13471, I didn't find UnbindContainer() being called for vic_redis_1. It looks like we never received a vSphere containerPoweredOff event for vic_redis_1, so I'm afraid that event may never appear and our fix for that compose test wouldn't work.

What we expect to see is

Aug 30 2017 08:59:11.445Z DEBUG Found 2 subscribers to id: 0 - events.ContainerEvent: Container eb7993420eda177079c6ddc7f966445d09bb65bf8500a31f6d35149e8c39d237 PoweredOff
Aug 30 2017 08:59:11.445Z DEBUG Event manager calling back to PLE-605bf48ea81e635d0e0bc0222babd569d31d276ed4126e1bfeb8be5347665b44
Aug 30 2017 08:59:11.445Z DEBUG Event manager calling back to netCtx(0xc42082aea0)
... ...
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/network.(*Context).UnbindContainer:866]
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/network.(*Context).container:852]
Aug 31 2017 14:31:18.685Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/network.(*Context).container:852] [9.046µs] 
Aug 31 2017 14:31:18.685Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/network.(*Context).UnbindContainer:866] [47.718µs] 
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.Commit:35] 

where eb799... is the redis container. We got the stopped and removed events, but the PoweredOff event didn't show up in the log.

@chengwang86
Contributor

We did receive the VM PoweredOff event from vSphere.

It turns out that there is a race condition here: docker kill sends SIGKILL to the container, then docker-compose down removes the container before the portlayer receives the VM PoweredOff event from vSphere. By the time the portlayer processes the PoweredOff event, the container has already been removed from the cache, so no containerPoweredOff event is generated and UnbindContainer() is never called.
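The race described above can be modeled in a few lines. This is a hypothetical sketch (all class and method names invented here for illustration; the real logic lives in the vic portlayer, written in Go): the PoweredOff handler only unbinds containers it still knows about, so a removal that wins the race leaves the endpoint dangling.

```python
# Minimal model of the kill/down vs. PoweredOff-event race.
# Portlayer, create, remove, on_vm_powered_off are all hypothetical names.

class Portlayer:
    def __init__(self):
        self.container_cache = {}    # container id -> state
        self.network_endpoints = {}  # container id -> network name

    def create(self, cid, network):
        self.container_cache[cid] = "running"
        self.network_endpoints[cid] = network

    def remove(self, cid):
        # docker-compose down removes the container from the cache...
        self.container_cache.pop(cid, None)

    def on_vm_powered_off(self, cid):
        # ...so when the vSphere PoweredOff event finally arrives, the
        # container is unknown, no containerPoweredOff event is generated,
        # and the endpoint is never unbound.
        if cid in self.container_cache:
            self.network_endpoints.pop(cid, None)

pl = Portlayer()
pl.create("redis", "vic_default")
pl.remove("redis")             # down wins the race
pl.on_vm_powered_off("redis")  # event processed too late
print(pl.network_endpoints)    # endpoint dangles: {'redis': 'vic_default'}
```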

@chengwang86
Contributor

Some comments from @hickeng: we could either:

  • not ignore the power off and treat it as a trigger for doing garbage collection of the endpoints
  • start receiving vmdeleted events and remove the endpoint in that scenario as well

We will have a similar problem with a dangling endpoint if we unregister a containerVM - we won't ever get a power off or deleted event - so we should also consider validating any remaining endpoints during a delete operation.

@chengwang86
Contributor

The workaround:

  • Reboot the VCH if docker-compose down fails; after the reboot, we should be able to remove the network.

The fix will involve code changes in two places:

  1. Upon receiving a container removed event in the portlayer, remove the container from the portlayer network (context) cache, regardless of whether the container still exists in the portlayer container cache.
  2. When removing a network via network rm, if we find any dangling containerVMs on that network (a containerVM that exists only in the network context cache but not in the portlayer container cache), remove the container from the network context cache.
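The two-part fix can be sketched with the same kind of toy model (names hypothetical; the actual change would be in the Go portlayer): the removed-event handler purges the endpoint unconditionally, and network removal garbage-collects dangling endpoints before deciding whether any active ones remain.

```python
# Sketch of the proposed fix. Portlayer, on_container_removed, and
# network_rm are illustrative names, not the real vic API.

class Portlayer:
    def __init__(self):
        self.container_cache = {}    # container id -> state
        self.network_endpoints = {}  # container id -> network name

    def on_container_removed(self, cid):
        # Fix (1): purge the network endpoint on the removed event,
        # whether or not the container is still in the portlayer cache.
        self.container_cache.pop(cid, None)
        self.network_endpoints.pop(cid, None)

    def network_rm(self, name):
        # Fix (2): before removing the network, drop dangling endpoints
        # (present in the network context cache but not in the container
        # cache), then fail only on genuinely active endpoints.
        attached = [c for c, n in self.network_endpoints.items() if n == name]
        for cid in attached:
            if cid not in self.container_cache:
                self.network_endpoints.pop(cid)
        if any(n == name for n in self.network_endpoints.values()):
            raise RuntimeError("%s has active endpoints" % name)

pl = Portlayer()
pl.network_endpoints["redis"] = "vic_default"  # dangling endpoint
pl.network_rm("vic_default")                   # succeeds after GC
print(pl.network_endpoints)                    # {}
```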
