
docker-compose integration tests not cleaning up on failures #5948

Closed
anchal-agrawal opened this issue Aug 8, 2017 · 11 comments
Labels: area/docker-compose, component/test (Tests not covered by a more specific component label), priority/p2, team/container

@anchal-agrawal
Contributor

Seen in build https://ci.vcna.io/vmware/vic/12992:

(screenshot attached: 2017-08-07 23:14:21)

The Compose Up with link test failed before reaching the docker-compose down step, leaving orphan containers connected to the vic_default network. This caused subsequent docker-compose down commands to fail with:

...
Removing network vic_default
vic_default has active endpoints

Proposed solutions:

  • docker-compose integration test suites use a Test Teardown Run Keyword If Test Failed step to clean up orphan containers if a test fails before the cleanup step (docker-compose down).
  • See if there's a docker-compose command option to make orphan containers attached to networks non-fatal for docker-compose down.
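A minimal sketch of the cleanup the first bullet describes, assuming the docker CLI is on PATH and the compose project prefix is vic (the function name and structure are hypothetical; the real suite would wire these commands through Robot Framework's Test Teardown / Run Keyword If Test Failed):

```python
# Hypothetical helper that builds the teardown commands for a failed
# compose test: force-remove leftover containers, then remove the network.

def compose_cleanup_commands(project="vic"):
    """Return the docker CLI commands to clean up after a failed test."""
    return [
        # Force-remove every container carrying the compose project
        # prefix (covers orphans left when a sub-test fails before
        # docker-compose down runs). -r keeps xargs quiet on no input.
        "docker ps -aq --filter name={0}_ | xargs -r docker rm -f".format(project),
        # With no containers attached, the network removal can succeed.
        "docker network rm {0}_default".format(project),
    ]

for cmd in compose_cleanup_commands():
    print(cmd)
```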
@anchal-agrawal anchal-agrawal added area/docker-compose component/test Tests not covered by a more specific component label priority/p2 labels Aug 8, 2017
@chengwang86
Contributor

chengwang86 commented Aug 29, 2017

Also seen in ci build 13443.

The last command executed before the failed sub-test was:

docker-compose -H dhcp-192-168-31-157.ci.drone.local:2376 --file basic-compose.yml stop 2>&1
${rc} = 0
${output} = Stopping vic_redis_1 ... 
Stopping vic_redis_1 ... done

The other containerVM, vic_web_1, was not stopped and it became an orphan container.

I was wondering why we didn't run docker-compose down after docker-compose stop. stop only stops the containers, while down also removes them. This might be a workaround.

However, even if we don't run docker-compose down and orphan containers exist before the failed sub-test, this failure should not occur.

@chengwang86
Contributor

I ran the following manually to create a case where we had an orphan container before docker-compose down:

docker-compose up -d
docker stop vic_web_1 (so vic_web_1 was stopped but vic_redis_1 was up)
docker-compose down

and it still worked.

If an orphan container was originally created by docker-compose, it should not cause docker-compose down to fail.

However, if it was created outside of docker-compose and connects to the network created by compose, docker-compose down will (and should) fail.

In this failure, the network vic_default is created by compose and all the orphan containers (if any) should have also been created by compose. So docker-compose down should not have failed here.

@mhagen-vmware
Contributor

Seeing this on quite a few builds now. @chengwang86 @anchal-agrawal, can you dig into this more? We need to cut this instability out of our builds ASAP, please.

@chengwang86
Contributor

Also seen in ci builds 13468 and 13471.

@chengwang86
Contributor

chengwang86 commented Aug 30, 2017

After docker-compose down, I find the following when cleaning up orphan containers:

docker -H 192.168.213.205:2376 --tls ps -a 2>&1
${output} = CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
docker -H 192.168.213.205:2376 --tls network inspect vic_default 2>&1
[
    {
        "Name": "vic_default",
        "Id": "f68a20e382fd0c0d9d7574d2d670de206f23e7b9d73907745d8d59ef14707b3b",
        "Created": "2017-08-30T16:06:21.055005304Z",
        "Scope": "",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Containers": {
            "725c4c4896113d3a6b5832245c4c15171f288286d0c7d27177b4882318c0b60b": {
                "Name": "vic_redis_1",
                "EndpointID": "725c4c4896113d3a6b5832245c4c15171f288286d0c7d27177b4882318c0b60b",
                "MacAddress": "",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

docker-compose down will call docker stop followed by docker rm to delete the containers first (e.g., Calling POST /v1.22/containers/container_id/stop?t=10 and then Calling DELETE /v1.22/containers/container_id?force=False&link=False&v=False). Then it will remove the network (vic_default in this case).

However, docker network inspect vic_default shows that vic_redis_1 is still attached as an endpoint, although docker ps -a shows no containers. Therefore, docker-compose down failed because the network still had an active endpoint.


Side note: the output of docker ps -a after docker-compose down is not always empty. Sometimes it showed vic_redis_1 or vic_web_1, which also indicates that the container was not removed successfully by docker-compose down.

@chengwang86 chengwang86 added this to the Sprint 16 Container Ops milestone Aug 30, 2017
@chengwang86 chengwang86 self-assigned this Aug 30, 2017
@chengwang86
Contributor

Here is how docker-compose down works (from the compose source):

def down(self, remove_image_type, include_volumes, remove_orphans=False):
    self.stop(one_off=OneOffFilter.include)
    self.find_orphan_containers(remove_orphans)
    self.remove_stopped(v=include_volumes, one_off=OneOffFilter.include)

    self.networks.remove()

    if include_volumes:
        self.volumes.remove()

    self.remove_images(remove_image_type)

@chengwang86
Contributor

Discussed with @hmahmood about this issue. It turns out that in the sub-test of compose kill,

Compose kill
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml up -d
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml kill redis
    docker-compose %{COMPOSE-PARAMS} -f basic-compose.yml down

kill redis kills the container vic_redis_1 (Exited (0)), but sometimes vic_redis_1 still shows up in the output of docker network inspect vic_default. Therefore, docker-compose down would fail.

docker-compose sends Calling POST /v1.22/containers/4e50125ea3fa960499f411ced1d72c3221ccd555ec126b0c38baa909bf06971e/kill to the vic engine to kill vic_redis_1.
However, our containerKill() does not explicitly remove the container from the network endpoint; the vic engine unbinds the container from the network only when it receives a vSphere event indicating that the container is powered off.
Therefore, if the ESXi host is overloaded, the vSphere event might take some time to reach the portlayer, and docker network inspect would show the killed container as an active endpoint during that window.

@chengwang86
Contributor

chengwang86 commented Aug 31, 2017

From the logs of build 13471, I didn't find UnbindContainer() being called for vic_redis_1. It looks like we never received a vSphere containerPoweredOff event for vic_redis_1, so I'm afraid that event may never appear and our fix for that compose test wouldn't work.

What we expect to see is

Aug 30 2017 08:59:11.445Z DEBUG Found 2 subscribers to id: 0 - events.ContainerEvent: Container eb7993420eda177079c6ddc7f966445d09bb65bf8500a31f6d35149e8c39d237 PoweredOff
Aug 30 2017 08:59:11.445Z DEBUG Event manager calling back to PLE-605bf48ea81e635d0e0bc0222babd569d31d276ed4126e1bfeb8be5347665b44
Aug 30 2017 08:59:11.445Z DEBUG Event manager calling back to netCtx(0xc42082aea0)
... ...
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/network.(*Context).UnbindContainer:866]
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/network.(*Context).container:852]
Aug 31 2017 14:31:18.685Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/network.(*Context).container:852] [9.046µs] 
Aug 31 2017 14:31:18.685Z DEBUG [ END ] [github.com/vmware/vic/lib/portlayer/network.(*Context).UnbindContainer:866] [47.718µs] 
Aug 31 2017 14:31:18.685Z DEBUG [BEGIN] [github.com/vmware/vic/lib/portlayer/exec.Commit:35] 

where eb799... is the redis container. We got the stopped and removed events, but the PoweredOff event didn't show up in the log.

@chengwang86
Contributor

We did receive the VM PoweredOff event from vSphere.

It turns out that there is a race condition here: docker kill sends SIGKILL to the container, then docker-compose down removes the container before the portlayer receives the VM PoweredOff event from vSphere. By the time the portlayer processes the PoweredOff event, the container has already been removed from the cache, so no containerPoweredOff event is generated and UnbindContainer() is never called.
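The race described above can be modeled in a few lines. This is a hypothetical sketch (all class and method names invented here for illustration; the real logic lives in the vic portlayer, written in Go): the PoweredOff handler only unbinds containers it still knows about, so a removal that wins the race leaves the endpoint dangling.

```python
# Minimal model of the kill/down vs. PoweredOff-event race.
# Portlayer, create, remove, on_vm_powered_off are all hypothetical names.

class Portlayer:
    def __init__(self):
        self.container_cache = {}    # container id -> state
        self.network_endpoints = {}  # container id -> network name

    def create(self, cid, network):
        self.container_cache[cid] = "running"
        self.network_endpoints[cid] = network

    def remove(self, cid):
        # docker-compose down removes the container from the cache...
        self.container_cache.pop(cid, None)

    def on_vm_powered_off(self, cid):
        # ...so when the vSphere PoweredOff event finally arrives, the
        # container is unknown, no containerPoweredOff event is generated,
        # and the endpoint is never unbound.
        if cid in self.container_cache:
            self.network_endpoints.pop(cid, None)

pl = Portlayer()
pl.create("redis", "vic_default")
pl.remove("redis")             # down wins the race
pl.on_vm_powered_off("redis")  # event processed too late
print(pl.network_endpoints)    # endpoint dangles: {'redis': 'vic_default'}
```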

@chengwang86
Contributor

Some comments from @hickeng: we could either:

  • not ignore the power off and treat it as a trigger for doing garbage collection of the endpoints
  • start receiving vmdeleted events and remove the endpoint in that scenario as well

We will have a similar problem with a dangling endpoint if we unregister a containerVM - we won't ever get a power off or deleted event - so we should also consider validating any remaining endpoints during a delete operation.

@chengwang86
Contributor

The workaround:

  • Reboot the VCH if docker-compose down fails; after the reboot, we should be able to remove the network.

The fix will involve code changes in two places:

  1. Upon receiving a container removed event in the portlayer, remove the container from the portlayer network (context) cache, regardless of whether the container still exists in the portlayer container cache.
  2. When removing a network via network rm, if we find any dangling containerVMs on that network (a containerVM that exists only in the network context cache but not in the portlayer container cache), remove the container from the network context cache.
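The two-part fix can be sketched with the same kind of toy model (names hypothetical; the actual change would be in the Go portlayer): the removed-event handler purges the endpoint unconditionally, and network removal garbage-collects dangling endpoints before deciding whether any active ones remain.

```python
# Sketch of the proposed fix. Portlayer, on_container_removed, and
# network_rm are illustrative names, not the real vic API.

class Portlayer:
    def __init__(self):
        self.container_cache = {}    # container id -> state
        self.network_endpoints = {}  # container id -> network name

    def on_container_removed(self, cid):
        # Fix (1): purge the network endpoint on the removed event,
        # whether or not the container is still in the portlayer cache.
        self.container_cache.pop(cid, None)
        self.network_endpoints.pop(cid, None)

    def network_rm(self, name):
        # Fix (2): before removing the network, drop dangling endpoints
        # (present in the network context cache but not in the container
        # cache), then fail only on genuinely active endpoints.
        attached = [c for c, n in self.network_endpoints.items() if n == name]
        for cid in attached:
            if cid not in self.container_cache:
                self.network_endpoints.pop(cid)
        if any(n == name for n in self.network_endpoints.values()):
            raise RuntimeError("%s has active endpoints" % name)

pl = Portlayer()
pl.network_endpoints["redis"] = "vic_default"  # dangling endpoint
pl.network_rm("vic_default")                   # succeeds after GC
print(pl.network_endpoints)                    # {}
```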
