What is the problem you are trying to solve?
There is a bug from opensearch-build that causes the OpenSearch container to occasionally fail as soon as it has booted up. This has caused integration tests on the JavaScript client to fail intermittently, and I've been informed that other repos' workflows are also facing this issue.
Even though the chance of the OpenSearch container crashing is only about 1 in 50 (per my benchmarks running thousands of such jobs), the chance of this bug failing a workflow is quite high when you run compatibility tests that stand up a few dozen OpenSearch instances. This has caused every other push/pull request to fail the Action check and requires an admin to rerun the failed jobs. This is not a good experience for contributors or admins.
What else have you found out about this problem?
I've only seen this happen on workflows that use docker-compose up. Workflows that run gradlew and docker run do not seem to be affected. [Verification needed]
This issue only happens AFTER the container has started successfully, so you won't see any errors from Docker during docker-compose.
This bug can either crash the container or terminate the OpenSearch service within the container. In the latter case, the container will still appear to be running just fine.
When this happens, you will see the following messages in the container's logs:
```
Killing opensearch process 10
Killing performance analyzer process 11
```
What are you proposing?
Simply restarting the container will bring the OpenSearch service back to life. So, there are a couple of workarounds we can apply to these flaky workflows:
1. Grep for the Killing message:
Run the following script after the container is stood up (you can add this to the Makefile after docker-compose up):
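A minimal sketch of such a script (the container name opensearch_opensearch_1 and the 30-second wait are assumptions to adjust for your setup):

```sh
#!/bin/sh
# Sketch: restart the container if the OpenSearch service was killed right after startup.
# Assumes the compose-generated container name "opensearch_opensearch_1".
if docker logs opensearch_opensearch_1 2>&1 | grep -q "Killing opensearch process"; then
  echo "OpenSearch process was killed after startup; restarting the container..."
  docker restart opensearch_opensearch_1
  sleep 30  # assumption: give OpenSearch time to come back up before the tests run
fi
```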
This is a quick and dirty workaround. You can just copy-paste this script into your workflow step or Makefile (after replacing opensearch_opensearch_1 with your container's name, of course), and it will just work.
2. Autoheal + Auto Restart:
When this bug crashes the container, we can use Docker's auto-restart feature to bring it back up. Add restart: always to the service definition in your docker-compose file:
```yaml
services:
  opensearch:
    restart: always
```
When this bug crashes the OpenSearch service but leaves the container running, we can use a combination of Healthcheck and Autoheal to restart the container when it's unhealthy:
Add a HEALTHCHECK to the Dockerfile. For example:
```dockerfile
HEALTHCHECK --start-period=20s --interval=5s --retries=2 --timeout=1s \
  CMD if [ "$SECURE_INTEGRATION" != "true" ]; \
      then curl --fail localhost:9200/_cat/health; \
      else curl --fail -k https://localhost:9200/_cat/health -u admin:admin; fi
```
Note that the HEALTHCHECK CMD is specific to your OpenSearch instance, so you will have to figure it out for your own workflow. It also takes some time for the container to be ready for the first health check, so I'd recommend a --start-period of at least 20 seconds. If you're using any env vars in your CMD, remember to define them in services.<service_name>.environment instead of services.<service_name>.build.args in the docker-compose file, as the command is NOT run during docker build.
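For illustration, a sketch of where such a variable belongs in docker-compose.yml (SECURE_INTEGRATION comes from the health check above; the build context is hypothetical):

```yaml
services:
  opensearch:
    build:
      context: .
      args:
        - SECURE_INTEGRATION=${SECURE_INTEGRATION}  # build-time only; NOT visible to the HEALTHCHECK CMD
    environment:
      - SECURE_INTEGRATION=${SECURE_INTEGRATION}    # runtime; this is what the HEALTHCHECK CMD sees
```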
Add Autoheal as a service running alongside the OpenSearch container. For example, to define Autoheal in docker-compose.yml:
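A sketch of what that service definition might look like (willfarrell/autoheal is the commonly used Autoheal image; the 60-second values are assumptions to tune for your workflow):

```yaml
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all   # watch all containers, not just labeled ones
      - AUTOHEAL_START_PERIOD=60       # assumption: longer than the time to the first possible unhealthy report
      - AUTOHEAL_DEFAULT_STOP_TIMEOUT=60
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # Autoheal needs the Docker socket to restart containers
```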
In this example, I use AUTOHEAL_CONTAINER_LABEL=all, which means Autoheal will try to restart all unhealthy containers instead of only those with services.<service_name>.labels.autoheal=true. I opted not to use the label feature of Autoheal because I couldn't make it work consistently on my Ubuntu workflows (it works fine on my local Mac env). Also note that AUTOHEAL_START_PERIOD and AUTOHEAL_DEFAULT_STOP_TIMEOUT should be greater than the time it takes for the first possible unhealthy status to be reported.
Add ample sleep time after the OpenSearch and Autoheal containers are stood up so that the OpenSearch container can be restarted by Autoheal at least once, to avoid a race condition between these two containers and your tests. For this example, I'd use sleep 60, as shown below.
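For example, in the workflow step or Makefile target that stands the stack up (a sketch; running compose detached is an assumption):

```sh
docker-compose up -d   # start the OpenSearch and Autoheal services in the background
sleep 60               # leave time for Autoheal to restart an unhealthy OpenSearch container before tests begin
```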
This workaround is more involved, but it has the added benefit of also solving other kinds of intermittent failures, not just the one caused by said bug.
I benchmarked both solutions on over 700 jobs each, and they all passed.
@nhtruong I really think we're wasting our time trying to retry restarting the containers, we should fix the root cause - want to try writing a matrix job that runs enough containers in a loop/parallel to reproduce this semi-consistently and collect logs from the opensearch instance that doesn't start? there's an error in there I'm almost sure
@nhtruong So we like opensearch-project/opensearch-js#304? Let's document how to do that everywhere else? Can we reuse some of those GH workflows? Do we need a doc on integration testing?