What is the problem you are trying to solve?
There is a bug from opensearch-build that causes the OpenSearch container to occasionally fail as soon as it has booted up. This has caused integration tests on the JavaScript client to fail intermittently, and I've been informed that other repos' workflows are also facing this issue.
Even though the chance of the OpenSearch container crashing is only about 1 in 50 (per my benchmarks running thousands of such jobs), the chance of this bug failing a workflow is quite high when you run compatibility tests that stand up a few dozen OpenSearch instances. This has caused every other push/pull request to fail the Action check and requires an admin to rerun the failed jobs. This is not a good experience for contributors or admins.
What else have you found out about this problem?
I've only seen this happen on workflows that use docker-compose up. Workflows that run gradlew and docker run do not seem to be affected. [Verification needed]
This issue only happens AFTER the container has started successfully, so you won't see any errors from Docker during docker-compose.
This bug can either crash the container or terminate the OpenSearch service within the container. In the latter case, the container will still appear to be running just fine.
When this happens, you will see the following messages in the container's logs:
```
Killing opensearch process 10
Killing performance analyzer process 11
```
What are you proposing?
Simply restarting the container will bring the OpenSearch service back to life. So, there are a couple of workarounds we can apply to these flaky workflows:
1. Grep for the Killing message:
Run the following script after the container is stood up (you can add this to the Makefile after docker-compose up):
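A minimal sketch of such a script (the container name opensearch_opensearch_1 and the 30-second wait are assumptions to adjust for your setup):

```sh
#!/bin/sh
# Sketch: restart the container if the OpenSearch service was killed right after startup.
# Assumes the compose-generated container name "opensearch_opensearch_1".
if docker logs opensearch_opensearch_1 2>&1 | grep -q "Killing opensearch process"; then
  echo "OpenSearch process was killed after startup; restarting the container..."
  docker restart opensearch_opensearch_1
  sleep 30  # assumption: give OpenSearch time to come back up before the tests run
fi
```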
This is a quick and dirty workaround. You can just copy-paste this script into your workflow step or Makefile (after replacing opensearch_opensearch_1 with your container's name, of course), and it will just work.
2. Autoheal + Auto Restart:
When this bug crashes the container, we can use Docker's auto-restart feature to bring it back up. Add restart: always to the service definition in your docker-compose file:
```yaml
services:
  opensearch:
    restart: always
```
When this bug crashes the OpenSearch service but leaves the container running, we can use a combination of Healthcheck and Autoheal to restart the container when it's unhealthy:
Add a HEALTHCHECK to the Dockerfile. For example:
```dockerfile
HEALTHCHECK --start-period=20s --interval=5s --retries=2 --timeout=1s \
  CMD if [ "$SECURE_INTEGRATION" != "true" ]; \
      then curl --fail localhost:9200/_cat/health; \
      else curl --fail -k https://localhost:9200/_cat/health -u admin:admin; fi
```
Note that the HEALTHCHECK CMD is specific to your OpenSearch instance, so you will have to figure it out for your own workflow. It also takes some time for the container to be ready for the first health check, so I'd recommend a --start-period of at least 20 seconds. If you're using any env vars in your CMD, remember to define them in services.<service_name>.environment instead of services.<service_name>.build.args in the docker-compose file, as the command is NOT run during docker build.
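For illustration, a sketch of where such a variable belongs in docker-compose.yml (SECURE_INTEGRATION comes from the health check above; the build context is hypothetical):

```yaml
services:
  opensearch:
    build:
      context: .
      args:
        - SECURE_INTEGRATION=${SECURE_INTEGRATION}  # build-time only; NOT visible to the HEALTHCHECK CMD
    environment:
      - SECURE_INTEGRATION=${SECURE_INTEGRATION}    # runtime; this is what the HEALTHCHECK CMD sees
```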
Add Autoheal as a service running alongside the OpenSearch container. For example, to define Autoheal in docker-compose.yml:
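A sketch of what that service definition might look like (willfarrell/autoheal is the commonly used Autoheal image; the 60-second values are assumptions to tune for your workflow):

```yaml
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all   # watch all containers, not just labeled ones
      - AUTOHEAL_START_PERIOD=60       # assumption: longer than the time to the first possible unhealthy report
      - AUTOHEAL_DEFAULT_STOP_TIMEOUT=60
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # Autoheal needs the Docker socket to restart containers
```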
In this example, I use AUTOHEAL_CONTAINER_LABEL=all, which means Autoheal will try to restart all unhealthy containers instead of only those with services.<service_name>.labels.autoheal=true. I opted not to use the label feature of Autoheal because I couldn't make it work consistently on my Ubuntu workflows (it works fine on my local Mac env). Also note that AUTOHEAL_START_PERIOD and AUTOHEAL_DEFAULT_STOP_TIMEOUT should be greater than the time it takes for the first possible unhealthy status to be reported.
Add ample sleep time after the OpenSearch and Autoheal containers are stood up so that the OpenSearch container can be restarted by Autoheal at least once, to avoid a race condition between these two containers and your tests. For this example, I'd use sleep 60, as shown below.
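For example, in the workflow step or Makefile target that stands the stack up (a sketch; running compose detached is an assumption):

```sh
docker-compose up -d   # start the OpenSearch and Autoheal services in the background
sleep 60               # leave time for Autoheal to restart an unhealthy OpenSearch container before tests begin
```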
This workaround is more involved, but it has the added benefit of also solving other kinds of intermittent failures, not just the one caused by said bug.
I benchmarked both solutions on over 700 jobs each, and they all passed.
@nhtruong I really think we're wasting our time trying to retry restarting the containers, we should fix the root cause - want to try writing a matrix job that runs enough containers in a loop/parallel to reproduce this semi-consistently and collect logs from the opensearch instance that doesn't start? there's an error in there I'm almost sure
@nhtruong So we like opensearch-project/opensearch-js#304? Let's document how to do that everywhere else? Can we reuse some of those GH workflows? Do we need a doc on integration testing?